On 07.09.2015 13:54, Dan Kenigsberg wrote:
On Mon, Sep 07, 2015 at 11:47:48AM +0200, Patrick Hurrelmann wrote:
> On 06.09.2015 11:30, Dan Kenigsberg wrote:
>> On Fri, Sep 04, 2015 at 10:26:39AM +0200, Patrick Hurrelmann wrote:
>>> Hi all,
>>>
>>> I just updated my existing oVirt 3.5.3 installation (iSCSI hosted-engine on
>>> CentOS 7.1). The engine update went fine. Updating the hosts succeeds until the
>>> first reboot. After a reboot the host does not come up again. It is missing all
>>> network configuration. All network cfgs in /etc/sysconfig/network-scripts are
>>> missing except ifcfg-lo. The host boots up without working networking. Using
>>> IPMI and config backups, I was able to restore the lost network configs. Once
>>> these are restored and the host is rebooted again, all seems to be back to good.
>>> This has now happened to 2 updated hosts (this installation has a total of 4
>>> hosts, so 2 more to debug/try). I'm happy to assist in further debugging.
>>>
>>> Before updating the second host, I gathered some information. All these hosts
>>> have 3 physical nics. One is used for the ovirtmgmt bridge and the other 2 are
>>> used for iSCSI storage vlans.
>>>
>>> ifcfgs before update:
>>>
>>> /etc/sysconfig/network-scripts/ifcfg-em1
>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>> DEVICE=em1
>>> HWADDR=d0:67:e5:f0:e5:c6
>>> BRIDGE=ovirtmgmt
>>> ONBOOT=yes
>>> NM_CONTROLLED=no
>>>
>>> /etc/sysconfig/network-scripts/ifcfg-lo
>>> DEVICE=lo
>>> IPADDR=127.0.0.1
>>> NETMASK=255.0.0.0
>>> NETWORK=127.0.0.0
>>> # If you're having problems with gated making 127.0.0.0/8 a martian,
>>> # you can change this to something else (255.255.255.255, for example)
>>> BROADCAST=127.255.255.255
>>> ONBOOT=yes
>>> NAME=loopback
>>>
>>> /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>> DEVICE=ovirtmgmt
>>> TYPE=Bridge
>>> DELAY=0
>>> STP=off
>>> ONBOOT=yes
>>> IPADDR=1.2.3.16
>>> NETMASK=255.255.255.0
>>> GATEWAY=1.2.3.11
>>> BOOTPROTO=none
>>> DEFROUTE=yes
>>> NM_CONTROLLED=no
>>> HOTPLUG=no
>>>
>>> /etc/sysconfig/network-scripts/ifcfg-p4p1
>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>> DEVICE=p4p1
>>> HWADDR=68:05:ca:01:bc:0c
>>> ONBOOT=no
>>> IPADDR=4.5.7.102
>>> NETMASK=255.255.255.0
>>> BOOTPROTO=none
>>> MTU=9000
>>> DEFROUTE=no
>>> NM_CONTROLLED=no
>>>
>>> /etc/sysconfig/network-scripts/ifcfg-p3p1
>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>> DEVICE=p3p1
>>> HWADDR=68:05:ca:18:86:45
>>> ONBOOT=no
>>> IPADDR=4.5.6.102
>>> NETMASK=255.255.255.0
>>> BOOTPROTO=none
>>> MTU=9000
>>> DEFROUTE=no
>>> NM_CONTROLLED=no
>>>
>>>
>>> ip link before update:
>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>> 2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT
>>> link/ether 46:50:22:7a:f3:9d brd ff:ff:ff:ff:ff:ff
>>> 3: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT qlen 1000
>>> link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>>> 4: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>>> link/ether 68:05:ca:18:86:45 brd ff:ff:ff:ff:ff:ff
>>> 5: p4p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>>> link/ether 68:05:ca:01:bc:0c brd ff:ff:ff:ff:ff:ff
>>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
>>> link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>>> 8: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT
>>> link/ether ce:0f:16:49:a7:da brd ff:ff:ff:ff:ff:ff
>>>
>>> vdsm files before update:
>>> /var/lib/vdsm
>>> /var/lib/vdsm/bonding-defaults.json
>>> /var/lib/vdsm/netconfback
>>> /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
>>> /var/lib/vdsm/netconfback/ifcfg-em1
>>> /var/lib/vdsm/netconfback/route-ovirtmgmt
>>> /var/lib/vdsm/netconfback/rule-ovirtmgmt
>>> /var/lib/vdsm/netconfback/ifcfg-p4p1
>>> /var/lib/vdsm/netconfback/ifcfg-p3p1
>>> /var/lib/vdsm/persistence
>>> /var/lib/vdsm/persistence/netconf
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>>> /var/lib/vdsm/upgrade
>>> /var/lib/vdsm/upgrade/upgrade-unified-persistence
>>> /var/lib/vdsm/transient
>>>
>>>
>>> The files in /var/lib/vdsm/netconfback each contained only a comment:
>>> # original file did not exist
>> This is quite peculiar. Do you know when these were created?
>> Have you made any networking changes on 3.5.3 just before boot?
>>
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>>> {"nic": "em1", "netmask": "255.255.255.0", "bootproto": "none", "ipaddr": "1.2.3.16", "gateway": "1.2.3.11"}
>>>
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>>> {"nic": "p3p1", "netmask": "255.255.255.0", "ipaddr": "4.5.6.102", "bridged": "false", "mtu": "9000"}
>>>
>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>>> {"nic": "p4p1", "netmask": "255.255.255.0", "ipaddr": "4.5.7.102", "bridged": "false", "mtu": "9000"}
>>>
>>>
>>> After update and reboot, no ifcfg scripts are left. Only interface lo is up.
>>> Syslog does not seem to contain anything suspicious before the reboot.
>> Have you tweaked vdsm.conf in any way? In particular did you set
>> net_persistence?
>>
>>> Log excerpts from bootup:
>>>
>>> Sep 3 17:27:23 vhm-prd-02 network: Bringing up loopback interface: [ OK ]
>>> Sep 3 17:27:23 vhm-prd-02 systemd-ovirt-ha-agent: Starting ovirt-ha-agent: [ OK ]
>>> Sep 3 17:27:23 vhm-prd-02 systemd: Started oVirt Hosted Engine High Availability Monitoring Agent.
>>> Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready
>>> Sep 3 17:27:23 vhm-prd-02 kernel: device em1 entered promiscuous mode
>>> Sep 3 17:27:23 vhm-prd-02 network: Bringing up interface em1: [ OK ]
>>> Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
>>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Joining mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: New relevant interface ovirtmgmt.IPv4 for mDNS.
>>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Registering new address record for 1.2.3.16 on ovirtmgmt.IPv4.
>>> Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Link is up at 1000 Mbps, full duplex
>>> Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Flow control is off for TX and off for RX
>>> Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
>>> Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>>> Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>>> Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ovirtmgmt: link becomes ready
>>> Sep 3 17:27:26 vhm-prd-02 network: Bringing up interface ovirtmgmt: [ OK ]
>>> Sep 3 17:27:26 vhm-prd-02 systemd: Started LSB: Bring up/down networking.
>>> Sep 3 17:27:26 vhm-prd-02 systemd: Starting Network.
>>> Sep 3 17:27:26 vhm-prd-02 systemd: Reached target Network.
>>>
>>> So ovirtmgmt and em1 were restored and initialized just fine (p3p1 and p4p1
>>> should have been started, too, but engine configured them as ONBOOT=no).
>>>
>>> Further in messages (full log is attached):
>> Would you also attach your post-boot supervdsm.log?
>>
>>> Sep 3 17:27:26 vhm-prd-02 systemd: Starting Virtual Desktop Server Manager network restoration...
>>> Sep 3 17:27:26 vhm-prd-02 systemd: Started OSAD daemon.
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Terminate Plymouth Boot Screen.
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Wait for Plymouth Boot Screen to Quit.
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Serial Getty on ttyS1...
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Serial Getty on ttyS1.
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Getty on tty1...
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Getty on tty1.
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Login Prompts.
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Reached target Login Prompts.
>>> Sep 3 17:27:27 vhm-prd-02 iscsid: iSCSI daemon with pid=1300 started!
>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.*.
>>> Sep 3 17:27:27 vhm-prd-02 kdumpctl: kexec: loaded kdump kernel
>>> Sep 3 17:27:27 vhm-prd-02 kdumpctl: Starting kdump: [OK]
>>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Crash recovery kernel arming.
>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on em1.*.
>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for 1.2.3.16 on ovirtmgmt.
>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Leaving mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Interface ovirtmgmt.IPv4 no longer relevant for mDNS.
>>> Sep 3 17:27:27 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.
>>> Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on em1.
>>> Sep 3 17:27:28 vhm-prd-02 kernel: device em1 left promiscuous mode
>>> Sep 3 17:27:28 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>>> Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing workstation service for ovirtmgmt.
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 345, in <module>
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: restore(args)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 314, in restore
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: unified_restoration()
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 93, in unified_restoration
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: setupNetworks(nets, bonds, connectivityCheck=False, _inRollback=True)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 642, in setupNetworks
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: implicitBonding=False, _netinfo=_netinfo)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 213, in wrapped
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: ret = func(**attrs)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 429, in delNetwork
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: netEnt.remove()
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/models.py", line 100, in remove
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configurator.removeNic(self)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 215, in removeNic
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configApplier.removeNic(nic.name)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 657, in removeNic
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: with open(cf) as nicFile:
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: IOError: [Errno 2] No such file or directory: u'/etc/sysconfig/network-scripts/ifcfg-p4p1'
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: return tool_command[cmd]["command"](*args)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 40, in restore_command
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: exec_restore(cmd)
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 53, in exec_restore
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: raise EnvironmentError('Failed to restore the persisted networks')
>>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: EnvironmentError: Failed to restore the persisted networks
>>> Sep 3 17:27:28 vhm-prd-02 systemd: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
>>> Sep 3 17:27:28 vhm-prd-02 systemd: Failed to start Virtual Desktop Server Manager network restoration.
>>> Sep 3 17:27:28 vhm-prd-02 systemd: Dependency failed for Virtual Desktop Server Manager.
>>> Sep 3 17:27:28 vhm-prd-02 systemd:
>>> Sep 3 17:27:28 vhm-prd-02 systemd: Unit vdsm-network.service entered failed state.
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Postfix Mail Transport Agent.
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Multi-User System.
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Reached target Multi-User System.
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Update UTMP about System Runlevel Changes...
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Update UTMP about System Runlevel Changes.
>>> Sep 3 17:27:33 vhm-prd-02 systemd: Startup finished in 2.964s (kernel) + 2.507s (initrd) + 15.996s (userspace) = 21.468s.
>>>
>>> So, as I have two more hosts that need updating, I'm happy to assist in
>>> bisecting and debugging this update issue. Suggestions and help are very
>>> welcome.
>> Thanks for this important report. I assume that calling
>>
>> vdsClient -s 0 setSafeNetworkConfig
>>
>> on the host before upgrade would make your problems go away. Please do
>> not do that yet; your assistance in debugging this further is
>> important.
> Hi Dan,
>
> From backups I could extract the pre-update timestamps of the files in
> /var/lib/vdsm/netconfback:
> ifcfg-em1 2015-08-10 16:40:19
> ifcfg-ovirtmgmt 2015-08-10 16:40:19
> ifcfg-p3p1 2015-08-10 16:40:25
> ifcfg-p4p1 2015-08-10 16:40:22
> route-ovirtmgmt 2015-08-10 16:40:20
> rule-ovirtmgmt 2015-08-10 16:40:20
>
> The ifcfg-scripts had the same corresponding timestamps:
> ifcfg-em1 2015-08-10 16:40:19
> ifcfg-lo 2015-01-15 09:57:03
> ifcfg-ovirtmgmt 2015-08-10 16:40:19
> ifcfg-p3p1 2015-08-10 16:40:25
> ifcfg-p4p1 2015-08-10 16:40:22
Do you recall what has been done on 2015-08-10?
Was your 3.5.3 host rebooted ever since?
I just tried to reconstruct the happenings on 2015-08-10, and it seems that in
fact the network configuration was not touched. I was misled by the dates. At
that date/time an updated kernel and some more CentOS rpms were installed (the
whole cluster was updated one by one). A reboot of this specific host was
initiated after the update at 2015-08-10 16:40:04. The timestamps from my
previous email still seem to be _within_ the bootup process. So yes, the host
has been rebooted since the update to 3.5.3 (which happened on 2015-06-15).
Reboots since 2015-06-15:
reboot system boot 3.10.0-229.11.1. Mon Aug 10 16:56 - 14:34 (27+21:37)
reboot system boot 3.10.0-229.7.2.e Mon Jul 27 17:48 - 16:53 (13+23:05)
reboot system boot 3.10.0-229.7.2.e Wed Jun 24 16:46 - 17:46 (33+00:59)
reboot system boot 3.10.0-229.4.2.e Mon Jun 15 18:10 - 16:44 (8+22:34)
I checked the 2 remaining hosts (still 3.5.3) and neither has any different
content in /var/lib/vdsm/netconfback. Again only single-line comments:
# original file did not exist
My other productive oVirt 3.4 hosts don't even have these. The directory
/var/lib/vdsm/netconfback is empty on those.
What should/could I check on the remaining 2 hosts prior to the update?
Try syncing the network-configuration and verify the contents in
/var/lib/vdsm/netconfback?
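
A minimal sketch of such a pre-update check, assuming the only thing of interest is whether a file under /var/lib/vdsm/netconfback consists solely of the placeholder comment quoted above. The function name find_placeholder_backups is made up for illustration and is not part of vdsm.

```python
import os

# Placeholder text reported in this thread for the files under netconfback.
PLACEHOLDER = "# original file did not exist"


def find_placeholder_backups(backup_dir):
    """Return the sorted names of files whose only content is the placeholder."""
    placeholders = []
    for name in sorted(os.listdir(backup_dir)):
        path = os.path.join(backup_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            content = f.read().strip()
        if content == PLACEHOLDER:
            placeholders.append(name)
    return placeholders


if __name__ == "__main__":
    backup_dir = "/var/lib/vdsm/netconfback"  # path taken from this thread
    if os.path.isdir(backup_dir):
        for name in find_placeholder_backups(backup_dir):
            print("placeholder backup: %s" % name)
    else:
        print("%s does not exist" % backup_dir)
```

Running it on a host before the update only reports which backup files are placeholders; it does not change anything on disk.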
If the networks have been configured on the host back then, but never
persisted, any reboot (regardless of upgrade) would cause their removal.
Vdsm should be more robust in handling missing ifcfg files, but that's a
second-order bug:
1256252 Vdsm should recover ifcfg files in case they are no
longer exist and recover all networks on the server
I'd like to first understand how come you have these placeholders left
behind.
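
For the second-order robustness issue: the traceback earlier in this thread ends with an IOError (Errno 2) when ifcfg-p4p1 is opened during removeNic. The following is only a sketch of the general pattern of tolerating a missing file, not vdsm's actual code or the fix for the bug above; the helper name read_ifcfg_lines is hypothetical.

```python
import errno


def read_ifcfg_lines(path):
    """Return the lines of an ifcfg file, or an empty list if the file is missing."""
    try:
        with open(path) as f:
            return f.readlines()
    except IOError as e:  # on Python 3, IOError is an alias of OSError
        if e.errno == errno.ENOENT:
            # A missing file is treated as "nothing to clean up".
            return []
        raise  # any other I/O error is still reported
```

With a helper like this, a caller that removes a nic's configuration could skip the cleanup when the file is already gone instead of aborting the whole restoration.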
> The attached supervdsm.log contains everything from network configuration
> done on 2015-08-10 till vdsm update on 2015-09-03 at 17:20 and the reboot
> performed afterwards.
Thanks. Maybe Ido could find further hints inside it.
--
Lobster SCM GmbH, Hindenburgstraße 15, D-82343 Pöcking
HRB 178831, Amtsgericht München
Geschäftsführer: Dr. Martin Fischer, Rolf Henrich