[ovirt-users] Host loses all network configuration on update to oVirt 3.5.4

Patrick Hurrelmann patrick.hurrelmann at lobster.de
Mon Sep 7 12:44:03 UTC 2015


On 07.09.2015 13:54, Dan Kenigsberg wrote:
> On Mon, Sep 07, 2015 at 11:47:48AM +0200, Patrick Hurrelmann wrote:
>> On 06.09.2015 11:30, Dan Kenigsberg wrote:
>>> On Fri, Sep 04, 2015 at 10:26:39AM +0200, Patrick Hurrelmann wrote:
>>>> Hi all,
>>>>
>>>> I just updated my existing oVirt 3.5.3 installation (iSCSI hosted-engine on
>>>> CentOS 7.1). The engine update went fine. Updating the hosts succeeds until the
>>>> first reboot. After a reboot the host does not come up again. It is missing all
>>>> network configuration. All network cfgs in /etc/sysconfig/network-scripts are
>>>> missing except ifcfg-lo. The host boots up without working networking. Using
>>>> IPMI and config backups, I was able to restore the lost network configs. Once
>>>> these are restored and the host is rebooted again all seems to be back to good.
>>>> This has now happened to 2 updated hosts (this installation has a total of 4
>>>> hosts, so 2 more to debug/try). I'm happy to assist in further debugging.
>>>>
>>>> Before updating the second host, I gathered some information. All these hosts
>>>> have 3 physical nics. One is used for the ovirtmgmt bridge and the other 2 are
>>>> used for iSCSI storage vlans.
>>>>
>>>> ifcfgs before update:
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-em1
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=em1
>>>> HWADDR=d0:67:e5:f0:e5:c6
>>>> BRIDGE=ovirtmgmt
>>>> ONBOOT=yes
>>>> NM_CONTROLLED=no
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-lo
>>>> DEVICE=lo
>>>> IPADDR=127.0.0.1
>>>> NETMASK=255.0.0.0
>>>> NETWORK=127.0.0.0
>>>> # If you're having problems with gated making 127.0.0.0/8 a martian,
>>>> # you can change this to something else (255.255.255.255, for example)
>>>> BROADCAST=127.255.255.255
>>>> ONBOOT=yes
>>>> NAME=loopback
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=ovirtmgmt
>>>> TYPE=Bridge
>>>> DELAY=0
>>>> STP=off
>>>> ONBOOT=yes
>>>> IPADDR=1.2.3.16
>>>> NETMASK=255.255.255.0
>>>> GATEWAY=1.2.3.11
>>>> BOOTPROTO=none
>>>> DEFROUTE=yes
>>>> NM_CONTROLLED=no
>>>> HOTPLUG=no
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-p4p1
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=p4p1
>>>> HWADDR=68:05:ca:01:bc:0c
>>>> ONBOOT=no
>>>> IPADDR=4.5.7.102
>>>> NETMASK=255.255.255.0
>>>> BOOTPROTO=none
>>>> MTU=9000
>>>> DEFROUTE=no
>>>> NM_CONTROLLED=no
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-p3p1
>>>> # Generated by VDSM version 4.16.20-0.el7.centos
>>>> DEVICE=p3p1
>>>> HWADDR=68:05:ca:18:86:45
>>>> ONBOOT=no
>>>> IPADDR=4.5.6.102
>>>> NETMASK=255.255.255.0
>>>> BOOTPROTO=none
>>>> MTU=9000
>>>> DEFROUTE=no
>>>> NM_CONTROLLED=no
>>>>
>>>> /etc/sysconfig/network-scripts/ifcfg-lo
>>>>
>>>>
>>>> ip link before update:
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
>>>>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> 2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT
>>>>     link/ether 46:50:22:7a:f3:9d brd ff:ff:ff:ff:ff:ff
>>>> 3: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT qlen 1000
>>>>     link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>>>> 4: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>>>>     link/ether 68:05:ca:18:86:45 brd ff:ff:ff:ff:ff:ff
>>>> 5: p4p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>>>>     link/ether 68:05:ca:01:bc:0c brd ff:ff:ff:ff:ff:ff
>>>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
>>>>     link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>>>> 8: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT
>>>>     link/ether ce:0f:16:49:a7:da brd ff:ff:ff:ff:ff:ff
>>>>
>>>> vdsm files before update:
>>>> /var/lib/vdsm
>>>> /var/lib/vdsm/bonding-defaults.json
>>>> /var/lib/vdsm/netconfback
>>>> /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
>>>> /var/lib/vdsm/netconfback/ifcfg-em1
>>>> /var/lib/vdsm/netconfback/route-ovirtmgmt
>>>> /var/lib/vdsm/netconfback/rule-ovirtmgmt
>>>> /var/lib/vdsm/netconfback/ifcfg-p4p1
>>>> /var/lib/vdsm/netconfback/ifcfg-p3p1
>>>> /var/lib/vdsm/persistence
>>>> /var/lib/vdsm/persistence/netconf
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>>>> /var/lib/vdsm/upgrade
>>>> /var/lib/vdsm/upgrade/upgrade-unified-persistence
>>>> /var/lib/vdsm/transient
>>>>
>>>>
>>>> The files in /var/lib/vdsm/netconfback each contained only a comment:
>>>> # original file did not exist
>>> This is quite peculiar. Do you know when these were created?
>>> Have you made any networking changes on 3.5.3 just before boot?
>>>
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>>>> {"nic": "em1", "netmask": "255.255.255.0", "bootproto": "none", "ipaddr": "1.2.3.16", "gateway": "1.2.3.11"}
>>>>
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>>>> {"nic": "p3p1", "netmask": "255.255.255.0", "ipaddr": "4.5.6.102", "bridged": "false", "mtu": "9000"}
>>>>
>>>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>>>> {"nic": "p4p1", "netmask": "255.255.255.0", "ipaddr": "4.5.7.102", "bridged": "false", "mtu": "9000"}
>>>>
>>>>
>>>> After update and reboot, no ifcfg scripts are left. Only interface lo is up.
>>>> Syslog does not seem to contain anything suspicious before the reboot.
>>> Have you tweaked vdsm.conf in any way? In particular did you set
>>> net_persistence?
>>>
>>>> Log excerpts from bootup:
>>>>
>>>> Sep  3 17:27:23 vhm-prd-02 network: Bringing up loopback interface:  [  OK  ]
>>>> Sep  3 17:27:23 vhm-prd-02 systemd-ovirt-ha-agent: Starting ovirt-ha-agent: [  OK  ]
>>>> Sep  3 17:27:23 vhm-prd-02 systemd: Started oVirt Hosted Engine High Availability Monitoring Agent.
>>>> Sep  3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready
>>>> Sep  3 17:27:23 vhm-prd-02 kernel: device em1 entered promiscuous mode
>>>> Sep  3 17:27:23 vhm-prd-02 network: Bringing up interface em1:  [  OK  ]
>>>> Sep  3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
>>>> Sep  3 17:27:25 vhm-prd-02 avahi-daemon[778]: Joining mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>>>> Sep  3 17:27:25 vhm-prd-02 avahi-daemon[778]: New relevant interface ovirtmgmt.IPv4 for mDNS.
>>>> Sep  3 17:27:25 vhm-prd-02 avahi-daemon[778]: Registering new address record for 1.2.3.16 on ovirtmgmt.IPv4.
>>>> Sep  3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Link is up at 1000 Mbps, full duplex
>>>> Sep  3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Flow control is off for TX and off for RX
>>>> Sep  3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
>>>> Sep  3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>>>> Sep  3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>>>> Sep  3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ovirtmgmt: link becomes ready
>>>> Sep  3 17:27:26 vhm-prd-02 network: Bringing up interface ovirtmgmt:  [  OK  ]
>>>> Sep  3 17:27:26 vhm-prd-02 systemd: Started LSB: Bring up/down networking.
>>>> Sep  3 17:27:26 vhm-prd-02 systemd: Starting Network.
>>>> Sep  3 17:27:26 vhm-prd-02 systemd: Reached target Network.
>>>>
>>>> So ovirtmgmt and em1 were restored and initialized just fine (p3p1 and p4p1
>>>> should have been started, too, but engine configured them as ONBOOT=no).
>>>>
>>>> Further in messages (full log is attached):
>>> would you also attach your post-boot supervdsm.log?
>>>
>>>> Sep  3 17:27:26 vhm-prd-02 systemd: Starting Virtual Desktop Server Manager network restoration...
>>>> Sep  3 17:27:26 vhm-prd-02 systemd: Started OSAD daemon.
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Started Terminate Plymouth Boot Screen.
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Started Wait for Plymouth Boot Screen to Quit.
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Starting Serial Getty on ttyS1...
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Started Serial Getty on ttyS1.
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Starting Getty on tty1...
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Started Getty on tty1.
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Starting Login Prompts.
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Reached target Login Prompts.
>>>> Sep  3 17:27:27 vhm-prd-02 iscsid: iSCSI daemon with pid=1300 started!
>>>> Sep  3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.*.
>>>> Sep  3 17:27:27 vhm-prd-02 kdumpctl: kexec: loaded kdump kernel
>>>> Sep  3 17:27:27 vhm-prd-02 kdumpctl: Starting kdump: [OK]
>>>> Sep  3 17:27:27 vhm-prd-02 systemd: Started Crash recovery kernel arming.
>>>> Sep  3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on em1.*.
>>>> Sep  3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for 1.2.3.16 on ovirtmgmt.
>>>> Sep  3 17:27:27 vhm-prd-02 avahi-daemon[778]: Leaving mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>>>> Sep  3 17:27:27 vhm-prd-02 avahi-daemon[778]: Interface ovirtmgmt.IPv4 no longer relevant for mDNS.
>>>> Sep  3 17:27:27 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>>>> Sep  3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.
>>>> Sep  3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on em1.
>>>> Sep  3 17:27:28 vhm-prd-02 kernel: device em1 left promiscuous mode
>>>> Sep  3 17:27:28 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>>>> Sep  3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing workstation service for ovirtmgmt.
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 345, in <module>
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: restore(args)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 314, in restore
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: unified_restoration()
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 93, in unified_restoration
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: setupNetworks(nets, bonds, connectivityCheck=False, _inRollback=True)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 642, in setupNetworks
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: implicitBonding=False, _netinfo=_netinfo)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 213, in wrapped
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: ret = func(**attrs)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 429, in delNetwork
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: netEnt.remove()
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/models.py", line 100, in remove
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: self.configurator.removeNic(self)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 215, in removeNic
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: self.configApplier.removeNic(nic.name)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 657, in removeNic
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: with open(cf) as nicFile:
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: IOError: [Errno 2] No such file or directory: u'/etc/sysconfig/network-scripts/ifcfg-p4p1'
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: return tool_command[cmd]["command"](*args)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 40, in restore_command
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: exec_restore(cmd)
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 53, in exec_restore
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: raise EnvironmentError('Failed to restore the persisted networks')
>>>> Sep  3 17:27:28 vhm-prd-02 vdsm-tool: EnvironmentError: Failed to restore the persisted networks
>>>> Sep  3 17:27:28 vhm-prd-02 systemd: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
>>>> Sep  3 17:27:28 vhm-prd-02 systemd: Failed to start Virtual Desktop Server Manager network restoration.
>>>> Sep  3 17:27:28 vhm-prd-02 systemd: Dependency failed for Virtual Desktop Server Manager.
>>>> Sep  3 17:27:28 vhm-prd-02 systemd:
>>>> Sep  3 17:27:28 vhm-prd-02 systemd: Unit vdsm-network.service entered failed state.
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Started Postfix Mail Transport Agent.
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Starting Multi-User System.
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Reached target Multi-User System.
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Starting Update UTMP about System Runlevel Changes...
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Started Update UTMP about System Runlevel Changes.
>>>> Sep  3 17:27:33 vhm-prd-02 systemd: Startup finished in 2.964s (kernel) + 2.507s (initrd) + 15.996s (userspace) = 21.468s.
>>>>
>>>> So, as I have two more hosts, that need updating, I'm happy to assist in
>>>> bisecting and debugging this update issue. Suggestions and help are very
>>>> welcome.
>>> Thanks for this important report. I assume that calling
>>>
>>>   vdsClient -s 0 setSafeNetworkConfig
>>>
>>> on the host before upgrade would make your problems go away. Please do
>>> not do that yet - your assistance in debugging this further is
>>> important.
>> Hi Dan,
>>
>> From backups I could extract the pre-update timestamps of the files in
>> /var/lib/vdsm/netconfback:
>> ifcfg-em1       2015-08-10 16:40:19
>> ifcfg-ovirtmgmt 2015-08-10 16:40:19
>> ifcfg-p3p1      2015-08-10 16:40:25
>> ifcfg-p4p1      2015-08-10 16:40:22
>> route-ovirtmgmt 2015-08-10 16:40:20
>> rule-ovirtmgmt  2015-08-10 16:40:20
>>
>> The ifcfg-scripts had the same corresponding timestamps:
>> ifcfg-em1       2015-08-10 16:40:19
>> ifcfg-lo        2015-01-15 09:57:03
>> ifcfg-ovirtmgmt 2015-08-10 16:40:19
>> ifcfg-p3p1      2015-08-10 16:40:25
>> ifcfg-p4p1      2015-08-10 16:40:22
> Do you recall what has been done on 2015-08-10?
> Was your 3.5.3 host rebooted ever since?
I just tried to reconstruct the events of 2015-08-10, and it seems that the
network configuration was in fact not touched. I was misled by the dates. At that
date/time an updated kernel and some more CentOS rpms were installed (the
whole cluster was updated one host at a time). A reboot of this specific host was
initiated after the update at 2015-08-10 16:40:04. The timestamps from my
previous email still seem to fall _within_ the bootup process. So yes, the host
has been rebooted since the update to 3.5.3 (which happened on 2015-06-15).

Reboots since 2015-06-15:
reboot   system boot  3.10.0-229.11.1. Mon Aug 10 16:56 - 14:34 (27+21:37) 
reboot   system boot  3.10.0-229.7.2.e Mon Jul 27 17:48 - 16:53 (13+23:05) 
reboot   system boot  3.10.0-229.7.2.e Wed Jun 24 16:46 - 17:46 (33+00:59) 
reboot   system boot  3.10.0-229.4.2.e Mon Jun 15 18:10 - 16:44 (8+22:34)

I checked the 2 remaining hosts (still 3.5.3) and neither has any different
content in /var/lib/vdsm/netconfback. Again, only single-line comments:
# original file did not exist

My other production oVirt 3.4 hosts don't even have these. The directory
/var/lib/vdsm/netconfback is empty on those.

What should/could I check on the remaining 2 hosts prior to the update?
Try syncing the network-configuration and verify the contents in
/var/lib/vdsm/netconfback?
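
A minimal sketch of such a pre-update check, assuming the paths quoted earlier
in this thread and assuming that /var/lib/vdsm/persistence/netconf points at
the active netconf.<id> directory. The function names are hypothetical and not
part of vdsm. It lists backup files that hold only the placeholder comment, and
persisted networks whose NIC has no ifcfg file:

```python
from __future__ import print_function

import io
import json
import os

# The single-line comment found in the netconfback files on these hosts.
PLACEHOLDER = "# original file did not exist"


def placeholder_backups(backup_dir):
    """Return names of backup files whose only content is the placeholder."""
    found = []
    if not os.path.isdir(backup_dir):
        return found
    for name in sorted(os.listdir(backup_dir)):
        path = os.path.join(backup_dir, name)
        if not os.path.isfile(path):
            continue
        with io.open(path, encoding="utf-8") as f:
            if f.read().strip() == PLACEHOLDER:
                found.append(name)
    return found


def nets_without_ifcfg(nets_dir, ifcfg_dir):
    """Return (network, nic) pairs whose NIC has no ifcfg-<nic> file."""
    missing = []
    if not os.path.isdir(nets_dir):
        return missing
    for net in sorted(os.listdir(nets_dir)):
        path = os.path.join(nets_dir, net)
        if not os.path.isfile(path):
            continue
        with io.open(path, encoding="utf-8") as f:
            nic = json.load(f).get("nic")  # networks without a "nic" key are skipped
        if nic and not os.path.exists(os.path.join(ifcfg_dir, "ifcfg-" + nic)):
            missing.append((net, nic))
    return missing


if __name__ == "__main__":
    print("placeholder backups:",
          placeholder_backups("/var/lib/vdsm/netconfback"))
    print("networks without ifcfg:",
          nets_without_ifcfg("/var/lib/vdsm/persistence/netconf/nets",
                             "/etc/sysconfig/network-scripts"))
```

The script only reads files, so it should be safe to run on the 2 remaining
hosts before touching them.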

>
> If the networks have been configured on the host back then, but never
> persisted, any reboot (regardless of upgrade) would cause their removal.
>
> Vdsm should be more robust in handling missing ifcfg; but that's a
> second-order bug
>
>     1256252     Vdsm should recover ifcfg files in case they are no
>     longer exist and recover all networks on the server
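
To illustrate the kind of robustness meant here: the traceback above ends in a
plain open() on a missing ifcfg-p4p1. A minimal sketch of a tolerant read
follows. This is not vdsm's actual code, and read_ifcfg_lines is a hypothetical
helper name. It treats a missing file as "nothing to read" and still raises
every other I/O error:

```python
import errno
import io


def read_ifcfg_lines(path):
    """Return the lines of an ifcfg file, or None if the file does not exist."""
    try:
        with io.open(path, encoding="utf-8") as f:
            return f.readlines()
    except (IOError, OSError) as e:
        if e.errno == errno.ENOENT:
            # Missing file: let the caller skip it instead of aborting.
            return None
        raise  # permission problems etc. are still reported
```

A caller could then check for None and continue with the next NIC rather than
failing the whole restoration.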
>
> I'd like to first understand how come you have these placeholders left
> behind.
>
>> The attached supervdsm.log contains everything from network configuration
>> done on 2015-08-10 till vdsm update on 2015-09-03 at 17:20 and the reboot
>> performed afterwards.
> Thanks. Maybe Ido could find further hints inside it

-- 
Lobster SCM GmbH, Hindenburgstraße 15, D-82343 Pöcking
HRB 178831, Amtsgericht München
Geschäftsführer: Dr. Martin Fischer, Rolf Henrich



