
On Fri, Sep 04, 2015 at 10:26:39AM +0200, Patrick Hurrelmann wrote:
Hi all,
I just updated my existing oVirt 3.5.3 installation (iSCSI hosted-engine on CentOS 7.1). The engine update went fine. Updating a host succeeds until the first reboot. After the reboot the host does not come up again: it is missing all network configuration. All network cfgs in /etc/sysconfig/network-scripts are missing except ifcfg-lo, so the host boots up without working networking. Using IPMI and config backups, I was able to restore the lost network configs. Once these are restored and the host is rebooted again, everything seems to be fine. This has now happened to 2 updated hosts (this installation has a total of 4 hosts, so 2 more to debug/try). I'm happy to assist in further debugging.
Before updating the second host, I gathered some information. All these hosts have 3 physical nics. One is used for the ovirtmgmt bridge and the other 2 are used for iSCSI storage vlans.
ifcfgs before update:
/etc/sysconfig/network-scripts/ifcfg-em1
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=em1
HWADDR=d0:67:e5:f0:e5:c6
BRIDGE=ovirtmgmt
ONBOOT=yes
NM_CONTROLLED=no

/etc/sysconfig/network-scripts/ifcfg-lo
DEVICE=lo
IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
# If you're having problems with gated making 127.0.0.0/8 a martian,
# you can change this to something else (255.255.255.255, for example)
BROADCAST=127.255.255.255
ONBOOT=yes
NAME=loopback

/etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
IPADDR=1.2.3.16
NETMASK=255.255.255.0
GATEWAY=1.2.3.11
BOOTPROTO=none
DEFROUTE=yes
NM_CONTROLLED=no
HOTPLUG=no

/etc/sysconfig/network-scripts/ifcfg-p4p1
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=p4p1
HWADDR=68:05:ca:01:bc:0c
ONBOOT=no
IPADDR=4.5.7.102
NETMASK=255.255.255.0
BOOTPROTO=none
MTU=9000
DEFROUTE=no
NM_CONTROLLED=no

/etc/sysconfig/network-scripts/ifcfg-p3p1
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=p3p1
HWADDR=68:05:ca:18:86:45
ONBOOT=no
IPADDR=4.5.6.102
NETMASK=255.255.255.0
BOOTPROTO=none
MTU=9000
DEFROUTE=no
NM_CONTROLLED=no
ip link before update:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT
    link/ether 46:50:22:7a:f3:9d brd ff:ff:ff:ff:ff:ff
3: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT qlen 1000
    link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
4: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 68:05:ca:18:86:45 brd ff:ff:ff:ff:ff:ff
5: p4p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
    link/ether 68:05:ca:01:bc:0c brd ff:ff:ff:ff:ff:ff
7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
8: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT
    link/ether ce:0f:16:49:a7:da brd ff:ff:ff:ff:ff:ff
vdsm files before update:
/var/lib/vdsm
/var/lib/vdsm/bonding-defaults.json
/var/lib/vdsm/netconfback
/var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
/var/lib/vdsm/netconfback/ifcfg-em1
/var/lib/vdsm/netconfback/route-ovirtmgmt
/var/lib/vdsm/netconfback/rule-ovirtmgmt
/var/lib/vdsm/netconfback/ifcfg-p4p1
/var/lib/vdsm/netconfback/ifcfg-p3p1
/var/lib/vdsm/persistence
/var/lib/vdsm/persistence/netconf
/var/lib/vdsm/persistence/netconf.1416666697752319079
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
/var/lib/vdsm/upgrade
/var/lib/vdsm/upgrade/upgrade-unified-persistence
/var/lib/vdsm/transient
The files in /var/lib/vdsm/netconfback each contained only a comment: # original file did not exist
This is quite peculiar. Do you know when these were created? Have you made any networking changes on 3.5.3 just before boot?
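One way to answer the "when were these created" question is to list the backup files with their modification times and flag the ones that hold nothing but the placeholder comment. The following is only a minimal sketch: the helper name `describe_backups` is hypothetical, and it assumes the placeholder text is exactly the comment quoted above.

```python
import os
import time

# Marker quoted in the report; assumed to be the whole content of a placeholder file.
PLACEHOLDER = "# original file did not exist"


def describe_backups(backup_dir):
    """Return (name, mtime, is_placeholder) for every regular file in backup_dir."""
    rows = []
    for name in sorted(os.listdir(backup_dir)):
        path = os.path.join(backup_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as backup_file:
            content = backup_file.read().strip()
        # The modification time hints at when the backup was written.
        stamp = time.localtime(os.path.getmtime(path))
        mtime = time.strftime("%Y-%m-%d %H:%M:%S", stamp)
        rows.append((name, mtime, content == PLACEHOLDER))
    return rows
```

On an affected host this could be run as `describe_backups("/var/lib/vdsm/netconfback")` before the update, and the timestamps compared with the time of the last network change made from the engine.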
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
{"nic": "em1", "netmask": "255.255.255.0", "bootproto": "none", "ipaddr": "1.2.3.16", "gateway": "1.2.3.11"}

/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
{"nic": "p3p1", "netmask": "255.255.255.0", "ipaddr": "4.5.6.102", "bridged": "false", "mtu": "9000"}

/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
{"nic": "p4p1", "netmask": "255.255.255.0", "ipaddr": "4.5.7.102", "bridged": "false", "mtu": "9000"}
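For the two hosts that are not yet updated, it may help to compare the persisted network definitions with the ifcfg files actually present, both before the update and after the failed boot. The sketch below is not part of vdsm; `missing_ifcfg_files` is a hypothetical helper, and it only looks at the "nic" key seen in the files above, so networks defined on bonds or VLANs would need extra handling.

```python
import json
import os


def missing_ifcfg_files(nets_dir, ifcfg_dir):
    """Return {network: nic} for persisted networks whose NIC has no ifcfg file."""
    missing = {}
    for net in sorted(os.listdir(nets_dir)):
        with open(os.path.join(nets_dir, net)) as net_file:
            attrs = json.load(net_file)
        nic = attrs.get("nic")
        # Only plain NIC-backed networks are checked (assumption, see above).
        if nic and not os.path.exists(os.path.join(ifcfg_dir, "ifcfg-" + nic)):
            missing[net] = nic
    return missing
```

Called with the persisted `nets` directory and /etc/sysconfig/network-scripts, an empty result before the update and a non-empty one afterwards would narrow down the step at which the files disappear.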
After update and reboot, no ifcfg scripts are left. Only interface lo is up. Syslog does not seem to contain anything suspicious before the reboot.
Have you tweaked vdsm.conf in any way? In particular did you set net_persistence?
Log excerpts from bootup:
Sep 3 17:27:23 vhm-prd-02 network: Bringing up loopback interface: [ OK ]
Sep 3 17:27:23 vhm-prd-02 systemd-ovirt-ha-agent: Starting ovirt-ha-agent: [ OK ]
Sep 3 17:27:23 vhm-prd-02 systemd: Started oVirt Hosted Engine High Availability Monitoring Agent.
Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready
Sep 3 17:27:23 vhm-prd-02 kernel: device em1 entered promiscuous mode
Sep 3 17:27:23 vhm-prd-02 network: Bringing up interface em1: [ OK ]
Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Joining mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: New relevant interface ovirtmgmt.IPv4 for mDNS.
Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Registering new address record for 1.2.3.16 on ovirtmgmt.IPv4.
Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Link is up at 1000 Mbps, full duplex
Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Flow control is off for TX and off for RX
Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ovirtmgmt: link becomes ready
Sep 3 17:27:26 vhm-prd-02 network: Bringing up interface ovirtmgmt: [ OK ]
Sep 3 17:27:26 vhm-prd-02 systemd: Started LSB: Bring up/down networking.
Sep 3 17:27:26 vhm-prd-02 systemd: Starting Network.
Sep 3 17:27:26 vhm-prd-02 systemd: Reached target Network.
So ovirtmgmt and em1 were restored and initialized just fine (p3p1 and p4p1 should have been started, too, but engine configured them as ONBOOT=no).
Further in messages (full log is attached):
Would you also attach your post-boot supervdsm.log?
Sep 3 17:27:26 vhm-prd-02 systemd: Starting Virtual Desktop Server Manager network restoration...
Sep 3 17:27:26 vhm-prd-02 systemd: Started OSAD daemon.
Sep 3 17:27:27 vhm-prd-02 systemd: Started Terminate Plymouth Boot Screen.
Sep 3 17:27:27 vhm-prd-02 systemd: Started Wait for Plymouth Boot Screen to Quit.
Sep 3 17:27:27 vhm-prd-02 systemd: Starting Serial Getty on ttyS1...
Sep 3 17:27:27 vhm-prd-02 systemd: Started Serial Getty on ttyS1.
Sep 3 17:27:27 vhm-prd-02 systemd: Starting Getty on tty1...
Sep 3 17:27:27 vhm-prd-02 systemd: Started Getty on tty1.
Sep 3 17:27:27 vhm-prd-02 systemd: Starting Login Prompts.
Sep 3 17:27:27 vhm-prd-02 systemd: Reached target Login Prompts.
Sep 3 17:27:27 vhm-prd-02 iscsid: iSCSI daemon with pid=1300 started!
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.*.
Sep 3 17:27:27 vhm-prd-02 kdumpctl: kexec: loaded kdump kernel
Sep 3 17:27:27 vhm-prd-02 kdumpctl: Starting kdump: [OK]
Sep 3 17:27:27 vhm-prd-02 systemd: Started Crash recovery kernel arming.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on em1.*.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for 1.2.3.16 on ovirtmgmt.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Leaving mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Interface ovirtmgmt.IPv4 no longer relevant for mDNS.
Sep 3 17:27:27 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.
Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on em1.
Sep 3 17:27:28 vhm-prd-02 kernel: device em1 left promiscuous mode
Sep 3 17:27:28 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing workstation service for ovirtmgmt.
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 345, in <module>
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: restore(args)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 314, in restore
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: unified_restoration()
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 93, in unified_restoration
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: setupNetworks(nets, bonds, connectivityCheck=False, _inRollback=True)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 642, in setupNetworks
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: implicitBonding=False, _netinfo=_netinfo)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 213, in wrapped
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: ret = func(**attrs)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 429, in delNetwork
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: netEnt.remove()
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/models.py", line 100, in remove
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configurator.removeNic(self)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 215, in removeNic
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configApplier.removeNic(nic.name)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 657, in removeNic
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: with open(cf) as nicFile:
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: IOError: [Errno 2] No such file or directory: u'/etc/sysconfig/network-scripts/ifcfg-p4p1'
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: return tool_command[cmd]["command"](*args)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 40, in restore_command
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: exec_restore(cmd)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 53, in exec_restore
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: raise EnvironmentError('Failed to restore the persisted networks')
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: EnvironmentError: Failed to restore the persisted networks
Sep 3 17:27:28 vhm-prd-02 systemd: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
Sep 3 17:27:28 vhm-prd-02 systemd: Failed to start Virtual Desktop Server Manager network restoration.
Sep 3 17:27:28 vhm-prd-02 systemd: Dependency failed for Virtual Desktop Server Manager.
Sep 3 17:27:28 vhm-prd-02 systemd:
Sep 3 17:27:28 vhm-prd-02 systemd: Unit vdsm-network.service entered failed state.
Sep 3 17:27:33 vhm-prd-02 systemd: Started Postfix Mail Transport Agent.
Sep 3 17:27:33 vhm-prd-02 systemd: Starting Multi-User System.
Sep 3 17:27:33 vhm-prd-02 systemd: Reached target Multi-User System.
Sep 3 17:27:33 vhm-prd-02 systemd: Starting Update UTMP about System Runlevel Changes...
Sep 3 17:27:33 vhm-prd-02 systemd: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Sep 3 17:27:33 vhm-prd-02 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Sep 3 17:27:33 vhm-prd-02 systemd: Started Update UTMP about System Runlevel Changes.
Sep 3 17:27:33 vhm-prd-02 systemd: Startup finished in 2.964s (kernel) + 2.507s (initrd) + 15.996s (userspace) = 21.468s.
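The first traceback ends in a plain `open(cf)` on an ifcfg file that no longer exists, which raises IOError with errno 2 (ENOENT) and aborts the whole restoration. The sketch below only illustrates that failure mode and one way of tolerating it; `read_nic_config` is a hypothetical helper, not vdsm's code and not a proposed patch.

```python
import errno
import os


def read_nic_config(ifcfg_dir, nic):
    """Return the ifcfg text for nic, or None when the file is absent."""
    path = os.path.join(ifcfg_dir, "ifcfg-" + nic)
    try:
        with open(path) as nic_file:
            return nic_file.read()
    except IOError as e:  # on Python 3, IOError is an alias of OSError
        if e.errno == errno.ENOENT:
            # A missing file is reported to the caller instead of aborting.
            return None
        raise
```

Whether a missing file should be ignored or treated as a hard error during restoration is a design decision for the vdsm maintainers; the sketch just shows where the distinction could be made.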
Since I have two more hosts that need updating, I'm happy to assist in bisecting and debugging this update issue. Suggestions and help are very welcome.
Thanks for this important report. I assume that calling vdsClient -s 0 setSafeNetworkConfig on the host before upgrade would make your problems go away, but please do not do that yet. Your assistance in debugging this further is important.