
Hi everyone,

it turns out that ifcfg files can be lost even in this very simple scenario:

1) Install/upgrade to VDSM 4.16.21/oVirt 3.5.4
2) Set up a network over eth0:
   vdsClient -s 0 setupNetworks 'networks={pokus:{nic:eth0,bootproto:dhcp,blockingdhcp:true,bridged:false}}'
3) Persist the configuration (declare it safe):
   vdsClient -s 0 setSafeNetworkConfig
4) Add a placeholder in /var/lib/vdsm/netconfback/ifcfg-eth0 containing only:
   # original file did not exist
5) Reboot

I created a fix [1] and prepared backports to the 3.6 [2] and 3.5 [3] branches (so that it appears in 3.5.5), and linked it to https://bugzilla.redhat.com/show_bug.cgi?id=1256252

Patrick, to apply the patch you can also run the two commands below and paste in the diff (the line after "nicFile.writelines(l)" is a single space, so please add it back if it gets eaten by e-mail goblins):

cd /usr/share/vdsm/
patch -p1

diff --git vdsm/network/configurators/ifcfg.py vdsm/network/configurators/ifcfg.py
index 161a3b2..8332224 100644
--- vdsm/network/configurators/ifcfg.py
+++ vdsm/network/configurators/ifcfg.py
@@ -647,11 +647,21 @@ class ConfigWriter(object):
     def removeNic(self, nic):
         cf = netinfo.NET_CONF_PREF + nic
         self._backup(cf)
-        with open(cf) as nicFile:
-            hwlines = [line for line in nicFile if line.startswith('HWADDR=')]
+        try:
+            with open(cf) as nicFile:
+                hwlines = [line for line in nicFile if line.startswith(
+                    'HWADDR=')]
+        except IOError as e:
+            logging.warning("%s couldn't be read (errno %s)", cf, e.errno)
+            try:
+                hwlines = ['HWADDR=%s\n' % netinfo.gethwaddr(nic)]
+            except IOError as e:
+                logging.exception("couldn't determine hardware address of %s "
+                                  "(errno %s)", nic, e.errno)
+                hwlines = []
         l = [self.CONFFILE_HEADER + '\n', 'DEVICE=%s\n' % nic,
              'ONBOOT=yes\n', 'MTU=%s\n' % netinfo.DEFAULT_MTU] + hwlines
-        l += 'NM_CONTROLLED=no\n'
+        l.append('NM_CONTROLLED=no\n')
         with open(cf, 'w') as nicFile:
             nicFile.writelines(l)
 
Michael, will you please give it a try as well?
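For anyone who wants to see the fallback behaviour of the patch in isolation, here is a standalone sketch. The function names, the header comment, and the fixed MTU are mine for illustration; the real code lives in ConfigWriter.removeNic() and uses netinfo.gethwaddr and netinfo.DEFAULT_MTU:

```python
import logging
import os
import tempfile


def read_hwaddr_lines(cf, gethwaddr):
    """Collect HWADDR= lines for a nic, mirroring the patched removeNic():
    prefer the existing ifcfg file; if it is missing, fall back to the
    supplied gethwaddr() callable; if that fails too, return no HWADDR
    line at all instead of crashing the whole network restoration."""
    try:
        with open(cf) as nic_file:
            return [line for line in nic_file
                    if line.startswith('HWADDR=')]
    except IOError as e:
        logging.warning("%s couldn't be read (errno %s)", cf, e.errno)
        try:
            return ['HWADDR=%s\n' % gethwaddr()]
        except IOError:
            logging.exception("couldn't determine hardware address")
            return []


def regenerate_ifcfg(cf, nic, hwlines):
    """Rewrite a minimal ifcfg file for `nic`, as removeNic() does."""
    lines = ['# Generated by VDSM\n', 'DEVICE=%s\n' % nic,
             'ONBOOT=yes\n', 'MTU=1500\n'] + hwlines
    lines.append('NM_CONTROLLED=no\n')
    with open(cf, 'w') as nic_file:
        nic_file.writelines(lines)


# Demo: the ifcfg file is deliberately absent, so the fallback kicks in.
tmp = tempfile.mkdtemp()
cf = os.path.join(tmp, 'ifcfg-eth0')
hwlines = read_hwaddr_lines(cf, lambda: 'd0:67:e5:f0:e5:c6')
regenerate_ifcfg(cf, 'eth0', hwlines)
```

The key point is that a missing ifcfg file no longer aborts restoration with an IOError; the file is simply regenerated from what the system still knows.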
Thanks,
Ondra

[1] https://gerrit.ovirt.org/#/c/45893/
[2] https://gerrit.ovirt.org/#/c/45932/
[3] https://gerrit.ovirt.org/#/c/45933/

----- Original Message -----
From: "Patrick Hurrelmann" <patrick.hurrelmann@lobster.de>
To: "Dan Kenigsberg" <danken@redhat.com>
Cc: "oVirt Mailing List" <users@ovirt.org>
Sent: Monday, September 7, 2015 2:46:05 PM
Subject: Re: [ovirt-users] Host loses all network configuration on update to oVirt 3.5.4
On 07.09.2015 14:44, Patrick Hurrelmann wrote:
On Mon, Sep 07, 2015 at 11:47:48AM +0200, Patrick Hurrelmann wrote:
On 06.09.2015 11:30, Dan Kenigsberg wrote:
On Fri, Sep 04, 2015 at 10:26:39AM +0200, Patrick Hurrelmann wrote:
Hi all,
I just updated my existing oVirt 3.5.3 installation (iSCSI hosted-engine on CentOS 7.1). The engine update went fine. Updating the hosts succeeds until the first reboot. After a reboot the host does not come up again: it is missing all network configuration. All network cfgs in /etc/sysconfig/network-scripts are missing except ifcfg-lo, so the host boots up without working networking. Using IPMI and config backups, I was able to restore the lost network configs. Once these were restored and the host rebooted again, all seemed to be back to good. This has now happened to 2 updated hosts (this installation has a total of 4 hosts, so 2 more to debug/try). I'm happy to assist in further debugging.
Before updating the second host, I gathered some information. All these hosts have 3 physical nics. One is used for the ovirtmgmt bridge and the other 2 are used for iSCSI storage vlans.
ifcfgs before update:
/etc/sysconfig/network-scripts/ifcfg-em1
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=em1
HWADDR=d0:67:e5:f0:e5:c6
BRIDGE=ovirtmgmt
ONBOOT=yes
NM_CONTROLLED=no

/etc/sysconfig/network-scripts/ifcfg-lo
DEVICE=lo
IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
# If you're having problems with gated making 127.0.0.0/8 a martian,
# you can change this to something else (255.255.255.255, for example)
BROADCAST=127.255.255.255
ONBOOT=yes
NAME=loopback

/etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=ovirtmgmt
TYPE=Bridge
DELAY=0
STP=off
ONBOOT=yes
IPADDR=1.2.3.16
NETMASK=255.255.255.0
GATEWAY=1.2.3.11
BOOTPROTO=none
DEFROUTE=yes
NM_CONTROLLED=no
HOTPLUG=no

/etc/sysconfig/network-scripts/ifcfg-p4p1
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=p4p1
HWADDR=68:05:ca:01:bc:0c
ONBOOT=no
IPADDR=4.5.7.102
NETMASK=255.255.255.0
BOOTPROTO=none
MTU=9000
DEFROUTE=no
NM_CONTROLLED=no

/etc/sysconfig/network-scripts/ifcfg-p3p1
# Generated by VDSM version 4.16.20-0.el7.centos
DEVICE=p3p1
HWADDR=68:05:ca:18:86:45
ONBOOT=no
IPADDR=4.5.6.102
NETMASK=255.255.255.0
BOOTPROTO=none
MTU=9000
DEFROUTE=no
NM_CONTROLLED=no
ip link before update:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT
   link/ether 46:50:22:7a:f3:9d brd ff:ff:ff:ff:ff:ff
3: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT qlen 1000
   link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
4: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
   link/ether 68:05:ca:18:86:45 brd ff:ff:ff:ff:ff:ff
5: p4p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
   link/ether 68:05:ca:01:bc:0c brd ff:ff:ff:ff:ff:ff
7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
   link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
8: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT
   link/ether ce:0f:16:49:a7:da brd ff:ff:ff:ff:ff:ff
vdsm files before update:

/var/lib/vdsm
/var/lib/vdsm/bonding-defaults.json
/var/lib/vdsm/netconfback
/var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
/var/lib/vdsm/netconfback/ifcfg-em1
/var/lib/vdsm/netconfback/route-ovirtmgmt
/var/lib/vdsm/netconfback/rule-ovirtmgmt
/var/lib/vdsm/netconfback/ifcfg-p4p1
/var/lib/vdsm/netconfback/ifcfg-p3p1
/var/lib/vdsm/persistence
/var/lib/vdsm/persistence/netconf
/var/lib/vdsm/persistence/netconf.1416666697752319079
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
/var/lib/vdsm/upgrade
/var/lib/vdsm/upgrade/upgrade-unified-persistence
/var/lib/vdsm/transient
The files in /var/lib/vdsm/netconfback each contained only a comment:

# original file did not exist

This is quite peculiar. Do you know when these were created? Have you made any networking changes on 3.5.3 just before the reboot?
/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
{"nic": "em1", "netmask": "255.255.255.0", "bootproto": "none", "ipaddr": "1.2.3.16", "gateway": "1.2.3.11"}

/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
{"nic": "p3p1", "netmask": "255.255.255.0", "ipaddr": "4.5.6.102", "bridged": "false", "mtu": "9000"}

/var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
{"nic": "p4p1", "netmask": "255.255.255.0", "ipaddr": "4.5.7.102", "bridged": "false", "mtu": "9000"}
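For what it's worth, these persisted definitions are plain JSON, one file per network, so they are easy to inspect programmatically. A sketch, assuming only the layout shown above; load_persisted_nets is my name for it, not a VDSM function:

```python
import json
import os
import tempfile


def load_persisted_nets(nets_dir):
    """Load every persisted network definition -- one JSON file per
    network, as stored under /var/lib/vdsm/persistence/netconf*/nets/.
    Sketch only; VDSM has its own loader for these files."""
    nets = {}
    for name in os.listdir(nets_dir):
        with open(os.path.join(nets_dir, name)) as f:
            nets[name] = json.load(f)
    return nets


# Demo with the san1 definition quoted above, written to a temp dir.
nets_dir = tempfile.mkdtemp()
with open(os.path.join(nets_dir, 'san1'), 'w') as f:
    f.write('{"nic": "p3p1", "netmask": "255.255.255.0", '
            '"ipaddr": "4.5.6.102", "bridged": "false", "mtu": "9000"}')
nets = load_persisted_nets(nets_dir)
```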
After the update and reboot, no ifcfg scripts are left. Only interface lo is up. Syslog does not seem to contain anything suspicious before the reboot.

Have you tweaked vdsm.conf in any way? In particular, did you set net_persistence?
Log excerpts from bootup:
Sep 3 17:27:23 vhm-prd-02 network: Bringing up loopback interface: [ OK ]
Sep 3 17:27:23 vhm-prd-02 systemd-ovirt-ha-agent: Starting ovirt-ha-agent: [ OK ]
Sep 3 17:27:23 vhm-prd-02 systemd: Started oVirt Hosted Engine High Availability Monitoring Agent.
Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready
Sep 3 17:27:23 vhm-prd-02 kernel: device em1 entered promiscuous mode
Sep 3 17:27:23 vhm-prd-02 network: Bringing up interface em1: [ OK ]
Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Joining mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: New relevant interface ovirtmgmt.IPv4 for mDNS.
Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Registering new address record for 1.2.3.16 on ovirtmgmt.IPv4.
Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Link is up at 1000 Mbps, full duplex
Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Flow control is off for TX and off for RX
Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ovirtmgmt: link becomes ready
Sep 3 17:27:26 vhm-prd-02 network: Bringing up interface ovirtmgmt: [ OK ]
Sep 3 17:27:26 vhm-prd-02 systemd: Started LSB: Bring up/down networking.
Sep 3 17:27:26 vhm-prd-02 systemd: Starting Network.
Sep 3 17:27:26 vhm-prd-02 systemd: Reached target Network.
So ovirtmgmt and em1 were restored and initialized just fine (p3p1 and p4p1 should have been started too, but the engine configured them as ONBOOT=no).
Further in messages (full log is attached):

Would you also attach your post-boot supervdsm.log?
Sep 3 17:27:26 vhm-prd-02 systemd: Starting Virtual Desktop Server Manager network restoration...
Sep 3 17:27:26 vhm-prd-02 systemd: Started OSAD daemon.
Sep 3 17:27:27 vhm-prd-02 systemd: Started Terminate Plymouth Boot Screen.
Sep 3 17:27:27 vhm-prd-02 systemd: Started Wait for Plymouth Boot Screen to Quit.
Sep 3 17:27:27 vhm-prd-02 systemd: Starting Serial Getty on ttyS1...
Sep 3 17:27:27 vhm-prd-02 systemd: Started Serial Getty on ttyS1.
Sep 3 17:27:27 vhm-prd-02 systemd: Starting Getty on tty1...
Sep 3 17:27:27 vhm-prd-02 systemd: Started Getty on tty1.
Sep 3 17:27:27 vhm-prd-02 systemd: Starting Login Prompts.
Sep 3 17:27:27 vhm-prd-02 systemd: Reached target Login Prompts.
Sep 3 17:27:27 vhm-prd-02 iscsid: iSCSI daemon with pid=1300 started!
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.*.
Sep 3 17:27:27 vhm-prd-02 kdumpctl: kexec: loaded kdump kernel
Sep 3 17:27:27 vhm-prd-02 kdumpctl: Starting kdump: [OK]
Sep 3 17:27:27 vhm-prd-02 systemd: Started Crash recovery kernel arming.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on em1.*.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for 1.2.3.16 on ovirtmgmt.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Leaving mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Interface ovirtmgmt.IPv4 no longer relevant for mDNS.
Sep 3 17:27:27 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.
Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on em1.
Sep 3 17:27:28 vhm-prd-02 kernel: device em1 left promiscuous mode
Sep 3 17:27:28 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing workstation service for ovirtmgmt.
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 345, in <module>
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: restore(args)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 314, in restore
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: unified_restoration()
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 93, in unified_restoration
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: setupNetworks(nets, bonds, connectivityCheck=False, _inRollback=True)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 642, in setupNetworks
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: implicitBonding=False, _netinfo=_netinfo)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 213, in wrapped
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: ret = func(**attrs)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 429, in delNetwork
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: netEnt.remove()
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/models.py", line 100, in remove
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configurator.removeNic(self)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 215, in removeNic
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configApplier.removeNic(nic.name)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 657, in removeNic
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: with open(cf) as nicFile:
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: IOError: [Errno 2] No such file or directory: u'/etc/sysconfig/network-scripts/ifcfg-p4p1'
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: return tool_command[cmd]["command"](*args)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 40, in restore_command
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: exec_restore(cmd)
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 53, in exec_restore
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: raise EnvironmentError('Failed to restore the persisted networks')
Sep 3 17:27:28 vhm-prd-02 vdsm-tool: EnvironmentError: Failed to restore the persisted networks
Sep 3 17:27:28 vhm-prd-02 systemd: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
Sep 3 17:27:28 vhm-prd-02 systemd: Failed to start Virtual Desktop Server Manager network restoration.
Sep 3 17:27:28 vhm-prd-02 systemd: Dependency failed for Virtual Desktop Server Manager.
Sep 3 17:27:28 vhm-prd-02 systemd: Unit vdsm-network.service entered failed state.
Sep 3 17:27:33 vhm-prd-02 systemd: Started Postfix Mail Transport Agent.
Sep 3 17:27:33 vhm-prd-02 systemd: Starting Multi-User System.
Sep 3 17:27:33 vhm-prd-02 systemd: Reached target Multi-User System.
Sep 3 17:27:33 vhm-prd-02 systemd: Starting Update UTMP about System Runlevel Changes...
Sep 3 17:27:33 vhm-prd-02 systemd: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Sep 3 17:27:33 vhm-prd-02 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Sep 3 17:27:33 vhm-prd-02 systemd: Started Update UTMP about System Runlevel Changes.
Sep 3 17:27:33 vhm-prd-02 systemd: Startup finished in 2.964s (kernel) + 2.507s (initrd) + 15.996s (userspace) = 21.468s.
So, as I have two more hosts that need updating, I'm happy to assist in bisecting and debugging this update issue. Suggestions and help are very welcome.

Thanks for this important report. I assume that calling
vdsClient -s 0 setSafeNetworkConfig
on the host before the upgrade would make your problems go away. Please do not do that yet - your assistance in debugging this further is important.

Hi Dan,
From backups I could extract the pre-update timestamps of the files in /var/lib/vdsm/netconfback:

ifcfg-em1        2015-08-10 16:40:19
ifcfg-ovirtmgmt  2015-08-10 16:40:19
ifcfg-p3p1       2015-08-10 16:40:25
ifcfg-p4p1       2015-08-10 16:40:22
route-ovirtmgmt  2015-08-10 16:40:20
rule-ovirtmgmt   2015-08-10 16:40:20
The ifcfg scripts had the same corresponding timestamps:

ifcfg-em1        2015-08-10 16:40:19
ifcfg-lo         2015-01-15 09:57:03
ifcfg-ovirtmgmt  2015-08-10 16:40:19
ifcfg-p3p1       2015-08-10 16:40:25
ifcfg-p4p1       2015-08-10 16:40:22

On 07.09.2015 13:54, Dan Kenigsberg wrote:

Do you recall what has been done on 2015-08-10? Has your 3.5.3 host been rebooted since?

I just tried to reconstruct the happenings on 2015-08-10 and it seems that in fact the network configuration was not touched. I was misled by the dates. At that date/time an updated kernel and some more CentOS rpms were installed (the whole cluster was updated one by one). A reboot of this specific host was initiated after the update, at 2015-08-10 16:40:04. The timestamps from my previous email thus fall _within_ the boot-up process. So yes, the host has been rebooted since the update to 3.5.3 (which happened on 2015-06-15).
Reboots since 2015-06-15:

reboot   system boot  3.10.0-229.11.1. Mon Aug 10 16:56 - 14:34 (27+21:37)
reboot   system boot  3.10.0-229.7.2.e Mon Jul 27 17:48 - 16:53 (13+23:05)
reboot   system boot  3.10.0-229.7.2.e Wed Jun 24 16:46 - 17:46 (33+00:59)
reboot   system boot  3.10.0-229.4.2.e Mon Jun 15 18:10 - 16:44 (8+22:34)

Wrong reboot list. The correct reboots for this host are:

reboot   system boot  3.10.0-229.11.1. Thu Sep 3 17:42 - 13:58 (3+20:16)
reboot   system boot  3.10.0-229.11.1. Thu Sep 3 17:27 - 17:40 (00:12)
reboot   system boot  3.10.0-229.11.1. Mon Aug 10 16:40 - 17:23 (24+00:43)
reboot   system boot  3.10.0-229.7.2.e Mon Jul 27 16:52 - 16:33 (13+23:40)
reboot   system boot  3.10.0-229.7.2.e Thu Jul 9 11:10 - 16:49 (18+05:38)
reboot   system boot  3.10.0-229.4.2.e Wed Jun 17 17:27 - 11:07 (21+17:40)
reboot   system boot  3.10.0-229.4.2.e Mon Jun 15 17:22 - 17:23 (2+00:01)
I checked the 2 remaining hosts (still on 3.5.3) and neither has any different content in /var/lib/vdsm/netconfback. Again, only single-line comments:

# original file did not exist
My other productive oVirt 3.4 hosts don't even have these; the directory /var/lib/vdsm/netconfback is empty on those.
What should/could I check on the remaining 2 hosts prior to the update? Try syncing the network-configuration and verify the contents in /var/lib/vdsm/netconfback?
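One way to check the remaining hosts before upgrading would be to list which backup files are mere placeholders. A hypothetical helper, not a VDSM tool; the demo below runs against a throwaway directory instead of the real /var/lib/vdsm/netconfback:

```python
import os
import tempfile

PLACEHOLDER = '# original file did not exist'


def find_placeholders(backup_dir):
    """Return the names of files in backup_dir (normally
    /var/lib/vdsm/netconfback) whose entire content is just the
    placeholder comment VDSM writes when no original file existed."""
    hits = []
    for name in sorted(os.listdir(backup_dir)):
        with open(os.path.join(backup_dir, name)) as f:
            if f.read().strip() == PLACEHOLDER:
                hits.append(name)
    return hits


# Demo: one placeholder backup, one real ifcfg backup.
backup_dir = tempfile.mkdtemp()
with open(os.path.join(backup_dir, 'ifcfg-em1'), 'w') as f:
    f.write(PLACEHOLDER + '\n')
with open(os.path.join(backup_dir, 'ifcfg-p3p1'), 'w') as f:
    f.write('DEVICE=p3p1\nONBOOT=no\n')
placeholders = find_placeholders(backup_dir)
```

Any nic whose backup is only the placeholder would be regenerated from nothing on the next restore, which is exactly the situation the patch above guards against.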
If the networks have been configured on the host back then, but never persisted, any reboot (regardless of upgrade) would cause their removal.
Vdsm should be more robust in handling a missing ifcfg file; but that's a second-order bug:
1256252 Vdsm should recover ifcfg files in case they are no longer exist and recover all networks on the server
I'd like to first understand how come you have these placeholders left behind.
The attached supervdsm.log contains everything from the network configuration done on 2015-08-10 until the vdsm update on 2015-09-03 at 17:20 and the reboot performed afterwards.

Thanks. Maybe Ido could find further hints inside it.
-- Lobster SCM GmbH, Hindenburgstraße 15, D-82343 Pöcking HRB 178831, Amtsgericht München Geschäftsführer: Dr. Martin Fischer, Rolf Henrich
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users