On Mon, Sep 07, 2015 at 11:47:48AM +0200, Patrick Hurrelmann wrote:
On 06.09.2015 11:30, Dan Kenigsberg wrote:
> On Fri, Sep 04, 2015 at 10:26:39AM +0200, Patrick Hurrelmann wrote:
>> Hi all,
>>
>> I just updated my existing oVirt 3.5.3 installation (iSCSI hosted-engine on
>> CentOS 7.1). The engine update went fine. Updating the hosts succeeds until the
>> first reboot. After a reboot the host does not come up again. It is missing all
>> network configuration. All network cfgs in /etc/sysconfig/network-scripts are
>> missing except ifcfg-lo. The host boots up without working networking. Using
>> IPMI and config backups, I was able to restore the lost network configs. Once
>> these are restored and the host is rebooted again all seems to be back to good.
>> This has now happened to 2 updated hosts (this installation has a total of 4
>> hosts, so 2 more to debug/try). I'm happy to assist in further debugging.
>>
>> Before updating the second host, I gathered some information. All these hosts
>> have 3 physical nics. One is used for the ovirtmgmt bridge and the other 2 are
>> used for iSCSI storage vlans.
>>
>> ifcfgs before update:
>>
>> /etc/sysconfig/network-scripts/ifcfg-em1
>> # Generated by VDSM version 4.16.20-0.el7.centos
>> DEVICE=em1
>> HWADDR=d0:67:e5:f0:e5:c6
>> BRIDGE=ovirtmgmt
>> ONBOOT=yes
>> NM_CONTROLLED=no
>>
>> /etc/sysconfig/network-scripts/ifcfg-lo
>> DEVICE=lo
>> IPADDR=127.0.0.1
>> NETMASK=255.0.0.0
>> NETWORK=127.0.0.0
>> # If you're having problems with gated making 127.0.0.0/8 a martian,
>> # you can change this to something else (255.255.255.255, for example)
>> BROADCAST=127.255.255.255
>> ONBOOT=yes
>> NAME=loopback
>>
>> /etc/sysconfig/network-scripts/ifcfg-ovirtmgmt
>> # Generated by VDSM version 4.16.20-0.el7.centos
>> DEVICE=ovirtmgmt
>> TYPE=Bridge
>> DELAY=0
>> STP=off
>> ONBOOT=yes
>> IPADDR=1.2.3.16
>> NETMASK=255.255.255.0
>> GATEWAY=1.2.3.11
>> BOOTPROTO=none
>> DEFROUTE=yes
>> NM_CONTROLLED=no
>> HOTPLUG=no
>>
>> /etc/sysconfig/network-scripts/ifcfg-p4p1
>> # Generated by VDSM version 4.16.20-0.el7.centos
>> DEVICE=p4p1
>> HWADDR=68:05:ca:01:bc:0c
>> ONBOOT=no
>> IPADDR=4.5.7.102
>> NETMASK=255.255.255.0
>> BOOTPROTO=none
>> MTU=9000
>> DEFROUTE=no
>> NM_CONTROLLED=no
>>
>> /etc/sysconfig/network-scripts/ifcfg-p3p1
>> # Generated by VDSM version 4.16.20-0.el7.centos
>> DEVICE=p3p1
>> HWADDR=68:05:ca:18:86:45
>> ONBOOT=no
>> IPADDR=4.5.6.102
>> NETMASK=255.255.255.0
>> BOOTPROTO=none
>> MTU=9000
>> DEFROUTE=no
>> NM_CONTROLLED=no
>>
>>
>> ip link before update:
>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT
>> link/ether 46:50:22:7a:f3:9d brd ff:ff:ff:ff:ff:ff
>> 3: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovirtmgmt state UP mode DEFAULT qlen 1000
>> link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>> 4: p3p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>> link/ether 68:05:ca:18:86:45 brd ff:ff:ff:ff:ff:ff
>> 5: p4p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
>> link/ether 68:05:ca:01:bc:0c brd ff:ff:ff:ff:ff:ff
>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
>> link/ether d0:67:e5:f0:e5:c6 brd ff:ff:ff:ff:ff:ff
>> 8: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT
>> link/ether ce:0f:16:49:a7:da brd ff:ff:ff:ff:ff:ff
>>
>> vdsm files before update:
>> /var/lib/vdsm
>> /var/lib/vdsm/bonding-defaults.json
>> /var/lib/vdsm/netconfback
>> /var/lib/vdsm/netconfback/ifcfg-ovirtmgmt
>> /var/lib/vdsm/netconfback/ifcfg-em1
>> /var/lib/vdsm/netconfback/route-ovirtmgmt
>> /var/lib/vdsm/netconfback/rule-ovirtmgmt
>> /var/lib/vdsm/netconfback/ifcfg-p4p1
>> /var/lib/vdsm/netconfback/ifcfg-p3p1
>> /var/lib/vdsm/persistence
>> /var/lib/vdsm/persistence/netconf
>> /var/lib/vdsm/persistence/netconf.1416666697752319079
>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets
>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>> /var/lib/vdsm/upgrade
>> /var/lib/vdsm/upgrade/upgrade-unified-persistence
>> /var/lib/vdsm/transient
>>
>>
>> The files in /var/lib/vdsm/netconfback each contained only a comment:
>> # original file did not exist
> This is quite peculiar. Do you know when these were created?
> Have you made any networking changes on 3.5.3 just before boot?
>
>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/ovirtmgmt
>> {"nic": "em1", "netmask": "255.255.255.0", "bootproto": "none", "ipaddr": "1.2.3.16", "gateway": "1.2.3.11"}
>>
>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san1
>> {"nic": "p3p1", "netmask": "255.255.255.0", "ipaddr": "4.5.6.102", "bridged": "false", "mtu": "9000"}
>>
>> /var/lib/vdsm/persistence/netconf.1416666697752319079/nets/san2
>> {"nic": "p4p1", "netmask": "255.255.255.0", "ipaddr": "4.5.7.102", "bridged": "false", "mtu": "9000"}
>>
>>
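A minimal sketch, not part of vdsm, of how one could check before a reboot that every NIC named in the persisted network definitions still has an ifcfg file. It assumes the layout shown in the listings above (one JSON file per network with a "nic" key); the function name `missing_ifcfg` and its parameters are hypothetical.

```python
import json
import os


def missing_ifcfg(nets_dir, ifcfg_dir):
    """Return {network: nic} for persisted networks whose NIC has no ifcfg file."""
    missing = {}
    for net in sorted(os.listdir(nets_dir)):
        # Each file under nets/ is assumed to hold one JSON object, as in the report.
        with open(os.path.join(nets_dir, net)) as f:
            conf = json.load(f)
        nic = conf.get("nic")
        if nic and not os.path.exists(os.path.join(ifcfg_dir, "ifcfg-" + nic)):
            missing[net] = nic
    return missing
```

On a host like the one described here it could be called with the persisted nets directory (for example /var/lib/vdsm/persistence/netconf.1416666697752319079/nets) and /etc/sysconfig/network-scripts; an empty result means every persisted NIC has a matching ifcfg file.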
>> After the update and reboot, no ifcfg scripts are left. Only interface lo is up.
>> Syslog does not seem to contain anything suspicious before the reboot.
>
> Have you tweaked vdsm.conf in any way? In particular did you set
> net_persistence?
>
>> Log excerpts from bootup:
>>
>> Sep 3 17:27:23 vhm-prd-02 network: Bringing up loopback interface: [ OK ]
>> Sep 3 17:27:23 vhm-prd-02 systemd-ovirt-ha-agent: Starting ovirt-ha-agent: [ OK ]
>> Sep 3 17:27:23 vhm-prd-02 systemd: Started oVirt Hosted Engine High Availability Monitoring Agent.
>> Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): em1: link is not ready
>> Sep 3 17:27:23 vhm-prd-02 kernel: device em1 entered promiscuous mode
>> Sep 3 17:27:23 vhm-prd-02 network: Bringing up interface em1: [ OK ]
>> Sep 3 17:27:23 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_UP): ovirtmgmt: link is not ready
>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Joining mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: New relevant interface ovirtmgmt.IPv4 for mDNS.
>> Sep 3 17:27:25 vhm-prd-02 avahi-daemon[778]: Registering new address record for 1.2.3.16 on ovirtmgmt.IPv4.
>> Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Link is up at 1000 Mbps, full duplex
>> Sep 3 17:27:26 vhm-prd-02 kernel: tg3 0000:03:00.0 em1: Flow control is off for TX and off for RX
>> Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
>> Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>> Sep 3 17:27:26 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered forwarding state
>> Sep 3 17:27:26 vhm-prd-02 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ovirtmgmt: link becomes ready
>> Sep 3 17:27:26 vhm-prd-02 network: Bringing up interface ovirtmgmt: [ OK ]
>> Sep 3 17:27:26 vhm-prd-02 systemd: Started LSB: Bring up/down networking.
>> Sep 3 17:27:26 vhm-prd-02 systemd: Starting Network.
>> Sep 3 17:27:26 vhm-prd-02 systemd: Reached target Network.
>>
>> So ovirtmgmt and em1 were restored and initialized just fine (p3p1 and p4p1
>> should have been started, too, but engine configured them as ONBOOT=no).
>>
>> Further in messages (full log is attached):
> Would you also attach your post-boot supervdsm.log?
>
>> Sep 3 17:27:26 vhm-prd-02 systemd: Starting Virtual Desktop Server Manager network restoration...
>> Sep 3 17:27:26 vhm-prd-02 systemd: Started OSAD daemon.
>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Terminate Plymouth Boot Screen.
>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Wait for Plymouth Boot Screen to Quit.
>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Serial Getty on ttyS1...
>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Serial Getty on ttyS1.
>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Getty on tty1...
>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Getty on tty1.
>> Sep 3 17:27:27 vhm-prd-02 systemd: Starting Login Prompts.
>> Sep 3 17:27:27 vhm-prd-02 systemd: Reached target Login Prompts.
>> Sep 3 17:27:27 vhm-prd-02 iscsid: iSCSI daemon with pid=1300 started!
>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.*.
>> Sep 3 17:27:27 vhm-prd-02 kdumpctl: kexec: loaded kdump kernel
>> Sep 3 17:27:27 vhm-prd-02 kdumpctl: Starting kdump: [OK]
>> Sep 3 17:27:27 vhm-prd-02 systemd: Started Crash recovery kernel arming.
>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Registering new address record for fe80::d267:e5ff:fef0:e5c6 on em1.*.
>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for 1.2.3.16 on ovirtmgmt.
>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Leaving mDNS multicast group on interface ovirtmgmt.IPv4 with address 1.2.3.16.
>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Interface ovirtmgmt.IPv4 no longer relevant for mDNS.
>> Sep 3 17:27:27 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>> Sep 3 17:27:27 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on ovirtmgmt.
>> Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing address record for fe80::d267:e5ff:fef0:e5c6 on em1.
>> Sep 3 17:27:28 vhm-prd-02 kernel: device em1 left promiscuous mode
>> Sep 3 17:27:28 vhm-prd-02 kernel: ovirtmgmt: port 1(em1) entered disabled state
>> Sep 3 17:27:28 vhm-prd-02 avahi-daemon[778]: Withdrawing workstation service for ovirtmgmt.
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 345, in <module>
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: restore(args)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 314, in restore
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: unified_restoration()
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/vdsm-restore-net-config", line 93, in unified_restoration
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: setupNetworks(nets, bonds, connectivityCheck=False, _inRollback=True)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 642, in setupNetworks
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: implicitBonding=False, _netinfo=_netinfo)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 213, in wrapped
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: ret = func(**attrs)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/api.py", line 429, in delNetwork
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: netEnt.remove()
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/models.py", line 100, in remove
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configurator.removeNic(self)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 215, in removeNic
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: self.configApplier.removeNic(nic.name)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/share/vdsm/network/configurators/ifcfg.py", line 657, in removeNic
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: with open(cf) as nicFile:
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: IOError: [Errno 2] No such file or directory: u'/etc/sysconfig/network-scripts/ifcfg-p4p1'
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: Traceback (most recent call last):
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/bin/vdsm-tool", line 219, in main
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: return tool_command[cmd]["command"](*args)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 40, in restore_command
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: exec_restore(cmd)
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 53, in exec_restore
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: raise EnvironmentError('Failed to restore the persisted networks')
>> Sep 3 17:27:28 vhm-prd-02 vdsm-tool: EnvironmentError: Failed to restore the persisted networks
>> Sep 3 17:27:28 vhm-prd-02 systemd: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
>> Sep 3 17:27:28 vhm-prd-02 systemd: Failed to start Virtual Desktop Server Manager network restoration.
>> Sep 3 17:27:28 vhm-prd-02 systemd: Dependency failed for Virtual Desktop Server Manager.
>> Sep 3 17:27:28 vhm-prd-02 systemd:
>> Sep 3 17:27:28 vhm-prd-02 systemd: Unit vdsm-network.service entered failed state.
>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Postfix Mail Transport Agent.
>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Multi-User System.
>> Sep 3 17:27:33 vhm-prd-02 systemd: Reached target Multi-User System.
>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Update UTMP about System Runlevel Changes...
>> Sep 3 17:27:33 vhm-prd-02 systemd: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
>> Sep 3 17:27:33 vhm-prd-02 systemd: Started Update UTMP about System Runlevel Changes.
>> Sep 3 17:27:33 vhm-prd-02 systemd: Startup finished in 2.964s (kernel) + 2.507s (initrd) + 15.996s (userspace) = 21.468s.
>>
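The traceback above ends in an unguarded open() of an ifcfg file that no longer exists. The following is only a sketch of the general pattern for tolerating that case; it is not vdsm's code and not the actual fix. The helper name `read_ifcfg_lines` is hypothetical, and treating a missing file as "nothing to read" is an assumption about what the caller wants.

```python
import errno


def read_ifcfg_lines(path):
    """Return the lines of an ifcfg file, or [] if the file is already gone."""
    try:
        with open(path) as nic_file:
            return nic_file.readlines()
    except IOError as e:  # on Python 3, IOError is an alias of OSError
        # Only swallow "No such file or directory"; re-raise anything else.
        if e.errno != errno.ENOENT:
            raise
        return []
```

Any other error, such as a permission problem, still propagates, so only the "file already removed" case is treated as harmless.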
>> So, as I have two more hosts that need updating, I'm happy to assist in
>> bisecting and debugging this update issue. Suggestions and help are very
>> welcome.
> Thanks for this important report. I assume that calling
>
> vdsClient -s 0 setSafeNetworkConfig
>
> on the host before upgrade would make your problems go away. Please do
> not do that yet: your assistance in debugging this further is
> important.
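Since the first two hosts were recovered from config backups, it may be worth taking a fresh copy of the network scripts before the remaining hosts are updated. A minimal sketch, not part of vdsm: `backup_ifcfg` is a hypothetical helper, and the choice of the ifcfg-*, route-* and rule-* patterns is an assumption based on the file names that appear in this thread.

```python
import glob
import os
import tarfile


def backup_ifcfg(src_dir, archive_path):
    """Write all ifcfg-*, route-* and rule-* files in src_dir to a gzip tarball.

    Returns the list of file names that were archived.
    """
    names = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for pattern in ("ifcfg-*", "route-*", "rule-*"):
            for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
                # Store files under their bare names so they can be
                # extracted straight back into the scripts directory.
                tar.add(path, arcname=os.path.basename(path))
                names.append(os.path.basename(path))
    return names
```

For example, backup_ifcfg("/etc/sysconfig/network-scripts", "/root/ifcfg-backup.tar.gz") would archive the scripts listed earlier; the archive location is only an example.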
Hi Dan,

From backups I could extract the pre-update timestamps of the files in
/var/lib/vdsm/netconfback:

ifcfg-em1 2015-08-10 16:40:19
ifcfg-ovirtmgmt 2015-08-10 16:40:19
ifcfg-p3p1 2015-08-10 16:40:25
ifcfg-p4p1 2015-08-10 16:40:22
route-ovirtmgmt 2015-08-10 16:40:20
rule-ovirtmgmt 2015-08-10 16:40:20

The ifcfg scripts had the same corresponding timestamps:

ifcfg-em1 2015-08-10 16:40:19
ifcfg-lo 2015-01-15 09:57:03
ifcfg-ovirtmgmt 2015-08-10 16:40:19
ifcfg-p3p1 2015-08-10 16:40:25
ifcfg-p4p1 2015-08-10 16:40:22

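For the two hosts that have not been updated yet, the same timestamp comparison can be gathered directly instead of from backups. A small sketch; the function name `mtimes` and the prefix filter are illustrative choices, not anything vdsm provides.

```python
import os
import time


def mtimes(directory, prefixes=("ifcfg-", "route-", "rule-")):
    """Map file name -> 'YYYY-MM-DD HH:MM:SS' local modification time."""
    result = {}
    for name in sorted(os.listdir(directory)):
        if name.startswith(prefixes):
            ts = os.path.getmtime(os.path.join(directory, name))
            result[name] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(ts))
    return result
```

Running it once on /var/lib/vdsm/netconfback and once on /etc/sysconfig/network-scripts gives two dictionaries that can be compared entry by entry.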
Do you recall what was done on 2015-08-10?
Has your 3.5.3 host been rebooted since then?

If the networks were configured on the host back then, but never
persisted, any reboot (regardless of upgrade) would cause their removal.

Vdsm should be more robust in handling missing ifcfg files, but that's a
second-order bug:

1256252 Vdsm should recover ifcfg files in case they are no
longer exist and recover all networks on the server

I'd like to first understand how come you have these placeholders left
behind.

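To see whether the remaining hosts carry the same placeholders, something along these lines could be run before touching them. It is a sketch only: the placeholder text is copied from the report above, `placeholder_backups` is a hypothetical name, and stripping surrounding whitespace before comparing is an assumption about how the files were written.

```python
import os

# Comment text as quoted in the original report.
PLACEHOLDER = "# original file did not exist"


def placeholder_backups(backup_dir):
    """Return names of backup files whose only content is the placeholder comment."""
    found = []
    for name in sorted(os.listdir(backup_dir)):
        path = os.path.join(backup_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            if f.read().strip() == PLACEHOLDER:
                found.append(name)
    return found
```

A non-empty result for /var/lib/vdsm/netconfback would suggest that host is in the same state as the two that lost their configuration.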
The attached supervdsm.log contains everything from the network configuration
done on 2015-08-10 until the vdsm update on 2015-09-03 at 17:20 and the reboot
performed afterwards.

Thanks. Maybe Ido could find further hints inside it.