oVirt test clustered DOA after upgrade to latest 4.4.2 (AFAIK). Network issue?

Hello all,

I just upgraded one of my test oVirt setups to the latest version. After the reboot, the machine lost network and the hosted engine didn't start. When I connected to the machine, I noticed that all the /etc/sysconfig/network-scripts/ifcfg-* files had disappeared. No idea why or how. (Possibly I did something wrong and forgot about it.)

Long story short, I copied the two missing files (the ifcfg-onb0 ethernet device configuration and the ifcfg-ovirtmgmt bridge configuration) from another oVirt host, changed the ovirtmgmt IP address and UUID to match the UUID reported in the logs as missing, then restarted NetworkManager and all the oVirt-related services (vdsmd, supervdsmd, ovirt-*, etc.).

Sadly, even with both onb0 and ovirtmgmt up, vdsm still complains about the missing network (ovirtmgmt) and refuses to start the hosted engine. A reboot doesn't seem to change anything.

In the main log I see the following errors:

Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing network: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d
Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing net user: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d
Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing network: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d
Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing net user: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d

As this is one of my oVirt test setups, I can simply redeploy the host and continue from there, but I would rather use this experience to learn how to fix such oVirt issues in the future.

Logs attached.
https://drive.google.com/file/d/12ugy6CuaFaMvXYt6uGT4D_EHIW6nXttb/view?usp=s...
$ PAGER= nmcli connection show
NAME       UUID                                  TYPE      DEVICE
ovirtmgmt  1da8c5b7-999c-4ada-8287-1f35de6ce21d  bridge    ovirtmgmt
onb0       48332db3-8939-bff3-6b71-772a28c9e7b8  ethernet  onb0

$ PAGER= nmcli device show
GENERAL.DEVICE:           ovirtmgmt
GENERAL.TYPE:             bridge
GENERAL.HWADDR:           FC:AA:14:6B:A8:E0
GENERAL.MTU:              1500
GENERAL.STATE:            100 (connected)
GENERAL.CONNECTION:       ovirtmgmt
GENERAL.CON-PATH:         /org/freedesktop/NetworkManager/ActiveConnection/2
IP4.ADDRESS[1]:           192.168.2.117/24
IP4.GATEWAY:              192.168.2.100
IP4.ROUTE[1]:             dst = 192.168.2.0/24, nh = 0.0.0.0, mt = 425
IP4.ROUTE[2]:             dst = 0.0.0.0/0, nh = 192.168.2.100, mt = 425
IP4.DNS[1]:               192.168.2.100
IP4.DNS[2]:               8.8.8.8
IP6.GATEWAY:              --

GENERAL.DEVICE:           onb0
GENERAL.TYPE:             ethernet
GENERAL.HWADDR:           FC:AA:14:6B:A8:E0
GENERAL.MTU:              1500
GENERAL.STATE:            100 (connected)
GENERAL.CONNECTION:       onb0
GENERAL.CON-PATH:         /org/freedesktop/NetworkManager/ActiveConnection/3
WIRED-PROPERTIES.CARRIER: on
IP4.GATEWAY:              --

GENERAL.DEVICE:           ;vdsmdummy;
GENERAL.TYPE:             bridge
GENERAL.HWADDR:           92:8B:9A:5E:C1:3E
GENERAL.MTU:              1500
GENERAL.STATE:            10 (unmanaged)
GENERAL.CONNECTION:       --
GENERAL.CON-PATH:         --
IP4.GATEWAY:              --
IP6.GATEWAY:              --

GENERAL.DEVICE:           lo
GENERAL.TYPE:             loopback
GENERAL.HWADDR:           00:00:00:00:00:00
GENERAL.MTU:              65536
GENERAL.STATE:            10 (unmanaged)
GENERAL.CONNECTION:       --
GENERAL.CON-PATH:         --
IP4.ADDRESS[1]:           127.0.0.1/8
IP4.GATEWAY:              --
IP6.GATEWAY:              --

GENERAL.DEVICE:           br-int
GENERAL.TYPE:             openvswitch
GENERAL.HWADDR:           8E:15:6A:F8:3C:45
GENERAL.MTU:              1500
GENERAL.STATE:            10 (unmanaged)
GENERAL.CONNECTION:       --
GENERAL.CON-PATH:         --
IP4.GATEWAY:              --
IP6.GATEWAY:              --

GENERAL.DEVICE:           ovs-system
GENERAL.TYPE:             openvswitch
GENERAL.HWADDR:           E2:09:EA:A2:BD:70
GENERAL.MTU:              1500
GENERAL.STATE:            10 (unmanaged)
GENERAL.CONNECTION:       --
GENERAL.CON-PATH:         --
IP4.GATEWAY:              --
IP6.GATEWAY:

- Gilboa

office-wx-otest-vdsm.bz2 <https://drive.google.com/file/d/12ugy6CuaFaMvXYt6uGT4D_EHIW6nXttb/view?usp=drive_web>

On Thu, Oct 22, 2020 at 3:39 PM Gilboa Davara <gilboad@gmail.com> wrote:
Hello all,
Hi,
I just upgraded one of my test oVirt setups to latest.
Post reboot, the machine lost network and hosted engine didn't start. When I connected to the machine, I noticed all the /etc/sysconfig/network-scripts/ifcfg-* files disappeared. No idea why / how. (Possibly I did something wrong and forgot about it.)
It would be nice to know when this happens as it might be a serious problem.
Long story short, I copied the two missing files (the ifcfg-onb0 ethernet device configuration and the ifcfg-ovirtmgmt bridge configuration) from another oVirt host, changed the ovirtmgmt IP address and UUID to match the UUID reported in the logs as missing, then restarted NetworkManager and all the oVirt-related services (vdsmd, supervdsmd, ovirt-*, etc.).
Sadly enough, even with both onb0 and ovirtmgmt up, vdsm still complains about the missing network (ovirtmgmt) and refuses to start the hosted engine. Reboot doesn't seem to change anything.
Unfortunately this won't work. From this it seems like vdsm persistence was broken somehow during the upgrade.
In the main log I see the following errors:
Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing network: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d
Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing net user: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d
Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing network: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d
Oct 22 16:33:31 office-wx-otest vdsm[2634]: WARN Attempting to remove a non existing net user: ovirtmgmt/1da8c5b7-999c-4ada-8287-1f35de6ce21d
As this is one of my oVirt test setups, I can simply redeploy the host and continue from there, but I would rather use this experience to learn how to fix such oVirt issues in the future.
For a start, you can verify that the network is not saved in the vdsm configuration. If you run "vdsm-tool list-nets" on the host, it most likely won't produce anything.

To restore your previous configuration on the host you can use:

cat << EOF > ovirtmgmt.json
{
  "networks": {
    "ovirtmgmt": {
      "netmask": "255.255.255.0",
      "ipv6autoconf": false,
      "nic": "onb0",
      "bridged": true,
      "ipaddr": "192.168.2.117",
      "defaultRoute": true,
      "dhcpv6": false,
      "gateway": "192.168.2.100",
      "mtu": 1500,
      "switch": "legacy",
      "stp": false,
      "bootproto": "none",
      "nameservers": [
        "192.168.2.100",
        "8.8.8.8"
      ]
    }
  },
  "bondings": {},
  "options": {
    "connectivityCheck": false
  }
}
EOF

vdsm-client -f ovirtmgmt.json Host setupNetworks

If that works, you have to persist the configuration before rebooting, either from the engine UI or from the host itself:

vdsm-client Host setSafeNetworkConfig
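For anyone adapting the payload above to a different host, the following is a minimal sketch of how the same JSON could be generated from the values nmcli reports (NIC name, address with prefix, gateway, DNS servers). The helper names prefix_to_netmask and ovirtmgmt_payload are hypothetical and not part of vdsm or oVirt. The sketch assumes a static IPv4 address on a legacy bridge and keeps the remaining keys exactly as they appear in the payload above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of vdsm): they only print the JSON payload.

# Convert a CIDR prefix length (0-32) to a dotted-quad netmask.
prefix_to_netmask() (
    prefix=$1
    mask=""
    for i in 1 2 3 4; do
        # Take up to 8 bits of the prefix for this octet.
        if [ "$prefix" -ge 8 ]; then bits=8; else bits=$prefix; fi
        prefix=$((prefix - bits))
        octet=$((256 - (1 << (8 - bits))))
        mask="${mask}${mask:+.}${octet}"
    done
    echo "$mask"
)

# Print a setupNetworks payload for the ovirtmgmt bridge.
# Usage: ovirtmgmt_payload NIC ADDRESS/PREFIX GATEWAY [NAMESERVER...]
ovirtmgmt_payload() (
    nic=$1
    addr=${2%/*}      # part before the slash
    prefix=${2#*/}    # part after the slash
    gateway=$3
    shift 3
    dns=""
    for ns in "$@"; do
        # Build a comma-separated list of quoted nameservers.
        dns="${dns:+$dns, }\"$ns\""
    done
    cat << EOF
{
  "networks": {
    "ovirtmgmt": {
      "netmask": "$(prefix_to_netmask "$prefix")",
      "ipv6autoconf": false,
      "nic": "$nic",
      "bridged": true,
      "ipaddr": "$addr",
      "defaultRoute": true,
      "dhcpv6": false,
      "gateway": "$gateway",
      "mtu": 1500,
      "switch": "legacy",
      "stp": false,
      "bootproto": "none",
      "nameservers": [ $dns ]
    }
  },
  "bondings": {},
  "options": { "connectivityCheck": false }
}
EOF
)

# Example with the values from this thread (writes nothing until redirected):
#   ovirtmgmt_payload onb0 192.168.2.117/24 192.168.2.100 192.168.2.100 8.8.8.8 > ovirtmgmt.json
```

The output is meant to be reviewed by hand and then passed to the vdsm-client command quoted above. The sketch does not call vdsm itself, so it can be run safely on any machine.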
office-wx-otest-vdsm.bz2 <https://drive.google.com/file/d/12ugy6CuaFaMvXYt6uGT4D_EHIW6nXttb/view?usp=drive_web>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/V5M23EHLZ5GSL2...
Hopefully this helps.

Regards,
Ales

--
Ales Musil
Software Engineer - RHV Network
Red Hat EMEA <https://www.redhat.com>
amusil@redhat.com IM: amusil
<https://red.ht/sig>

Hello,

Many thanks for the prompt reply. Answers in-line.

On Fri, Oct 23, 2020 at 9:16 AM Ales Musil <amusil@redhat.com> wrote:
It would be nice to know when this happens as it might be a serious problem.
I can't offer much beyond the logs I uploaded. I had a similar event in one of my production GlusterFS / oVirt clusters, but restoring the missing ifcfg- files from backup and restarting NetworkManager solved the problem.
Unfortunately this won't work. From this it seems like vdsm persistence was broken somehow during the upgrade.
For a start, you can verify that the network is not saved in the vdsm configuration. If you run "vdsm-tool list-nets" on the host, it most likely won't produce anything.

To restore your previous configuration on the host you can use:

cat << EOF > ovirtmgmt.json
{
  "networks": {
    "ovirtmgmt": {
      "netmask": "255.255.255.0",
      "ipv6autoconf": false,
      "nic": "onb0",
      "bridged": true,
      "ipaddr": "192.168.2.117",
      "defaultRoute": true,
      "dhcpv6": false,
      "gateway": "192.168.2.100",
      "mtu": 1500,
      "switch": "legacy",
      "stp": false,
      "bootproto": "none",
      "nameservers": [
        "192.168.2.100",
        "8.8.8.8"
      ]
    }
  },
  "bondings": {},
  "options": {
    "connectivityCheck": false
  }
}
EOF

vdsm-client -f ovirtmgmt.json Host setupNetworks

If that works, you have to persist the configuration before rebooting, either from the engine UI or from the host itself:

vdsm-client Host setSafeNetworkConfig
Worked like a charm! Thanks!

Have a good weekend,
Gilboa

On Sat, Oct 24, 2020 at 2:01 PM Gilboa Davara <gilboad@gmail.com> wrote:
Hello,
Many thanks for the prompt reply. Answers in-line
On Fri, Oct 23, 2020 at 9:16 AM Ales Musil <amusil@redhat.com> wrote:
It would be nice to know when this happens as it might be a serious problem.
I can't offer much beyond the logs I uploaded. I had a similar event in one of my production GlusterFS / oVirt clusters, but restoring the missing ifcfg- files from backup and restarting NetworkManager solved the problem.
Alright, please don't hesitate to share it if it ever happens again.
Worked like a charm! Thanks!
Have a good weekend, Gilboa
Glad to hear that.

Thanks.
Regards,
Ales

--
Ales Musil
Software Engineer - RHV Network
Red Hat EMEA <https://www.redhat.com>
amusil@redhat.com IM: amusil
<https://red.ht/sig>

On Mon, Oct 26, 2020 at 8:29 AM Ales Musil <amusil@redhat.com> wrote:
Alright, please don't hesitate to share it if it ever happens again.
Will do, thanks!

- Gilboa
participants (2)
- Ales Musil
- Gilboa Davara