Host non-responsive after yum update CentOS7/Ovirt3.6

Hi, I'm working on a CentOS7 based Ovirt 3.6 system (ovirt-engine/db on one machine, two separate ovirt vm hosts) which has been running fine but mostly ignored for 2-3 years. Recently it was decided to update the OS as it was far behind on security updates, so one host was put into maintenance mode, yum update'd, and rebooted. We then tried to take it out of maintenance mode, but it's "non-responsive" now.

If I look in /var/log/ovirt-engine/engine.log on the engine machine I see for this host (vmserver2):

"ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-36) [2bc0978d] Command 'GetCapabilitiesVDSCommand(HostName = vmserver2, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='6725086f-42c0-40eb-91f1-0f2411ea9432', vds='Host[vmserver2,6725086f-42c0-40eb-91f1-0f2411ea9432]'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed"

and thereafter more errors. This keeps repeating in the log.

In the Ovirt GUI I see multiple occurrences of log entries for the problem host:

"vmserver2...command failed: Vds timeout occurred"
"vmserver2...command failed: Heartbeat exceeded"
"vmserver2...command failed: internal error: Unknown CPU model Broadwell-noTSX-IBRS"

Firewall rules look identical to those on the other host, which is working normally but has not been updated.

Any thoughts about how to fix or further troubleshoot this?

On Tue, Jan 1, 2019 at 4:26 AM <jaherring@usa.net> wrote:
> [...] one host was put into maintenance mode, yum update'd, rebooted, and then it was attempted to take out of maintenance mode but it's "non-responsive" now.
> [...]
> Any thoughts about how to fix or further troubleshoot this?

Please note that 3.6 is old and was not tested with recent CentOS (at least AFAIK).

Does vdsm manage to start? Please check/share its logs.

Thanks. Best regards,
--
Didi
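
A quick way to answer the "does vdsm manage to start?" question is to ask systemd for the state of the services involved. The following is only a minimal sketch: it assumes Python 3 is available on the host (the stock CentOS 7 interpreter is 2.7), and the unit list is an assumption built from the service names that come up in this thread plus libvirtd. `unit_state` and `report` are hypothetical helper names.

```python
import subprocess

# Assumed unit list: vdsm-network, vdsmd and mom-vdsm appear in this thread;
# libvirtd is added because vdsm talks to libvirt.
UNITS = ["libvirtd", "vdsm-network", "vdsmd", "mom-vdsm"]


def unit_state(unit, run=subprocess.run):
    """Return what `systemctl is-active <unit>` prints, e.g. 'active' or 'failed'."""
    result = run(
        ["systemctl", "is-active", unit],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout.strip() or "unknown"


def report(units=UNITS, run=subprocess.run):
    """Map each unit name to its reported state."""
    return {unit: unit_state(unit, run) for unit in units}
```

Calling `report()` on the problem host and printing the result gives a one-glance summary to paste into a reply. The `run` parameter exists only so the helper can be exercised without systemd.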

I assumed that if Ovirt 3.6 required older versions of CentOS7 packages, etc., the dependencies would not allow the yum update. I guess that's a bad assumption.

In fact, vdsm does not start. Here's the journal from an attempt to start it:

# journalctl -xe
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: libvirt.libvirtError: authentication failed: authentication failed
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: Traceback (most recent call last):
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: File "/usr/bin/vdsm-tool", line 219, in main
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: return tool_command[cmd]["command"](*args)
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 41, in restore_command
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: exec_restore(cmd)
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 54, in exec_restore
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: raise EnvironmentError('Failed to restore the persisted networks')
Jan 19 10:13:36 vmserver2 vdsm-tool[6794]: EnvironmentError: Failed to restore the persisted networks
Jan 19 10:13:36 vmserver2 systemd[1]: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
Jan 19 10:13:36 vmserver2 systemd[1]: Failed to start Virtual Desktop Server Manager network restoration.
-- Subject: Unit vdsm-network.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit vdsm-network.service has failed.
--
-- The result is failed.
Jan 19 10:13:36 vmserver2 systemd[1]: Dependency failed for Virtual Desktop Server Manager.
-- Subject: Unit vdsmd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit vdsmd.service has failed.
--
-- The result is dependency.
Jan 19 10:13:36 vmserver2 systemd[1]: Dependency failed for MOM instance configured for VDSM purposes.
-- Subject: Unit mom-vdsm.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit mom-vdsm.service has failed.
--
-- The result is dependency.
Jan 19 10:13:36 vmserver2 systemd[1]: Job mom-vdsm.service/start failed with result 'dependency'.
Jan 19 10:13:36 vmserver2 systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Jan 19 10:13:36 vmserver2 systemd[1]: Unit vdsm-network.service entered failed state.
Jan 19 10:13:36 vmserver2 systemd[1]: vdsm-network.service failed.
Jan 19 10:13:36 vmserver2 polkitd[4789]: Unregistered Authentication Agent for unix-process:6781:161167654 (system bus name :1.953, object path /org/freedesktop/Policy
lines 2643-2681/2681 (END)
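
The journal above mixes one real failure with several knock-on "dependency" failures. As a rough sketch of how to separate the two, the snippet below scans journal text for the two systemd message shapes that appear in this output. The function name `classify_failures` is hypothetical, and the patterns only cover the wording seen here; other systemd versions may phrase these messages differently.

```python
import re

# Message shapes copied from the journal output in this thread.
UNIT_FAILED = re.compile(r"Unit (\S+\.service) entered failed state")
JOB_FAILED = re.compile(r"Job (\S+\.service)/start failed with result '([^']+)'")


def classify_failures(journal_text):
    """Return (units that failed themselves, units that failed only as dependents)."""
    failed = []
    dependents = []
    for match in UNIT_FAILED.finditer(journal_text):
        if match.group(1) not in failed:
            failed.append(match.group(1))
    for match in JOB_FAILED.finditer(journal_text):
        unit, result = match.group(1), match.group(2)
        if result == "dependency" and unit not in dependents:
            dependents.append(unit)
    return failed, dependents
```

Fed the output above, this should single out vdsm-network.service as the unit to investigate, with vdsmd.service and mom-vdsm.service listed only as dependents.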

So I tried to start vdsm-network.service manually and got the following, which suggests looking in upgrade.log:

# systemctl status vdsm-network.service
● vdsm-network.service - Virtual Desktop Server Manager network restoration
Loaded: loaded (/usr/lib/systemd/system/vdsm-network.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sat 2019-01-19 10:25:02 PST; 8s ago
Process: 6845 ExecStart=/usr/bin/vdsm-tool restore-nets (code=exited, status=1/FAILURE)
Process: 6837 ExecStartPre=/usr/bin/vdsm-tool --vvverbose --append --logfile=/var/log/vdsm/upgrade.log upgrade-unified-persistence (code=exited, status=0/SUCCESS)
Main PID: 6845 (code=exited, status=1/FAILURE)

Jan 19 10:25:02 vmserver2 vdsm-tool[6845]: return tool_command[cmd]["command"](*args)
Jan 19 10:25:02 vmserver2 vdsm-tool[6845]: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 41, in restore_command
Jan 19 10:25:02 vmserver2 vdsm-tool[6845]: exec_restore(cmd)
Jan 19 10:25:02 vmserver2 vdsm-tool[6845]: File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 54, in exec_restore
Jan 19 10:25:02 vmserver2 vdsm-tool[6845]: raise EnvironmentError('Failed to restore the persisted networks')
Jan 19 10:25:02 vmserver2 vdsm-tool[6845]: EnvironmentError: Failed to restore the persisted networks
Jan 19 10:25:02 vmserver2 systemd[1]: vdsm-network.service: main process exited, code=exited, status=1/FAILURE
Jan 19 10:25:02 vmserver2 systemd[1]: Failed to start Virtual Desktop Server Manager network restoration.
Jan 19 10:25:02 vmserver2 systemd[1]: Unit vdsm-network.service entered failed state.
Jan 19 10:25:02 vmserver2 systemd[1]: vdsm-network.service failed.

In upgrade.log I see this:

# cat upgrade.log
MainThread::DEBUG::2016-01-10 10:54:47,615::upgrade::90::upgrade::(apply_upgrade) Running upgrade upgrade-unified-persistence
MainThread::DEBUG::2016-01-10 10:54:47,623::libvirtconnection::160::root::(get) trying to connect libvirt
MainThread::DEBUG::2016-01-10 10:54:47,639::netinfo::714::root::(_get_gateway) The gateway 192.168.1.1 is duplicated for the device em1
MainThread::DEBUG::2016-01-10 10:54:47,647::utils::669::root::(execCmd) /sbin/ip route show to 0.0.0.0/0 table main (cwd None)
MainThread::DEBUG::2016-01-10 10:54:47,650::utils::687::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0
MainThread::DEBUG::2016-01-10 10:54:47,651::unified_persistence::46::root::(run) upgrade-unified-persistence upgrade persisting networks {} and bondings {}
MainThread::INFO::2016-01-10 10:54:47,651::netconfpersistence::179::root::(_clearDisk) Clearing /var/run/vdsm/netconf/nets/ and /var/run/vdsm/netconf/bonds/
MainThread::DEBUG::2016-01-10 10:54:47,651::netconfpersistence::187::root::(_clearDisk) No existent config to clear.
MainThread::INFO::2016-01-10 10:54:47,652::netconfpersistence::179::root::(_clearDisk) Clearing /var/run/vdsm/netconf/nets/ and /var/run/vdsm/netconf/bonds/
MainThread::DEBUG::2016-01-10 10:54:47,652::netconfpersistence::187::root::(_clearDisk) No existent config to clear.
MainThread::INFO::2016-01-10 10:54:47,652::netconfpersistence::129::root::(save) Saved new config RunningConfig({}, {}) to /var/run/vdsm/netconf/nets/ and /var/run/vdsm/netconf/bonds/
MainThread::DEBUG::2016-01-10 10:54:47,652::utils::669::root::(execCmd) /usr/share/vdsm/vdsm-store-net-config unified (cwd None)
MainThread::DEBUG::2016-01-10 10:54:47,672::utils::687::root::(execCmd) SUCCESS: <err> = 'cp: cannot stat \xe2\x80\x98/var/run/vdsm/netconf\xe2\x80\x99: No such file or directory\n'; <rc> = 0
MainThread::DEBUG::2016-01-10 10:54:47,672::upgrade::51::upgrade::(_upgrade_seal) Upgrade upgrade-unified-persistence successfully performed

I see some references on the mailing list to allowing duplicate gateways if they are identical...
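
One detail worth checking in the upgrade.log excerpt is the timestamps: the entries shown are dated 2016-01-10, while the failed start happened on 2019-01-19, so these lines may come from an earlier run rather than the current failure. The sketch below splits log lines of the shape shown above on `::` and filters by date. The field layout is inferred only from this excerpt, and `parse_vdsm_log_line` and `entries_since` are hypothetical names, not part of vdsm.

```python
from datetime import datetime


def parse_vdsm_log_line(line):
    """Split 'Thread::LEVEL::timestamp::module::lineno::logger::message' into a dict."""
    parts = line.strip().split("::", 6)  # maxsplit keeps any '::' inside the message
    if len(parts) != 7:
        return None
    _thread, level, stamp, module, _lineno, _logger, message = parts
    try:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None
    return {"level": level, "time": when, "module": module, "message": message}


def entries_since(lines, cutoff):
    """Return parsed entries whose timestamp is at or after `cutoff`."""
    parsed = (parse_vdsm_log_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["time"] >= cutoff]
```

For example, `entries_since(open("/var/log/vdsm/upgrade.log"), datetime(2019, 1, 19))` would show only what was written on or after the day of the failed start, if anything.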

If I try to run "vdsm-tool restore-nets", which is what starting the vdsm-network.service seems to do first, I get the following:

( a large number of lines of the first error)
......
libvirt: XML-RPC error : authentication failed: authentication failed
libvirt: XML-RPC error : authentication failed: authentication failed
libvirt: XML-RPC error : authentication failed: authentication failed
libvirt: XML-RPC error : authentication failed: authentication failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vdsm-restore-net-config", line 476, in <module>
    restore(args)
  File "/usr/share/vdsm/vdsm-restore-net-config", line 434, in restore
    _restore_sriov_numvfs()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 84, in _restore_sriov_numvfs
    sriov_devices = _get_sriov_devices()
  File "/usr/share/vdsm/vdsm-restore-net-config", line 56, in _get_sriov_devices
    devices = hostdev.list_by_caps()
  File "/usr/share/vdsm/hostdev.py", line 219, in list_by_caps
    libvirt_devices = _get_devices_from_libvirt()
  File "/usr/share/vdsm/hostdev.py", line 204, in _get_devices_from_libvirt
    for device in libvirtconnection.get().listAllDevices(0))
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 164, in get
    password)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 100, in open_connection
    return utils.retry(libvirtOpen, timeout=10, sleep=0.2)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 959, in retry
    return func()
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 104, in openAuth
    if ret is None:raise libvirtError('virConnectOpenAuth() failed')
libvirt.libvirtError: authentication failed: authentication failed
Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 219, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 41, in restore_command
    exec_restore(cmd)
  File "/usr/lib/python2.7/site-packages/vdsm/tool/restore_nets.py", line 54, in exec_restore
    raise EnvironmentError('Failed to restore the persisted networks')
EnvironmentError: Failed to restore the persisted networks

On Sat, Jan 19, 2019 at 9:14 PM Jason Herring <jaherring@usa.net> wrote:
> If I try to run "vdsm-tool restore-nets", which is what starting the vdsm-network.service seems to do first, I get the following:
> ( a large number of lines of the first error)
> ......
> libvirt: XML-RPC error : authentication failed: authentication failed
> [...]
> libvirt.libvirtError: authentication failed: authentication failed

This is what happens when libvirt is not configured to work with vdsm. Can you share your /etc/libvirt/libvirtd.conf?

Would you try `vdsm-tool configure --force`? (Bad for debugging the root cause, but it may just fix your issue.)

> [...]
> EnvironmentError: Failed to restore the persisted networks
> _______________________________________________
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-leave@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/N5JHE4RECU56UJ...
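
When sharing /etc/libvirt/libvirtd.conf as requested above, it can help to post only the lines that are actually in effect, since the stock file is mostly comments. The following is a minimal sketch that assumes the file uses plain `key = value` lines with `#` comments; `effective_settings` is a hypothetical helper name.

```python
def effective_settings(conf_text):
    """Return the uncommented key/value pairs from libvirtd.conf-style text."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comment lines, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Illustrative input only: these values are made up, not taken from the poster's host.
SAMPLE = '''
# comment lines are ignored
auth_unix_rw = "sasl"
listen_tcp = 0
'''
```

On the host, `effective_settings(open("/etc/libvirt/libvirtd.conf").read())` gives a short dictionary that is easy to compare against the same output from the working host.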

On Tue, Jan 1, 2019 at 03:26 <jaherring@usa.net> wrote:
> Hi, I working on a CentOS7 based Ovirt 3.6 system (ovirt-engine/db on one machine, two separate ovirt vm hosts) [...] one host was put into maintenance mode, yum update'd, rebooted, and then it was attempted to take out of maintenance mode but it's "non-responsive" now.

Hi, please note oVirt 3.6 is no longer supported. While waiting for 4.3, I would recommend replacing the non-responsive node with an oVirt 4.2.8 + CentOS 7.6 host or oVirt Node 4.2.8, and planning to upgrade the whole data center.

> [...]
> Any thoughts about how to fix or further troubleshoot this?

--
SANDRO BONAZZOLA
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
Red Hat EMEA <https://www.redhat.com/>
sbonazzo@redhat.com
<https://red.ht/sig>

participants (6)
- Dan Kenigsberg
- jaherring@usa.net
- Jason Herring
- Sandro Bonazzola
- Yedidyah Bar David
- Николаев Алексей