ovirt-3.6 : Hosted-engine crashed and can't restart

After assigning an IP address to a VLAN network (it was using DHCP by default) that was on the same NIC as ovirtmgmt, my hosted-engine crashed and can't start again. I have no idea how to fix this. I had a similar issue some months ago but with a different error. I tried restarting the HA agent, which seems to be linked to this error, and also restarted the host. I also tried removing the _DIRECT_IO_ lockfile on the engine storage, as that fixed my problem last time, but it didn't help.

Any ideas? Do you think manually editing the logical networks on the host and reverting them to how they were before the crash could help?

hosted-engine --vm-status

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': 'e41807e5-ee68-40a2-a642-cc226ba0e82d'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>

vdsClient -s 0 list

16450089-911e-4bad-a8b7-98e84a79ef3a
    Status = Down
    nicModel = rtl8139,pv
    statusTime = 4295559350
    exitMessage = Unable to get volume size for domain e41807e5-ee68-40a2-a642-cc226ba0e82d volume 053df3a6-db18-445a-8f75-61c630ab0003
    emulatedMachine = rhel6.5.0
    pid = 0
    vmName = HostedEngine
    devices = [{'index': '0', 'iface': 'virtio', 'format': 'raw', 'bootOrder': '1', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'volumeID': '053df3a6-db18-445a-8f75-61c630ab0003', 'imageID': 'b6daa50d-adad-46a5-8f5f-accfb155a1e1', 'readonly': 'false', 'domainID': 'e41807e5-ee68-40a2-a642-cc226ba0e82d', 'deviceId': 'b6daa50d-adad-46a5-8f5f-accfb155a1e1', 'poolID': '00000000-0000-0000-0000-000000000000', 'device': 'disk', 'shared': 'exclusive', 'propagateErrors': 'off', 'type': 'disk'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:1c:4b:81', 'linkActive': 'true', 'network': 'ovirtmgmt', 'deviceId': '0aeaea2f-a419-43cc-92d7-8422f6aa9223', 'address': 'None', 'device': 'bridge', 'type': 'interface'}, {'index': '2', 'iface': 'ide', 'readonly': 'true', 'deviceId': '8c3179ac-b322-4f5c-9449-c52e3665e0ae', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '', 'type': 'disk'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller', 'deviceId': '21db0c6e-071c-48ff-b905-95478b37c384', 'address': {'slot': '0x04', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}}, {'device': 'usb', 'type': 'controller', 'deviceId': 'c0384f68-d0c9-4ebb-a779-8dc9911ce2f8', 'address': {'slot': '0x01', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x2'}}, {'device': 'ide', 'type': 'controller', 'deviceId': 'd5a2dd13-138a-482b-9bc3-994b10ec4100', 'address': {'slot': '0x01', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x1'}}, {'device': 'virtio-serial', 'type': 'controller', 'deviceId': '9e695172-c9b0-47df-bc76-8170219dec28', 'address': {'slot': '0x05', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}}]
    guestDiskMapping = {}
    vmType = kvm
    displaySecurePort = -1
    exitReason = 1
    memSize = 6000
    displayPort = -1
    clientIp =
    spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
    smp = 4
    displayIp = 0
    display = vnc
    exitCode = 1

systemctl status ovirt-ha-agent.service -l

● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-07-20 14:56:22 UTC; 2min 29s ago
 Main PID: 20236 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─20236 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jul 20 14:57:56 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jul 20 14:57:57 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jul 20 14:58:37 rhevserv ovirt-ha-agent[20236]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Jul 20 14:58:37 rhevserv ovirt-ha-agent[20236]: ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Error: 'Connection to storage server failed' - trying to restart agent
Jul 20 14:58:42 rhevserv ovirt-ha-agent[20236]: WARNING:ovirt_hosted_engine_ha.agent.agent.Agent:Restarting agent, attempt '2'
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found certificate common name: rhev.mydomain.com
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jul 20 14:58:44 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server

On Wed, Jul 20, 2016 at 5:01 PM, Alexis HAUSER <alexis.hauser@telecom-bretagne.eu> wrote:
Jul 20 14:57:56 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jul 20 14:57:57 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jul 20 14:58:37 rhevserv ovirt-ha-agent[20236]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Jul 20 14:58:37 rhevserv ovirt-ha-agent[20236]: ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Error: 'Connection to storage server failed' - trying to restart agent

^^^ The issue seems to be here: please make sure that the host can correctly connect to your storage server. Can you please attach the vdsm logs?

Jul 20 14:58:42 rhevserv ovirt-ha-agent[20236]: WARNING:ovirt_hosted_engine_ha.agent.agent.Agent:Restarting agent, attempt '2'
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found certificate common name: rhev.mydomain.com
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jul 20 14:58:43 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jul 20 14:58:44 rhevserv ovirt-ha-agent[20236]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
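
One way to check the storage connection suggested above, before digging into the vdsm logs, is to test name resolution and TCP reachability from the host. The following is only a minimal sketch, not part of oVirt: it assumes Python 3 (the host in this thread runs Python 2.7), the function names are made up for illustration, and port 2049 is assumed as the usual NFS port. An open port does not prove that an NFSv3 mount will succeed, since rpcbind and mountd are involved as well.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses hostname resolves to, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_storage_server(hostname, port=2049):
    """Summarize DNS resolution and TCP reachability for a storage server.

    port=2049 is an assumption (the usual NFS port), not taken from this thread.
    """
    addresses = resolve_host(hostname)
    reachable = bool(addresses) and tcp_port_open(hostname, port)
    return {"hostname": hostname, "addresses": addresses, "reachable": reachable}
```

Calling check_storage_server() with the FQDN used for the hosted-engine storage domain returns an empty "addresses" list when DNS is the problem, and "reachable": False when the name resolves but the server cannot be contacted.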

The issue seems to be here: please make sure that the host can correctly connect to your storage server. Can you please attach the vdsm logs?

Yes, actually I figured out it was a DNS problem: as the log messages I provided show, the host wasn't able to reach the NFS server where the engine storage is (it seems NFS is configured with an FQDN rather than an IP; I will fix that so it no longer depends on DNS).

This is my setup: only Em1 is plugged in, and it carries ovirtmgmt plus one other logical VLAN network. This VLAN network was set to DHCP and never had an IP, and everything was working fine. As soon as I added an IP address to that interface, the manager crashed. The host is actually trying to use that VLAN interface as the default route, I have no idea why, and that causes the DNS issue (one of the DNS servers was on another network, the second was on the same network... it should actually have worked anyway...).

The only way I found to resolve this was to ifdown that interface and run:

route add default gw <gateway-IP> ovirtmgmt

After that, I had errors like "unknown stale data" and "failed to reinitialize lockspace"; removing the lockfile with the hosted-engine command and manually removing the __DIRECT_IO__ file on the engine storage didn't fix it.

I actually found out what was happening: ovirt-ha-agent had errors in its status (with systemctl), ovirt-ha-broker had errors related to ha-agent, and vdsmd had errors related to those two services. I resolved my issue by restarting the services in the right order:

# systemctl restart ovirt-ha-agent.service
# systemctl restart ovirt-ha-broker.service
# systemctl restart vdsmd

Anyway, thanks for your answer. I hope this topic will help people with similar issues.
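
For anyone hitting the same symptom, the "wrong interface holds the default route" condition described above can be spotted by reading the kernel routing table. This is a rough sketch only, assuming Python 3 and a little-endian host such as x86 (where /proc/net/route stores addresses as little-endian hex); the function name, the VLAN interface name em1.100 and all addresses in the sample are made up for illustration.

```python
import socket
import struct


def default_routes(route_table_text):
    """Parse text in /proc/net/route format and return the default routes.

    Returns a list of (interface, gateway_ip, metric) tuples, lowest metric first.
    """
    routes = []
    for line in route_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        iface, dest, gateway, flags = fields[0], fields[1], fields[2], fields[3]
        metric, mask = fields[6], fields[7]
        # A default route has destination 0.0.0.0 and netmask 0.0.0.0.
        if dest != "00000000" or mask != "00000000":
            continue
        if not int(flags, 16) & 0x1:  # RTF_UP: ignore routes that are not up
            continue
        # Addresses are hex in host byte order (little-endian assumed here).
        gateway_ip = socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
        routes.append((iface, gateway_ip, int(metric)))
    return sorted(routes, key=lambda route: route[2])


# Hypothetical sample: a VLAN interface and ovirtmgmt both carry a default route.
SAMPLE = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "em1.100\t00000000\t0164A8C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
    "ovirtmgmt\t00000000\t0100000A\t0003\t0\t0\t10\t00000000\t0\t0\t0\n"
    "ovirtmgmt\t0000000A\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
)

for iface, gateway_ip, metric in default_routes(SAMPLE):
    print(iface, gateway_ip, metric)
```

On a real host the input would be the contents of /proc/net/route, for example default_routes(open("/proc/net/route").read()). If the first entry is not ovirtmgmt, traffic to DNS and to the NFS server may be leaving through the other interface.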
participants (2)
- Alexis HAUSER
- Simone Tiraboschi