HE in bad status, will not start following storage issue - HELP

Hi All,

I had a storage issue with my gluster volumes running under oVirt hosted engine. I now cannot start the hosted engine manager VM with "hosted-engine --vm-start". I've scoured the net to find a way, but can't seem to find anything concrete.

Running CentOS 7, oVirt 4.0 and Gluster 3.8.9.

How do I recover the engine manager? I'm at a loss!

Engine status: the score was 0 on all nodes; now node 1 is reading 3400, but all the others are still 0.

{"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}

Logs from agent.log
==================
INFO::2017-03-09 19:32:52,600::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
INFO::2017-03-09 19:32:52,603::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
INFO::2017-03-09 19:32:54,820::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
INFO::2017-03-09 19:32:54,821::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
INFO::2017-03-09 19:32:59,194::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
INFO::2017-03-09 19:32:59,211::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
INFO::2017-03-09 19:32:59,328::hosted_engine::666::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
INFO::2017-03-09 19:32:59,328::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
INFO::2017-03-09 19:33:01,748::hosted_engine::669::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Reloading vm.conf from the shared storage domain
INFO::2017-03-09 19:33:01,748::config::206::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Trying to get a fresher copy of vm configuration from the OVF_STORE
WARNING::2017-03-09 19:33:04,056::ovf_store::107::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Unable to find OVF_STORE
ERROR::2017-03-09 19:33:04,058::config::235::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf

ovirt-ha-agent logs
================
ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf

vdsm
======
vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof

ovirt-ha-broker
============
ovirt-ha-broker cpu_load_no_engine.EngineHealth ERROR Failed to getVmStats: 'pid'

--
Ian Neilsen
Mobile: 0424 379 762
Linkedin: http://au.linkedin.com/in/ianneilsen
Twitter : ineilsen
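For reference, the score and the JSON health blob above come from the hosted-engine HA agent's status reporting. A quick way to compare what each host currently reports (a minimal check, assuming the standard oVirt 4.0 service names) is to run, on every host:

# hosted-engine --vm-status
  (per-host score, maintenance mode and engine VM health)
# systemctl status ovirt-ha-agent ovirt-ha-broker
  (both services must be running for the score and metadata to be refreshed)

A host whose agent or broker is down stops updating its metadata, so the other hosts see it as stale-data with a score of 0.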

I just noticed this in the vdsm logs. The agent looks like it is trying to start the hosted engine on both machines??

<on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash></domain>
Thread-7517::ERROR::2017-03-10 01:26:13,053::vm::773::virt.vm::(_startUnderlyingVm) vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 2026, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in createXML
    if ret is None: raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Failed to acquire lock: Permission denied
INFO::2017-03-10 01:26:13,054::vm::1330::virt.vm::(setDownStatus) vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Changed state to Down: Failed to acquire lock: Permission denied (code=1)
INFO::2017-03-10 01:26:13,054::guestagent::430::virt.vm::(stop) vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Stopping connection
DEBUG::2017-03-10 01:26:13,054::vmchannels::238::vds::(unregister) Delete fileno 56 from listener.
DEBUG::2017-03-10 01:26:13,055::vmchannels::66::vds::(_unregister_fd) Failed to unregister FD from epoll (ENOENT): 56
DEBUG::2017-03-10 01:26:13,055::__init__::209::jsonrpc.Notification::(emit) Sending event {"params": {"2419f9fe-4998-4b7a-9fe9-151571d20379": {"status": "Down", "exitReason": 1, "exitMessage": "Failed to acquire lock: Permission denied", "exitCode": 1}, "notify_time": 4308740560}, "jsonrpc": "2.0", "method": "|virt|VM_status|2419f9fe-4998-4b7a-9fe9-151571d20379"}
VM Channels Listener::DEBUG::2017-03-10 01:26:13,475::vmchannels::142::vds::(_do_del_channels) fileno 56 was removed from listener.
DEBUG::2017-03-10 01:26:14,430::check::296::storage.check::(_start_process) START check u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata' cmd=['/usr/bin/taskset', '--cpu-list', '0-39', '/usr/bin/dd', u'if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata', 'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct'] delay=0.00
DEBUG::2017-03-10 01:26:14,481::asyncevent::564::storage.asyncevent::(reap) Process <cpopen.CPopen object at 0x3ba6550> terminated (count=1)
DEBUG::2017-03-10 01:26:14,481::check::327::storage.check::(_check_completed) FINISH check u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata' rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n300 bytes (300 B) copied, 8.7603e-05 s, 3.4 MB/s\n') elapsed=0.06
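The "Failed to acquire lock: Permission denied" is raised by libvirt when the sanlock lock manager cannot acquire the VM lease, so ownership and permissions on the hosted-engine lockspace are worth checking before anything else. A rough sketch, with <server>, <engine_volume> and <sd_uuid> as placeholders for the hosted-engine storage domain (which may be a different gluster volume than the _data one in the log above):

# ls -l /rhev/data-center/mnt/glusterSD/<server>:_<engine_volume>/<sd_uuid>/ha_agent/
  (hosted-engine.lockspace and hosted-engine.metadata should be owned by vdsm:kvm, mode 0660;
   on recent releases these are usually symlinks into the images/ directory, so check the targets too)
# sanlock client status
  (lists the lockspaces and resources sanlock currently holds on this host)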

Hi Ian,

it is normal that the VDSMs are competing for the lock; one should win, though. If that is not the case, then the lockspace might be corrupted or the sanlock daemons can't reach it.

I would recommend putting the cluster into global maintenance and attempting a manual start using:

# hosted-engine --set-maintenance --mode=global
# hosted-engine --vm-start

You will need to check your storage connectivity and sanlock status on all hosts if that does not work:

# sanlock client status

There are a couple of locks I would expect to be there (ha_agent, spm), but no lock for the hosted engine disk should be visible.

Next steps depend on whether you have important VMs running on the cluster and on the Gluster status (I can't help you there, unfortunately).

Best regards

--
Martin Sivak
SLA / oVirt
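For the "check your storage connectivity" part, a minimal set of checks on each host might look like the following (the volume name is a placeholder; substitute the gluster volume that backs the hosted-engine storage domain):

# df -h /rhev/data-center/mnt/glusterSD/
  (the gluster mounts must actually be present and responsive)
# gluster volume status <engine_volume>
  (all bricks should be online)
# gluster volume heal <engine_volume> info
  (look for unsynced entries or split-brain on the hosted-engine images)
# sanlock client status
  (compare across hosts; see the note above about which locks to expect)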

On Fri, Mar 10, 2017 at 12:39 PM, Martin Sivak <msivak@redhat.com> wrote:
I would recommend putting the cluster to global maintenance and attempting a manual start using:
# hosted-engine --set-maintenance --mode=global
# hosted-engine --vm-start
Is that possible? See also: http://lists.ovirt.org/pipermail/users/2016-January/036993.html
-- Didi

I've checked the IDs in /rhev/data-center/mnt/glusterSD/*...../dom_md/

# -rw-rw----. 1 vdsm kvm 1048576 Mar 12 05:14 ids

which seems OK.

sanlock.log is showing:
---------------------------
r14 acquire_token open error -13
r14 cmd_acquire 2,11,89283 acquire_token -13

Now I'm not quite sure which direction to take.

Lockspace
---------------
"hosted-engine --reinitialize-lockspace" is throwing an exception:

Exception: Lockfile reset cannot be performed with an active agent.

@didi - I am in "Global Maintenance". I just noticed that host 1 now shows:

Engine status: unknown stale-data
state= AgentStopped

I'm pretty sure I've been able to start the engine VM while in global maintenance. But you raise a good question. I don't see why you would be restricted from running the engine, or even starting the VM, while in global maintenance. If so, that is a little backwards.
--
Ian Neilsen
Mobile: 0424 379 762
Linkedin: http://au.linkedin.com/in/ianneilsen
Twitter : ineilsen
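Two notes on the output above. sanlock's "open error -13" is -EACCES, i.e. the same permission problem surfacing inside sanlock, so ownership and permissions on the lockspace files themselves are still suspect even though dom_md/ids looks fine. And the "Lockfile reset cannot be performed with an active agent" exception means the ovirt-ha-agent service should be stopped (on all hosts, to be safe) before the lockspace can be reinitialized. A possible sequence, sketched while staying in global maintenance:

On every host:
# systemctl stop ovirt-ha-agent ovirt-ha-broker

Then, on one host only:
# hosted-engine --reinitialize-lockspace

Then, on every host:
# systemctl start ovirt-ha-agent ovirt-ha-broker
# hosted-engine --vm-status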
participants (3)
- Ian Neilsen
- Martin Sivak
- Yedidyah Bar David