[ovirt-users] HE in bad status, will not start following storage issue - HELP

Martin Sivak msivak at redhat.com
Fri Mar 10 10:39:39 UTC 2017


Hi Ian,

It is normal that the VDSM hosts compete for the lock; one of them should
win, though. If that is not the case, the lockspace might be corrupted or
the sanlock daemons might not be able to reach it.
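
To rule out the daemon side, you can also make sure sanlock itself is up
and look at its recent messages on each host (standard systemd commands,
nothing oVirt specific):

# systemctl status sanlock
# journalctl -u sanlock --since today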

I would recommend putting the cluster into global maintenance and
attempting a manual start using:

# hosted-engine --set-maintenance --mode=global
# hosted-engine --vm-start
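
After that, the result can be followed from any host with the read-only
status query:

# hosted-engine --vm-status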

If that does not work, you will need to check your storage connectivity
and the sanlock status on all hosts:

# sanlock client status

There are a couple of locks I would expect to be there (ha_agent, spm),
but no lock for the hosted engine disk should be visible.
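
To confirm that every host can actually read the hosted-engine storage
domain, you can repeat by hand the same direct-I/O read that VDSM performs
(the path below is copied from the vdsm.log excerpt further down in this
thread; substitute the mount path of your own hosted-engine domain):

# dd if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct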

Next steps depend on whether you have important VMs running on the
cluster and on the Gluster status (I can't help you there
unfortunately).

Best regards

--
Martin Sivak
SLA / oVirt


On Fri, Mar 10, 2017 at 7:37 AM, Ian Neilsen <ian.neilsen at gmail.com> wrote:
> I just noticed this in the vdsm logs. It looks like the agent is trying to
> start the hosted engine on both machines??
>
> <on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash></domain>
> Thread-7517::ERROR::2017-03-10 01:26:13,053::vm::773::virt.vm::(_startUnderlyingVm) vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::The vm start process failed
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
>     self._run()
>   File "/usr/share/vdsm/virt/vm.py", line 2026, in _run
>     self._connection.createXML(domxml, flags),
>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
>     ret = f(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in wrapper
>     return func(inst, *args, **kwargs)
>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in createXML
>     if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
> libvirtError: Failed to acquire lock: Permission denied
>
> INFO::2017-03-10 01:26:13,054::vm::1330::virt.vm::(setDownStatus)
> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Changed state to Down: Failed
> to acquire lock: Permission denied (code=1)
> INFO::2017-03-10 01:26:13,054::guestagent::430::virt.vm::(stop)
> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Stopping connection
>
> DEBUG::2017-03-10 01:26:13,054::vmchannels::238::vds::(unregister) Delete
> fileno 56 from listener.
> DEBUG::2017-03-10 01:26:13,055::vmchannels::66::vds::(_unregister_fd) Failed
> to unregister FD from epoll (ENOENT): 56
> DEBUG::2017-03-10 01:26:13,055::__init__::209::jsonrpc.Notification::(emit)
> Sending event {"params": {"2419f9fe-4998-4b7a-9fe9-151571d20379": {"status":
> "Down", "exitReason": 1, "exitMessage": "Failed to acquire lock: Permission
> denied", "exitCode": 1}, "notify_time": 4308740560}, "jsonrpc": "2.0",
> "method": "|virt|VM_status|2419f9fe-4998-4b7a-9fe9-151571d20379"}
> VM Channels Listener::DEBUG::2017-03-10
> 01:26:13,475::vmchannels::142::vds::(_do_del_channels) fileno 56 was removed
> from listener.
> DEBUG::2017-03-10 01:26:14,430::check::296::storage.check::(_start_process)
> START check
> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
> cmd=['/usr/bin/taskset', '--cpu-list', '0-39', '/usr/bin/dd',
> u'if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata',
> 'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct'] delay=0.00
> DEBUG::2017-03-10 01:26:14,481::asyncevent::564::storage.asyncevent::(reap)
> Process <cpopen.CPopen object at 0x3ba6550> terminated (count=1)
> DEBUG::2017-03-10
> 01:26:14,481::check::327::storage.check::(_check_completed) FINISH check
> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
> rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n300 bytes (300 B)
> copied, 8.7603e-05 s, 3.4 MB/s\n') elapsed=0.06
>
>
> On 10 March 2017 at 10:40, Ian Neilsen <ian.neilsen at gmail.com> wrote:
>>
>> Hi All
>>
>> I had a storage issue with my gluster volumes running under oVirt hosted
>> engine. I now cannot start the hosted engine VM with
>> "hosted-engine --vm-start".
>> I've scoured the net for a way to fix this, but can't seem to find
>> anything concrete.
>>
>> Running CentOS 7, oVirt 4.0 and Gluster 3.8.9
>>
>> How do I recover the engine manager? I'm at a loss!
>>
>> Engine status = the score was 0 on all nodes; now node 1 is reading
>> 3400, but all the others are still 0.
>>
>> {"reason": "bad vm status", "health": "bad", "vm": "down", "detail":
>> "down"}
>>
>>
>> Logs from agent.log
>> ==================
>>
>> INFO::2017-03-09
>> 19:32:52,600::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>> Global maintenance detected
>> INFO::2017-03-09
>> 19:32:52,603::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm)
>> Initializing VDSM
>> INFO::2017-03-09
>> 19:32:54,820::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images)
>> Connecting the storage
>> INFO::2017-03-09
>> 19:32:54,821::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>> Connecting storage server
>> INFO::2017-03-09
>> 19:32:59,194::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>> Connecting storage server
>> INFO::2017-03-09
>> 19:32:59,211::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
>> Refreshing the storage domain
>> INFO::2017-03-09
>> 19:32:59,328::hosted_engine::666::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images)
>> Preparing images
>> INFO::2017-03-09
>> 19:32:59,328::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images)
>> Preparing images
>> INFO::2017-03-09
>> 19:33:01,748::hosted_engine::669::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images)
>> Reloading vm.conf from the shared storage domain
>> INFO::2017-03-09
>> 19:33:01,748::config::206::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file)
>> Trying to get a fresher copy of vm configuration from the OVF_STORE
>> WARNING::2017-03-09
>> 19:33:04,056::ovf_store::107::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan)
>> Unable to find OVF_STORE
>> ERROR::2017-03-09
>> 19:33:04,058::config::235::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file)
>> Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
>>
>> ovirt-ha-agent logs
>> ================
>>
>> ovirt-ha-agent
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR Unable
>> to get vm.conf from OVF_STORE, falling back to initial vm.conf
>>
>> vdsm
>> ======
>>
>> vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof
>>
>> ovirt-ha-broker
>> ============
>>
>> ovirt-ha-broker cpu_load_no_engine.EngineHealth ERROR Failed to
>> getVmStats: 'pid'
>>
>> --
>> Ian Neilsen
>>
>> Mobile: 0424 379 762
>> Linkedin: http://au.linkedin.com/in/ianneilsen
>> Twitter : ineilsen
>
>
>
>
> --
> Ian Neilsen
>
> Mobile: 0424 379 762
> Linkedin: http://au.linkedin.com/in/ianneilsen
> Twitter : ineilsen
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>

