[ovirt-users] hosted-engine HA: Engine dying unexpectedly

Martin Sivak msivak at redhat.com
Wed Oct 22 08:17:20 UTC 2014


Hi,

I think there is something weird going on with your storage. This is the crash snippet from the host that was running the engine at the beginning:

/var/log/vdsm/vdsm.log:Thread-162994::ERROR::2014-10-21 20:22:33,919::task::866::Storage.TaskManager.Task::(_setError) Task=`2ad31974-e1fc-4785-9423-ff3bd087a5aa`::Unexpected error
/var/log/vdsm/vdsm.log:Thread-162994::ERROR::2014-10-21 20:22:33,934::dispatcher::79::Storage.Dispatcher::(wrapper) Connection timed out
/var/log/vdsm/vdsm.log:Thread-62::ERROR::2014-10-21 20:23:00,733::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 68aad705-7c9b-427a-a84c-6f32f23675b3
/var/log/vdsm/vdsm.log:Thread-62::ERROR::2014-10-21 20:23:00,734::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 68aad705-7c9b-427a-a84c-6f32f23675b3
/var/log/vdsm/vdsm.log:VM Channels Listener::ERROR::2014-10-21 20:23:04,258::vmchannels::54::vds::(_handle_event) Received 00000011 on fileno 53
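
If you want to pull the same kind of snippet out of vdsm.log yourself, something along these lines might help. It is only a rough sketch: the field layout is inferred from the lines quoted above (thread::LEVEL::timestamp::module::line::logger::(function) message), and the names LINE_RE and errors_between are made up for this example, they are not part of VDSM.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the vdsm.log lines quoted in this mail:
#   <thread>::<LEVEL>::<YYYY-MM-DD HH:MM:SS,mmm>::<module>::<lineno>::<logger>::(<func>) <message>
LINE_RE = re.compile(
    r"^(?P<thread>.+?)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]*)\) (?P<msg>.*)$"
)


def errors_between(lines, start, end):
    """Yield (timestamp, logger, func, message) for ERROR lines with start <= timestamp <= end."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m or m.group("level") != "ERROR":
            # Skip continuation lines (tracebacks) and non-ERROR records.
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        if start <= ts <= end:
            yield ts, m.group("logger"), m.group("func"), m.group("msg")


# Example use (run on the host, path as in the snippet above):
#   with open("/var/log/vdsm/vdsm.log") as f:
#       for rec in errors_between(f, datetime(2014, 10, 21, 20, 20),
#                                 datetime(2014, 10, 21, 20, 30)):
#           print(*rec)
```

Running it with the same time window on both hosts makes it easier to see whether the storage errors line up.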

The second host's VDSM lost its connection to the storage domain at the same time:

20:23:09,950::states::437::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host 192.168.50.201 (id 1)
20:23:12,365::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING

The engine VM was restarted right after the connection was restored:

20:25:54,336::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
20:26:20,572::hosted_engine::571::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 2
20:26:20,572::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
20:26:20,572::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.50.201 (id: 1, score: 2400)
20:26:30,606::states::459::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (2400), attempting to start engine VM

...

20:27:34,423::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
20:27:34,430::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1413916054.43 type=state_transition detail=EngineStarting-EngineUp hostname='nodehv02.lab.mbox.loc'
20:27:34,498::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? sent
20:27:38,481::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)

All was then well till the end of the log.

20:29:53,393::states::394::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
20:29:55,372::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
20:29:55,372::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.50.201 (id: 1, score: 0)


According to the log, hosted engine had nothing to do with the engine crash. On the contrary, it properly restarted the VM once the cluster recovered from the storage issue.
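
To follow the state changes without reading the whole agent log, a small filter like the one below could be used. Again, this is just a sketch under assumptions: it relies on the "Current state <State> (score: <N>)" wording seen in the start_monitoring lines above, and STATE_RE and state_timeline are names invented for this example, not something shipped with ovirt-hosted-engine-ha.

```python
import re

# Matches the (start_monitoring) lines quoted above, e.g.
#   20:26:20,572::hosted_engine::327::...::(start_monitoring) Current state EngineDown (score: 2400)
# A full date in front of the time, if present in the real log, does not break the match.
STATE_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}),\d{3}::.*\(start_monitoring\) "
    r"Current state (?P<state>\w+) \(score: (?P<score>\d+)\)"
)


def state_timeline(lines):
    """Return [(time, state, score)], keeping only entries where state or score changed."""
    timeline = []
    for line in lines:
        m = STATE_RE.search(line)
        if not m:
            continue
        entry = (m.group("state"), int(m.group("score")))
        # Drop repeats so that only the transitions remain.
        if not timeline or timeline[-1][1:] != entry:
            timeline.append((m.group("time"),) + entry)
    return timeline
```

Fed with the agent log from the second host, this should show the EngineDown to EngineUp change seen above and any later change in state or score.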

Can you give us more information about the setup? Storage type, topology, ...

--
Martin Sivák
msivak at redhat.com
Red Hat Czech
RHEV-M SLA / Brno, CZ

----- Original Message -----
> Hello,
> 
> since upgrading to the latest hosted-engine-ha I have the following problem:
> 
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
> Engine vm died unexpectedly
> 
> I suppose HA is forcing the engine down because the liveliness check is
> failing. I attached a log compilation from the latest incident, 2014-10-21
> 16:26:31,836. The 'host' logs are from the host the engine was running
> on; host2 is the other HA host.
> Interestingly, this only happens when I am connected via a VNC console
> to one of my Windows 2012 VMs.
> 
> 
> How can I further debug this?
> The engine log seems empty and also the HE does not seem to have any
> trouble when this happens. As a precaution / test, I set my cluster to
> global maintenance.
> 
> Thanks,
> 
> vdsm-python-zombiereaper-4.16.7-1.gitdb83943.el6.noarch
> vdsm-xmlrpc-4.16.7-1.gitdb83943.el6.noarch
> vdsm-4.16.7-1.gitdb83943.el6.x86_64
> vdsm-python-4.16.7-1.gitdb83943.el6.noarch
> vdsm-yajsonrpc-4.16.7-1.gitdb83943.el6.noarch
> vdsm-jsonrpc-4.16.7-1.gitdb83943.el6.noarch
> vdsm-cli-4.16.7-1.gitdb83943.el6.noarch
> 
> ovirt-hosted-engine-ha-1.2.4-1.el6.noarch
> ovirt-release35-001-1.noarch
> ovirt-host-deploy-1.3.0-1.el6.noarch
> ovirt-hosted-engine-setup-1.2.1-1.el6.noarch
> ovirt-release34-1.0.3-1.noarch
> ovirt-engine-sdk-python-3.5.0.7-1.el6.noarch
> 
> 
> --
> Daniel Helgenberger
> m box bewegtbild GmbH
> 
> P: +49/30/2408781-22
> F: +49/30/2408781-10
> 
> ACKERSTR. 19
> D-10115 BERLIN
> 
> 
> www.m-box.de  www.monkeymen.tv
> 
> Geschäftsführer: Martin Retschitzegger / Michaela Göllner
> Handelsregister: Amtsgericht Charlottenburg / HRB 112767
> 
> 
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>


