[ovirt-users] [Users] Hosted Engine recovery failure of all HA - nodes

Daniel Helgenberger daniel.helgenberger at m-box.de
Wed Apr 9 13:39:36 UTC 2014


On Mi, 2014-04-09 at 14:42 +0200, Jiri Moskovcak wrote:
> On 04/09/2014 02:32 PM, Daniel Helgenberger wrote:
> > On Mi, 2014-04-09 at 09:18 +0200, Jiri Moskovcak wrote:
> >> On 04/08/2014 06:09 PM, Daniel Helgenberger wrote:
> >>> Hello,
> >>>
> >>> I have an oVirt 3.4 hosted engine lab setup witch I am evaluating for
> >>> production use.
> >>>
> >>> I "simulated" an ungraceful shutdown of all HA nodes (powercut) while
> >>> the engine was running. After powering up, the system did not recover
> >>> itself (it seemed).
> >>> I had to restart the ovirt-hosted-ha service (witch was in a locked
> >>> state) and then manually run 'hosted-engine --vm-start'.
> >>>
> >>> What is the supposed procedure after a shutdown (graceful / ungraceful)
> >>> of Hosted-Engine HA nodes? Should the engine recover by itself? Should
> >>> the running VM's be restarted automatically?
> >>
> >> When this happens the agent should start the engine VM and the engine
> >> should take care of restarting the VMs which were running on that
> >> restarted host and are marked as HA. Can you please provide contents ov
> >> /var/log/ovirt* from the host after the powercut when the engine VM
> >> doesn't come up?
> >>
> > Hello Jirka,
> >
> > I accidentally already send the message without pointing out the
> > interesting part; this is:
> >
> > <<< start logging ha-agent after reboot:
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainTMainThread::INFO::2014-04-08 15:53:33,862::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.1.2-1 started
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:33,936::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 192.168.50.201
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:33,937::hosted_engine::363::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:33,937::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '192.168.50.1'}
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:33,939::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139700911299600
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:33,939::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,013::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139700911300304
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,013::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,015::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139700911300112
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,015::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': 'e68a11c8-1251-4c13-9e3b-3847bbb4fa3d', 'address': '0'}
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,018::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139700911300240
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,018::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': 'e68a11c8-1251-4c13-9e3b-3847bbb4fa3d', 'address': '0'}
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,024::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139700723857104
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,024::hosted_engine::386::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:53:34,312::hosted_engine::430::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_cond_start_service) Starting vdsmd
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::CRITICAL::2014-04-08 15:53:34,442::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
> > (10 min nothing)
> > <<< here I did a 'service ovirt-hosted-ha start'
> > /var/log/ovirt-hosted-engine-ha/agent.log:MainThread::INFO::2014-04-08 15:59:16,698::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.1.2-1 started
> > ....
> >
> > after this things went quite smoothly.
> >
> 
> Hi Daniel,
> I noticed that in the log and I was just about to ask if that's when you 
> manually fixed it. Is there something else around that time in 
> /var/log/message which might be related to it?

No problem, here we go at the relevant time. I added some newlines for
the interesting parts - otherwise it is complete:

Apr  8 15:53:29 nodehv01 kernel: mboxdomain: port 1(eth1.1) entering forwarding state
Apr  8 15:53:29 nodehv01 rpc.statd[1924]: Version 1.2.3 starting
Apr  8 15:53:30 nodehv01 sm-notify[1925]: Version 1.2.3 starting
Apr  8 15:53:34 nodehv01 wdmd[2315]: wdmd started S0 H1 G179
Apr  8 15:53:34 nodehv01 wdmd[2315]: /dev/watchdog armed with fire_timeout 60
---
Apr  8 15:53:34 nodehv01 ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent CRITICAL Could not start ha-agent#012Traceback (most recent call last):#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run#012    self._run_agent()#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent#012    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 299, in start_monitoring#012    self._initialize_vdsm()#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 390, in _initialize_vdsm#012    self._cond_start_service('vdsmd')#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 439, in _cond_start_service#012    .format(service_name, res[1]))#012Exception: Could not start vdsmd: 
---
Apr  8 15:53:34 nodehv01 kernel: device-mapper: table: 253:2: multipath: error getting device
Apr  8 15:53:34 nodehv01 kernel: device-mapper: ioctl: error adding target to table
Apr  8 15:53:34 nodehv01 multipathd: 1ATA_ST3500418AS_6VMNMNV0: ignoring map
Apr  8 15:53:34 nodehv01 multipathd: path checkers start up
Apr  8 15:53:35 nodehv01 ntpd[2463]: ntpd 4.2.6p5 at 1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)
Apr  8 15:53:35 nodehv01 ntpd[2464]: proto: precision = 0.062 usec
Apr  8 15:53:35 nodehv01 ntpd[2464]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen and drop on 1 v6wildcard :: UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 2 lo 127.0.0.1 UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 3 ovirtmgmt 192.168.50.201 UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 4 lo ::1 UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 5 eth1.1 fe80::215:17ff:fe21:111f UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 6 eth1.100 fe80::215:17ff:fe21:111f UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 7 eth1.1000 fe80::215:17ff:fe21:111f UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 8 eth1 fe80::215:17ff:fe21:111f UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 9 mboxsan fe80::215:17ff:fe21:111f UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 10 mboxdomain fe80::215:17ff:fe21:111f UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 11 mboxdmz fe80::215:17ff:fe21:111f UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 12 eth0 fe80::215:17ff:fe21:111e UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listen normally on 13 ovirtmgmt fe80::215:17ff:fe21:111e UDP 123
Apr  8 15:53:35 nodehv01 ntpd[2464]: peers refreshed
Apr  8 15:53:35 nodehv01 ntpd[2464]: Listening on routing socket on fd #30 for interface updates
Apr  8 15:53:35 nodehv01 kernel: Loading iSCSI transport class v2.0-870.
Apr  8 15:53:35 nodehv01 ntpd[2464]: 0.0.0.0 c016 06 restart
Apr  8 15:53:35 nodehv01 ntpd[2464]: 0.0.0.0 c012 02 freq_set kernel 3.731 PPM
Apr  8 15:53:35 nodehv01 kernel: iscsi: registered transport (tcp)
Apr  8 15:53:35 nodehv01 kernel: iscsi: registered transport (iser)
Apr  8 15:53:35 nodehv01 kernel: libcxgbi:libcxgbi_init_module: tag itt 0x1fff, 13 bits, age 0xf, 4 bits.
Apr  8 15:53:35 nodehv01 kernel: libcxgbi:ddp_setup_host_page_size: system PAGE 4096, ddp idx 0.
Apr  8 15:53:35 nodehv01 kernel: Chelsio T3 iSCSI Driver cxgb3i v2.0.0 (Jun. 2010)
Apr  8 15:53:35 nodehv01 kernel: iscsi: registered transport (cxgb3i)
Apr  8 15:53:35 nodehv01 kernel: Chelsio T4 iSCSI Driver cxgb4i v0.9.1 (Aug. 2010)
Apr  8 15:53:35 nodehv01 kernel: iscsi: registered transport (cxgb4i)
Apr  8 15:53:35 nodehv01 kernel: cnic: Broadcom NetXtreme II CNIC Driver cnic v2.5.16 (Dec 05, 2012)
Apr  8 15:53:36 nodehv01 kernel: Broadcom NetXtreme II iSCSI Driver bnx2i v2.7.6.2 (Jun 06, 2013)
Apr  8 15:53:36 nodehv01 kernel: iscsi: registered transport (bnx2i)
Apr  8 15:53:36 nodehv01 kernel: iscsi: registered transport (be2iscsi)
Apr  8 15:53:36 nodehv01 kernel: In beiscsi_module_init, tt=ffffffffa0699e80
Apr  8 15:53:36 nodehv01 iscsid: iSCSI logger with pid=2619 started!
Apr  8 15:53:36 nodehv01 kernel: Ebtables v2.0 registered
---
Apr  8 15:53:37 nodehv01 ovirt-ha-broker mem_free.MemFree ERROR Failed to getVdsStats: VDSM initialization timeout
Apr  8 15:53:37 nodehv01 ovirt-ha-broker engine_health.CpuLoadNoEngine ERROR Failed to getVmStats: VDSM initialization timeout
Apr  8 15:53:37 nodehv01 ovirt-ha-broker cpu_load_no_engine.EngineHealth ERROR Failed to getVmStats: VDSM initialization timeout
Apr  8 15:53:37 nodehv01 ovirt-ha-broker mgmt_bridge.MgmtBridge ERROR Failed to getVdsCapabilities: VDSM initialization timeout
---
Apr  8 15:53:37 nodehv01 iscsid: iSCSI daemon with pid=2620 started!
Apr  8 15:53:37 nodehv01 kernel: lo: Disabled Privacy Extensions
Apr  8 15:53:38 nodehv01 kernel: Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Apr  8 15:53:38 nodehv01 kernel: bonding: bond4 is being created...
Apr  8 15:53:38 nodehv01 kernel: bonding: bond1 is being created...
Apr  8 15:53:38 nodehv01 kernel: bonding: bond2 is being created...
Apr  8 15:53:38 nodehv01 kernel: bonding: bond3 is being created...
Apr  8 15:53:40 nodehv01 kernel: device-mapper: table: 253:2: multipath: error getting device
Apr  8 15:53:40 nodehv01 kernel: device-mapper: ioctl: error adding target to table
Apr  8 15:53:40 nodehv01 vdsm vds WARNING Unable to load the json rpc server module. Please make sure it is installed.
Apr  8 15:53:42 nodehv01 ntpd[2464]: 0.0.0.0 c61c 0c clock_step +0.171138 s
Apr  8 15:53:42 nodehv01 ntpd[2464]: 0.0.0.0 c614 04 freq_mode
Apr  8 15:53:43 nodehv01 ntpd[2464]: 0.0.0.0 c618 08 no_sys_peer
---
Apr  8 15:59:16 nodehv01 vdsm vds ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012  File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo#012    stats = instance.get_all_stats()#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 85, in get_all_stats#012    path.get_metadata_path(self._config),#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 47, in get_metadata_path#012    return os.path.join(get_domain_path(config_),#012  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 40, in get_domain_path#012    .format(sd_uuid, parent))#012Exception: path to storage domain bcfa7ec4-5278-44d8-9f31-682f2d9de91d not found in /rhev/data-center/mnt
---
Apr  8 15:59:17 nodehv01 kernel: RPC: Registered named UNIX socket transport module.
Apr  8 15:59:17 nodehv01 kernel: RPC: Registered udp transport module.
Apr  8 15:59:17 nodehv01 kernel: RPC: Registered tcp transport module.
Apr  8 15:59:17 nodehv01 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Apr  8 15:59:17 nodehv01 kernel: FS-Cache: Loaded
Apr  8 15:59:17 nodehv01 kernel: NFS: Registering the id_resolver key type
Apr  8 15:59:17 nodehv01 kernel: FS-Cache: Netfs 'nfs' registered for caching

> 
> Thanks,
> Jirka
> 
> >> Thanks,
> >> Jirka
> >
> >
> >>
> >>>
> >>> Thanks,
> >>> Daniel
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Users mailing list
> >>> Users at ovirt.org
> >>> http://lists.ovirt.org/mailman/listinfo/users
> >>>
> >>
> >
> 

-- 

Daniel Helgenberger 
m box bewegtbild GmbH 

P: +49/30/2408781-22
F: +49/30/2408781-10

ACKERSTR. 19 
D-10115 BERLIN 


www.m-box.de  www.monkeymen.tv 

Geschäftsführer: Martin Retschitzegger / Michaela Göllner
Handeslregister: Amtsgericht Charlottenburg / HRB 112767 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 4380 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20140409/602fc7e2/attachment-0001.bin>


More information about the Users mailing list