Hi Jirka,
The patch works. It stabilized the status of my two hosts, and engine
migration during failover also works fine. Thanks, guys!
Jaicel
------------------------------------------------------------------------
From: "Jiri Moskovcak" <jmoskovc(a)redhat.com>
To: "Jaicel" <jaicel(a)asti.dost.gov.ph>
Cc: "Niels de Vos" <ndevos(a)redhat.com>, "Vijay Bellur" <vbellur(a)redhat.com>, users(a)ovirt.org, "Gluster Devel" <gluster-devel(a)gluster.org>
Sent: Monday, November 3, 2014 3:33:16 PM
Subject: Re: [ovirt-users] Hosted-Engine HA problem
On 11/01/2014 07:43 AM, Jaicel wrote:
> Hi,
>
> my engine runs on Host1. current status and agent logs below.
>
> Host 1
Hi,
it seems you ran into [1]. You can either zero out the metadata
file or apply the patch from [1] manually.
--Jirka
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1158925
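For reference, the AttributeError in the traceback quoted below is what Python raises when json.loads() returns None, which is how the JSON literal null decodes. A minimal sketch of a defensive parse follows, written in Python 3 syntax (so .items() instead of the .iteritems() seen in the Python 2.6 traceback). The function name and the fallback values are assumptions made for illustration only; this is not the patch from [1].

```python
import json


def parse_engine_status(status):
    """Hypothetical helper: decode an engine-status JSON string defensively."""
    try:
        decoded = json.loads(status)
    except (TypeError, ValueError):
        # TypeError: status was not a string at all (e.g. None).
        # ValueError: status was not valid JSON.
        decoded = None
    if not isinstance(decoded, dict):
        # The JSON literal "null" decodes to None, which has no .items().
        # The fallback values below are illustrative, not what the agent uses.
        return {"vm": "unknown", "health": "unknown",
                "detail": "unparseable metadata"}
    return {str(key): str(value) for key, value in decoded.items()}


if __name__ == "__main__":
    print(parse_engine_status("null"))
    print(parse_engine_status('{"vm": "up", "health": "good", "detail": "up"}'))
```

With a guard like this, a stale or empty remote metadata block would yield an "unknown" status rather than stopping the agent.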
>
> MainThread::INFO::2014-10-31 16:55:39,918::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.1.6 started
> MainThread::INFO::2014-10-31 16:55:39,985::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 192.168.12.11
> MainThread::INFO::2014-10-31 16:55:40,228::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
> MainThread::INFO::2014-10-31 16:55:40,228::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '192.168.12.254'}
> MainThread::INFO::2014-10-31 16:55:40,231::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634215107920
> MainThread::INFO::2014-10-31 16:55:40,231::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:40,237::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634215108432
> MainThread::INFO::2014-10-31 16:55:40,237::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:40,240::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 39956688
> MainThread::INFO::2014-10-31 16:55:40,240::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:40,243::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634215107664
> MainThread::INFO::2014-10-31 16:55:40,244::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:40,249::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634006879632
> MainThread::INFO::2014-10-31 16:55:40,249::hosted_engine::391::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
> MainThread::INFO::2014-10-31 16:55:40,298::hosted_engine::476::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.lockspace)
> MainThread::INFO::2014-10-31 16:55:40,322::state_machine::153::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
> MainThread::INFO::2014-10-31 16:55:40,322::state_machine::158::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host 192.168.12.12 (id 2): {'live-data': False, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1413882675 (Tue Oct 21 17:11:15 2014)\nhost-id=2\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 'hostname': '192.168.12.12', 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 'maintenance': False, 'host-ts': 1413882675}
> MainThread::INFO::2014-10-31 16:55:40,322::state_machine::161::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': None, 'bridge': True, 'mem-free': None, 'maintenance': False, 'cpu-load': None, 'gateway': True}
> MainThread::INFO::2014-10-31 16:55:40,323::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745740.32 type=state_transition detail=StartState-ReinitializeFSM hostname='ovirt1'
> MainThread::INFO::2014-10-31 16:55:40,392::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? ignored
> MainThread::INFO::2014-10-31 16:55:40,675::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)
> MainThread::INFO::2014-10-31 16:55:50,710::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745750.71 type=state_transition detail=ReinitializeFSM-EngineUp hostname='ovirt1'
> MainThread::INFO::2014-10-31 16:55:50,710::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineUp) sent? ignored
> MainThread::INFO::2014-10-31 16:55:51,001::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
> MainThread::CRITICAL::2014-10-31 16:56:01,033::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
>     self._run_agent()
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
>     hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 307, in start_monitoring
>     for old_state, state, delay in self.fsm:
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
>     new_data = self.refresh(self._state.data)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
>     stats.update(self.hosted_engine.collect_stats())
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 700, in collect_stats
>     stats = self.process_remote_metadata(host_id, remote_data)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 747, in process_remote_metadata
>     md['engine-status'] = engine_status(md["engine-status"])
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 79, in engine_status
>     in json.loads(status).iteritems()])
> AttributeError: 'NoneType' object has no attribute 'iteritems'
> [root@ovirt1 ~]# hosted-engine --vm-status
>
>
> --== Host 1 status ==--
>
> Status up-to-date : False
> Hostname : 192.168.12.11
> Host ID : 1
> Engine status : unknown stale-data
> Score : 2400
> Local maintenance : False
> Host timestamp : 1414745750
> Extra metadata (valid at timestamp):
> metadata_parse_version=1
> metadata_feature_version=1
> timestamp=1414745750 (Fri Oct 31 16:55:50 2014)
> host-id=1
> score=2400
> maintenance=False
> state=EngineUp
>
>
> --== Host 2 status ==--
>
> Status up-to-date : False
> Hostname : 192.168.12.12
> Host ID : 2
> Engine status : unknown stale-data
> Score : 2400
> Local maintenance : False
> Host timestamp : 1414745821
> Extra metadata (valid at timestamp):
> metadata_parse_version=1
> metadata_feature_version=1
> timestamp=1414745821 (Fri Oct 31 16:57:01 2014)
> host-id=2
> score=2400
> maintenance=False
> state=EngineStart
> [root@ovirt1 ~]# service ovirt-ha-agent status
> ovirt-ha-agent dead but subsys locked
>
> Host2
>
> MainThread::INFO::2014-10-31 16:55:59,642::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.1.6 started
> MainThread::INFO::2014-10-31 16:55:59,678::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 192.168.12.12
> MainThread::INFO::2014-10-31 16:55:59,918::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
> MainThread::INFO::2014-10-31 16:55:59,919::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '192.168.12.254'}
> MainThread::INFO::2014-10-31 16:55:59,922::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 25353488
> MainThread::INFO::2014-10-31 16:55:59,922::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:59,928::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 25354128
> MainThread::INFO::2014-10-31 16:55:59,928::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:59,931::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 25353552
> MainThread::INFO::2014-10-31 16:55:59,931::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:59,934::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139976608389584
> MainThread::INFO::2014-10-31 16:55:59,934::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
> MainThread::INFO::2014-10-31 16:55:59,939::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139976608447760
> MainThread::INFO::2014-10-31 16:55:59,939::hosted_engine::391::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
> MainThread::INFO::2014-10-31 16:55:59,983::hosted_engine::476::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 2 is acquired (file: /rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.lockspace)
> MainThread::INFO::2014-10-31 16:56:00,001::state_machine::153::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
> MainThread::INFO::2014-10-31 16:56:00,001::state_machine::158::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host 192.168.12.11 (id 1): {'live-data': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1414745750 (Fri Oct 31 16:55:50 2014)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': '192.168.12.11', 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1414745750}
> MainThread::INFO::2014-10-31 16:56:00,001::state_machine::161::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': None, 'bridge': True, 'mem-free': None, 'maintenance': False, 'cpu-load': None, 'gateway': True}
> MainThread::INFO::2014-10-31 16:56:00,002::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745760.0 type=state_transition detail=StartState-ReinitializeFSM hostname='ovirt2'
> MainThread::INFO::2014-10-31 16:56:00,045::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? ignored
> MainThread::INFO::2014-10-31 16:56:00,325::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)
> MainThread::INFO::2014-10-31 16:56:10,352::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745770.35 type=state_transition detail=ReinitializeFSM-EngineDown hostname='ovirt2'
> MainThread::INFO::2014-10-31 16:56:10,353::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? ignored
> MainThread::INFO::2014-10-31 16:56:10,638::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
> MainThread::INFO::2014-10-31 16:56:20,663::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
> MainThread::INFO::2014-10-31 16:56:20,663::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745780.66 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
> MainThread::INFO::2014-10-31 16:56:20,664::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
> MainThread::INFO::2014-10-31 16:56:20,943::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
> MainThread::INFO::2014-10-31 16:56:30,968::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
> MainThread::INFO::2014-10-31 16:56:30,969::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745790.97 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
> MainThread::INFO::2014-10-31 16:56:30,969::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
> MainThread::INFO::2014-10-31 16:56:31,248::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
> MainThread::INFO::2014-10-31 16:56:41,274::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
> MainThread::INFO::2014-10-31 16:56:41,275::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745801.28 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
> MainThread::INFO::2014-10-31 16:56:41,276::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
> MainThread::INFO::2014-10-31 16:56:41,555::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
> MainThread::INFO::2014-10-31 16:56:51,583::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
> MainThread::INFO::2014-10-31 16:56:51,584::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745811.58 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
> MainThread::INFO::2014-10-31 16:56:51,584::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
> MainThread::INFO::2014-10-31 16:56:51,864::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
> MainThread::INFO::2014-10-31 16:57:01,897::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (2400), attempting to start engine VM
> MainThread::INFO::2014-10-31 16:57:01,898::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745821.9 type=state_transition detail=EngineDown-EngineStart hostname='ovirt2'
> MainThread::INFO::2014-10-31 16:57:01,906::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? ignored
> MainThread::INFO::2014-10-31 16:57:02,189::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStart (score: 2400)
> MainThread::CRITICAL::2014-10-31 16:57:02,207::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
>     self._run_agent()
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
>     hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 307, in start_monitoring
>     for old_state, state, delay in self.fsm:
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
>     new_data = self.refresh(self._state.data)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
>     stats.update(self.hosted_engine.collect_stats())
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 662, in collect_stats
>     constants.SERVICE_TYPE)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
>     result = self._checked_communicate(request)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
>     .format(message or response))
> RequestError: Request failed: <type 'exceptions.OSError'>
>
> [root@ovirt2 ~]# hosted-engine --vm-status
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
>     "__main__", fname, loader, pkg_name)
>   File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
>     exec code in run_globals
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
>     if not status_checker.print_status():
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
>     all_host_stats = ha_cli.get_all_host_stats()
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 137, in get_all_host_stats
>     return self.get_all_stats(self.StatModes.HOST)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 86, in get_all_stats
>     constants.SERVICE_TYPE)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
>     result = self._checked_communicate(request)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
>     .format(message or response))
> ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed: <type 'exceptions.OSError'>
> [root@ovirt2 ~]# service ovirt-ha-agent status
> ovirt-ha-agent dead but subsys locked
>
>
> Thanks,
> Jaicel
>
> ----- Original Message -----
> From: "Jiri Moskovcak" <jmoskovc(a)redhat.com>
> To: "Jaicel" <jaicel(a)asti.dost.gov.ph>
> Cc: "Niels de Vos" <ndevos(a)redhat.com>, "Vijay Bellur" <vbellur(a)redhat.com>, users(a)ovirt.org, "Gluster Devel" <gluster-devel(a)gluster.org>
> Sent: Friday, October 31, 2014 11:05:32 PM
> Subject: Re: [ovirt-users] Hosted-Engine HA problem
>
> On 10/31/2014 10:26 AM, Jaicel wrote:
>> I've increased the limit and then restarted the agent and broker. The status
>> normalized, but it has now gone back to the "False" state, although both hosts
>> still have a score of 2400. The agent logs remain the same, with the
>> "ovirt-ha-agent dead but subsys locked" status. The ha-broker logs are below.
>>
>> Thread-138::INFO::2014-10-31 17:24:22,981::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
>> Thread-138::INFO::2014-10-31 17:24:22,991::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>> Thread-139::INFO::2014-10-31 17:24:38,385::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
>> Thread-139::INFO::2014-10-31 17:24:38,395::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>> Thread-140::INFO::2014-10-31 17:24:53,816::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
>> Thread-140::INFO::2014-10-31 17:24:53,827::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>> Thread-141::INFO::2014-10-31 17:25:09,172::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
>> Thread-141::INFO::2014-10-31 17:25:09,182::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>> Thread-142::INFO::2014-10-31 17:25:24,551::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
>> Thread-142::INFO::2014-10-31 17:25:24,562::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>>
>> Thanks,
>> Jaicel
>
> OK, now it seems that the broker runs fine, so I need the recent agent.log
> to debug it further.
>
> --Jirka
>
>>
>> ----- Original Message -----
>> From: "Jiri Moskovcak" <jmoskovc(a)redhat.com>
>> To: "Jaicel R. Sabonsolin" <jaicel(a)asti.dost.gov.ph>, "Niels de Vos" <ndevos(a)redhat.com>
>> Cc: "Vijay Bellur" <vbellur(a)redhat.com>, users(a)ovirt.org, "Gluster Devel" <gluster-devel(a)gluster.org>
>> Sent: Friday, October 31, 2014 4:32:02 PM
>> Subject: Re: [ovirt-users] Hosted-Engine HA problem
>>
>> On 10/31/2014 03:53 AM, Jaicel R. Sabonsolin wrote:
>>> Hi guys,
>>>
>>> These logs appear on both hosts, just like the result of --vm-status. I tried
>>> running tcpdump on the oVirt hosts and Gluster nodes, but only the packets
>>> exchanged with my monitoring VM (Zabbix) appeared.
>>>
>>> agent.log
>>>     new_data = self.refresh(self._state.data)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
>>>     stats.update(self.hosted_engine.collect_stats())
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 662, in collect_stats
>>>     constants.SERVICE_TYPE)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
>>>     result = self._checked_communicate(request)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
>>>     .format(message or response))
>>> RequestError: Request failed: <type 'exceptions.OSError'>
>>>
>>> broker.log
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
>>>     response = "success " + self._dispatch(data)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
>>>     .get_all_stats_for_service_type(**options)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
>>>     d = self.get_raw_stats_for_service_type(storage_dir, service_type)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
>>>     f = os.open(path, direct_flag | os.O_RDONLY)
>>> OSError: [Errno 24] Too many open files: '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
>>
>> - ah, there we go ^^^^^^ You might need to raise the limit of allowed
>> open files as described here [1], or find the app that keeps so many
>> files open.
>>
>>
>> --Jirka
>>
>> [1] http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-fi...
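To tell which of the two cases applies (limit too low, or a process holding too many descriptors), one can compare a process's open descriptor count with its limit. The following is a minimal sketch that assumes a Linux /proc filesystem; the helper names are made up for this example, and reading another process's entries normally requires root or the same user. Pass the broker's PID instead of the default "self" to inspect the broker.

```python
import os


def open_fd_count(pid="self"):
    """Count the file descriptors currently open in a process (Linux /proc)."""
    return len(os.listdir("/proc/{}/fd".format(pid)))


def nofile_limits(pid="self"):
    """Return (soft, hard) 'Max open files' limits as strings, or None."""
    with open("/proc/{}/limits".format(pid)) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: "Max", "open", "files", soft, hard, units.
                soft, hard = line.split()[3:5]
                return soft, hard
    return None


if __name__ == "__main__":
    print("open descriptors:", open_fd_count())
    print("soft/hard limit:", nofile_limits())
```

If the count keeps climbing toward the soft limit between restarts, raising the limit only delays the error and the leaking process is the thing to look at.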
>>
>>> Thread-38160::INFO::2014-10-31 10:28:37,989::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>>> Thread-38161::INFO::2014-10-31 10:28:53,656::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
>>> Thread-38161::ERROR::2014-10-31 10:28:53,657::listener::190::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'get-stats storage_dir=/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent service_type=hosted-engine'
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
>>>     response = "success " + self._dispatch(data)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
>>>     .get_all_stats_for_service_type(**options)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
>>>     d = self.get_raw_stats_for_service_type(storage_dir, service_type)
>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
>>>     f = os.open(path, direct_flag | os.O_RDONLY)
>>> OSError: [Errno 24] Too many open files: '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
>>> Thread-38161::INFO::2014-10-31 10:28:53,658::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>>>
>>> Thanks,
>>> Jaicel
>>>
>>> ----- Original Message -----
>>> From: "Niels de Vos" <ndevos(a)redhat.com>
>>> To: "Vijay Bellur" <vbellur(a)redhat.com>
>>> Cc: "Jiri Moskovcak" <jmoskovc(a)redhat.com>, "Jaicel R. Sabonsolin" <jaicel(a)asti.dost.gov.ph>, users(a)ovirt.org, "Gluster Devel" <gluster-devel(a)gluster.org>
>>> Sent: Friday, October 31, 2014 4:11:25 AM
>>> Subject: Re: [ovirt-users] Hosted-Engine HA problem
>>>
>>> On Thu, Oct 30, 2014 at 09:07:24PM +0530, Vijay Bellur wrote:
>>>> On 10/30/2014 06:45 PM, Jiri Moskovcak wrote:
>>>>> On 10/30/2014 09:22 AM, Jaicel R. Sabonsolin wrote:
>>>>>> Hi Guys,
>>>>>>
>>>>>> I need help with my oVirt Hosted-Engine HA setup. I am running 2
>>>>>> oVirt hosts and 2 Gluster nodes with replicated volumes. I already have
>>>>>> VMs running on my hosts, and they migrate normally when I, for example,
>>>>>> power off the host they are running on. The problem is that the
>>>>>> engine can't migrate when I switch off the host that hosts the engine.
>>>>>>
>>>>>> oVirt 3.4.3-1.el6
>>>>>> KVM 0.12.1.2 - 2.415.el6_5.10
>>>>>> LIBVIRT libvirt-0.10.2-29.el6_5.9
>>>>>> VDSM vdsm-4.14.17-0.el6
>>>>>>
>>>>>>
>>>>>> right now, i have this result from hosted-engine --vm-status.
>>>>>>
>>>>>>   File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
>>>>>>     "__main__", fname, loader, pkg_name)
>>>>>>   File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
>>>>>>     exec code in run_globals
>>>>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
>>>>>>     if not status_checker.print_status():
>>>>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
>>>>>>     all_host_stats = ha_cli.get_all_host_stats()
>>>>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 137, in get_all_host_stats
>>>>>>     return self.get_all_stats(self.StatModes.HOST)
>>>>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 86, in get_all_stats
>>>>>>     constants.SERVICE_TYPE)
>>>>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
>>>>>>     result = self._checked_communicate(request)
>>>>>>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
>>>>>>     .format(message or response))
>>>>>> ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed: <type 'exceptions.OSError'>
>>>>>>
>>>>>>
>>>>>> Restarting ha-broker and ha-agent normalizes the status, but eventually
>>>>>> it becomes "false" again and then returns to the result above. I hope
>>>>>> you guys can help me with this.
>>>>>>
>>>>>
>>>>> Hi Jaicel,
>>>>> please attach agent.log and broker.log from the host where you are trying to
>>>>> run hosted-engine --vm-status. I have a feeling that you ran into a
>>>>> known problem on Gluster: a stalled file descriptor. In that case the
>>>>> only known solution at this time is to restart the broker and agent, as you
>>>>> have already found out.
>>>>>
>>>>
>>>> Adding Niels and gluster-devel to troubleshoot from the Gluster NFS
>>>> perspective.
>>>
>>> I'd welcome any details on this "stalled file descriptor" problem. Is
>>> there a bug filed with some details like logs, sysrq-t and maybe even
>>> tcpdumps? If there is an easy way to reproduce this behaviour, I can
>>> surely look into it and hopefully come up with some advice or a fix.
>>>
>>> Thanks,
>>> Niels
>>>