
On 11/11/2014 05:56 AM, Jaicel wrote:
Hi Jirka,
The patch works. It stabilized the status of my two hosts, and engine migration during failover also works fine. Thanks, guys!
Jaicel
Hi Jaicel, I'm glad it works for you! Enjoy the hosted engine ;)
--Jirka
------------------------------------------------------------------------
From: "Jiri Moskovcak" <jmoskovc@redhat.com>
To: "Jaicel" <jaicel@asti.dost.gov.ph>
Cc: "Niels de Vos" <ndevos@redhat.com>, "Vijay Bellur" <vbellur@redhat.com>, users@ovirt.org, "Gluster Devel" <gluster-devel@gluster.org>
Sent: Monday, November 3, 2014 3:33:16 PM
Subject: Re: [ovirt-users] Hosted-Engine HA problem
On 11/01/2014 07:43 AM, Jaicel wrote:
Hi,
My engine runs on Host1. The current status and agent logs are below.
Host 1
Hi, it seems like you ran into [1]. You can either zero out the metadata file or apply the patch from [1] manually.
--Jirka
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1158925
MainThread::INFO::2014-10-31 16:55:39,918::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.1.6 started
MainThread::INFO::2014-10-31 16:55:39,985::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 192.168.12.11
MainThread::INFO::2014-10-31 16:55:40,228::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2014-10-31 16:55:40,228::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '192.168.12.254'}
MainThread::INFO::2014-10-31 16:55:40,231::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634215107920
MainThread::INFO::2014-10-31 16:55:40,231::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:40,237::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634215108432
MainThread::INFO::2014-10-31 16:55:40,237::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:40,240::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 39956688
MainThread::INFO::2014-10-31 16:55:40,240::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:40,243::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634215107664
MainThread::INFO::2014-10-31 16:55:40,244::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:40,249::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140634006879632
MainThread::INFO::2014-10-31 16:55:40,249::hosted_engine::391::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2014-10-31 16:55:40,298::hosted_engine::476::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.lockspace)
MainThread::INFO::2014-10-31 16:55:40,322::state_machine::153::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2014-10-31 16:55:40,322::state_machine::158::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host 192.168.12.12 (id 2): {'live-data': False, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1413882675 (Tue Oct 21 17:11:15 2014)\nhost-id=2\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 'hostname': '192.168.12.12', 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 'maintenance': False, 'host-ts': 1413882675}
MainThread::INFO::2014-10-31 16:55:40,322::state_machine::161::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': None, 'bridge': True, 'mem-free': None, 'maintenance': False, 'cpu-load': None, 'gateway': True}
MainThread::INFO::2014-10-31 16:55:40,323::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745740.32 type=state_transition detail=StartState-ReinitializeFSM hostname='ovirt1'
MainThread::INFO::2014-10-31 16:55:40,392::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? ignored
MainThread::INFO::2014-10-31 16:55:40,675::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2014-10-31 16:55:50,710::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745750.71 type=state_transition detail=ReinitializeFSM-EngineUp hostname='ovirt1'
MainThread::INFO::2014-10-31 16:55:50,710::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineUp) sent? ignored
MainThread::INFO::2014-10-31 16:55:51,001::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::CRITICAL::2014-10-31 16:56:01,033::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 307, in start_monitoring
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 700, in collect_stats
    stats = self.process_remote_metadata(host_id, remote_data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 747, in process_remote_metadata
    md['engine-status'] = engine_status(md["engine-status"])
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 79, in engine_status
    in json.loads(status).iteritems()])
AttributeError: 'NoneType' object has no attribute 'iteritems'

[root@ovirt1 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date    : False
Hostname             : 192.168.12.11
Host ID              : 1
Engine status        : unknown stale-data
Score                : 2400
Local maintenance    : False
Host timestamp       : 1414745750
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1414745750 (Fri Oct 31 16:55:50 2014)
    host-id=1
    score=2400
    maintenance=False
    state=EngineUp

--== Host 2 status ==--

Status up-to-date    : False
Hostname             : 192.168.12.12
Host ID              : 2
Engine status        : unknown stale-data
Score                : 2400
Local maintenance    : False
Host timestamp       : 1414745821
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1414745821 (Fri Oct 31 16:57:01 2014)
    host-id=2
    score=2400
    maintenance=False
    state=EngineStart

[root@ovirt1 ~]# service ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked
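For readers following the Host 1 traceback above: it ends in engine_status() calling json.loads(status).iteritems(), and an AttributeError on 'NoneType' is what you get when json.loads() decodes the JSON literal null. The sketch below reproduces that failure mode and shows one defensive way to handle it. It is not the patch from the bug report; the function name parse_engine_status and the fallback dictionary are assumptions made for illustration, and it is written for Python 3 (items() instead of the Python 2 iteritems()).

```python
import json

# Hypothetical fallback value, not taken from ovirt-hosted-engine-ha.
UNKNOWN_STATUS = {"vm": "unknown", "health": "unknown", "detail": "unknown"}


def parse_engine_status(raw):
    """Decode an engine-status JSON string, tolerating null or garbage."""
    try:
        decoded = json.loads(raw)
    except (TypeError, ValueError):
        # raw was not a string, or it was not valid JSON.
        return dict(UNKNOWN_STATUS)
    if not isinstance(decoded, dict):
        # json.loads("null") returns None; calling .items() on it would
        # raise AttributeError, the same failure mode as in the traceback.
        return dict(UNKNOWN_STATUS)
    return {str(key): str(value) for key, value in decoded.items()}


print(parse_engine_status('{"health": "good", "vm": "up", "detail": "up"}'))
print(parse_engine_status("null"))
```

Returning a placeholder instead of raising keeps a monitoring loop alive when one host's metadata is unreadable. Whether that is the right policy for the real agent is a separate question that the bug report addresses.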
Host2
MainThread::INFO::2014-10-31 16:55:59,642::agent::52::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.1.6 started
MainThread::INFO::2014-10-31 16:55:59,678::hosted_engine::223::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: 192.168.12.12
MainThread::INFO::2014-10-31 16:55:59,918::hosted_engine::367::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2014-10-31 16:55:59,919::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '192.168.12.254'}
MainThread::INFO::2014-10-31 16:55:59,922::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 25353488
MainThread::INFO::2014-10-31 16:55:59,922::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:59,928::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 25354128
MainThread::INFO::2014-10-31 16:55:59,928::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mem-free, options {'use_ssl': 'true', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:59,931::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 25353552
MainThread::INFO::2014-10-31 16:55:59,931::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:59,934::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139976608389584
MainThread::INFO::2014-10-31 16:55:59,934::brokerlink::126::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '41d4aff1-54e1-4946-a812-2e656bb7d3f9', 'address': '0'}
MainThread::INFO::2014-10-31 16:55:59,939::brokerlink::137::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 139976608447760
MainThread::INFO::2014-10-31 16:55:59,939::hosted_engine::391::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2014-10-31 16:55:59,983::hosted_engine::476::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 2 is acquired (file: /rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.lockspace)
MainThread::INFO::2014-10-31 16:56:00,001::state_machine::153::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2014-10-31 16:56:00,001::state_machine::158::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host 192.168.12.11 (id 1): {'live-data': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=1414745750 (Fri Oct 31 16:55:50 2014)\nhost-id=1\nscore=2400\nmaintenance=False\nstate=EngineUp\n', 'hostname': '192.168.12.11', 'host-id': 1, 'engine-status': {'health': 'good', 'vm': 'up', 'detail': 'up'}, 'score': 2400, 'maintenance': False, 'host-ts': 1414745750}
MainThread::INFO::2014-10-31 16:56:00,001::state_machine::161::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': None, 'bridge': True, 'mem-free': None, 'maintenance': False, 'cpu-load': None, 'gateway': True}
MainThread::INFO::2014-10-31 16:56:00,002::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745760.0 type=state_transition detail=StartState-ReinitializeFSM hostname='ovirt2'
MainThread::INFO::2014-10-31 16:56:00,045::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (StartState-ReinitializeFSM) sent? ignored
MainThread::INFO::2014-10-31 16:56:00,325::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2014-10-31 16:56:10,352::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745770.35 type=state_transition detail=ReinitializeFSM-EngineDown hostname='ovirt2'
MainThread::INFO::2014-10-31 16:56:10,353::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? ignored
MainThread::INFO::2014-10-31 16:56:10,638::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-10-31 16:56:20,663::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2014-10-31 16:56:20,663::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745780.66 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
MainThread::INFO::2014-10-31 16:56:20,664::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-10-31 16:56:20,943::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-10-31 16:56:30,968::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2014-10-31 16:56:30,969::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745790.97 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
MainThread::INFO::2014-10-31 16:56:30,969::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-10-31 16:56:31,248::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-10-31 16:56:41,274::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2014-10-31 16:56:41,275::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745801.28 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
MainThread::INFO::2014-10-31 16:56:41,276::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-10-31 16:56:41,555::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-10-31 16:56:51,583::states::441::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::INFO::2014-10-31 16:56:51,584::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745811.58 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2'
MainThread::INFO::2014-10-31 16:56:51,584::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-10-31 16:56:51,864::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-10-31 16:57:01,897::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (2400), attempting to start engine VM
MainThread::INFO::2014-10-31 16:57:01,898::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1414745821.9 type=state_transition detail=EngineDown-EngineStart hostname='ovirt2'
MainThread::INFO::2014-10-31 16:57:01,906::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? ignored
MainThread::INFO::2014-10-31 16:57:02,189::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStart (score: 2400)
MainThread::CRITICAL::2014-10-31 16:57:02,207::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 307, in start_monitoring
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 662, in collect_stats
    constants.SERVICE_TYPE)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
    .format(message or response))
RequestError: Request failed: <type 'exceptions.OSError'>

[root@ovirt2 ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 137, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 86, in get_all_stats
    constants.SERVICE_TYPE)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
    .format(message or response))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed: <type 'exceptions.OSError'>

[root@ovirt2 ~]# service ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked
Thanks, Jaicel
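One way to read the two "(refresh)" entries in the agent logs above is to compare each remote 'host-ts' with the local notify time logged a moment later. The sketch below does that arithmetic with values copied from the logs. The helper name metadata_age and the pairing of each host-ts with the nearest notify time are assumptions for illustration; the threshold the agent actually uses to set 'live-data' is not shown in this thread.

```python
from datetime import timedelta


def metadata_age(now_ts, host_ts):
    """Return how old a remote host's metadata is, as a timedelta."""
    return timedelta(seconds=now_ts - host_ts)


# Host 1's view of Host 2 ('live-data': False in the Host 1 log).
age_host2 = metadata_age(1414745740.32, 1413882675)
# Host 2's view of Host 1 ('live-data': True in the Host 2 log).
age_host1 = metadata_age(1414745760.0, 1414745750)

print("Host 2 metadata age as seen by Host 1:", age_host2)
print("Host 1 metadata age as seen by Host 2:", age_host1)
```

With these inputs the first age is just under ten days and the second is ten seconds, which is consistent with Host 1 treating Host 2's record as stale.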
----- Original Message -----
From: "Jiri Moskovcak" <jmoskovc@redhat.com>
To: "Jaicel" <jaicel@asti.dost.gov.ph>
Cc: "Niels de Vos" <ndevos@redhat.com>, "Vijay Bellur" <vbellur@redhat.com>, users@ovirt.org, "Gluster Devel" <gluster-devel@gluster.org>
Sent: Friday, October 31, 2014 11:05:32 PM
Subject: Re: [ovirt-users] Hosted-Engine HA problem
On 10/31/2014 10:26 AM, Jaicel wrote:
I've increased the limit and then restarted the agent and broker. The status normalized, but it has now gone back to the "False" state, although both hosts still have a score of 2400. The agent logs remain the same, with the "ovirt-ha-agent dead but subsys locked" status. The ha-broker logs are below.
Thread-138::INFO::2014-10-31 17:24:22,981::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-138::INFO::2014-10-31 17:24:22,991::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-139::INFO::2014-10-31 17:24:38,385::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-139::INFO::2014-10-31 17:24:38,395::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-140::INFO::2014-10-31 17:24:53,816::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-140::INFO::2014-10-31 17:24:53,827::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-141::INFO::2014-10-31 17:25:09,172::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-141::INFO::2014-10-31 17:25:09,182::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-142::INFO::2014-10-31 17:25:24,551::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-142::INFO::2014-10-31 17:25:24,562::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thanks, Jaicel
OK, now it seems that the broker runs fine, so I need the recent agent.log to debug it further.
--Jirka
----- Original Message -----
From: "Jiri Moskovcak" <jmoskovc@redhat.com>
To: "Jaicel R. Sabonsolin" <jaicel@asti.dost.gov.ph>, "Niels de Vos" <ndevos@redhat.com>
Cc: "Vijay Bellur" <vbellur@redhat.com>, users@ovirt.org, "Gluster Devel" <gluster-devel@gluster.org>
Sent: Friday, October 31, 2014 4:32:02 PM
Subject: Re: [ovirt-users] Hosted-Engine HA problem
On 10/31/2014 03:53 AM, Jaicel R. Sabonsolin wrote:
Hi guys,
These logs appear on both hosts, just like the result of --vm-status. I tried to tcpdump on the ovirt hosts and gluster nodes, but only packet exchanges with my monitoring VM (zabbix) appeared.

agent.log
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 662, in collect_stats
    constants.SERVICE_TYPE)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
    .format(message or response))
RequestError: Request failed: <type 'exceptions.OSError'>

broker.log
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
    response = "success " + self._dispatch(data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
    .get_all_stats_for_service_type(**options)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
    d = self.get_raw_stats_for_service_type(storage_dir, service_type)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
    f = os.open(path, direct_flag | os.O_RDONLY)
OSError: [Errno 24] Too many open files: '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
- ah, there we go ^^^^^^ You might need to raise the limit of allowed open files as described here [1], or find out which app keeps so many files open.
--Jirka
[1] http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files...
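To tell the two possibilities mentioned above apart (a limit that is too low, or a process holding too many files open), it can help to compare a process's descriptor limit with the number of descriptors it currently holds. The sketch below does this for the current Python process using the standard resource module and the Linux /proc filesystem. The helper names are made up for illustration; inspecting the broker itself would mean passing its PID and, most likely, running as root.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open descriptors via /proc (Linux only); None if unavailable."""
    try:
        return len(os.listdir("/proc/%s/fd" % pid))
    except OSError:
        # No /proc, no such process, or no permission to inspect it.
        return None


soft, hard = fd_limits()
print("soft limit:", soft, "hard limit:", hard)
print("descriptors open in this process:", open_fd_count())
```

If the count for a long-running process keeps climbing toward the soft limit between checks, raising the limit only delays Errno 24, and the leak itself is the thing to chase.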
Thread-38160::INFO::2014-10-31 10:28:37,989::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-38161::INFO::2014-10-31 10:28:53,656::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-38161::ERROR::2014-10-31 10:28:53,657::listener::190::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'get-stats storage_dir=/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent service_type=hosted-engine'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
    response = "success " + self._dispatch(data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
    .get_all_stats_for_service_type(**options)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
    d = self.get_raw_stats_for_service_type(storage_dir, service_type)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
    f = os.open(path, direct_flag | os.O_RDONLY)
OSError: [Errno 24] Too many open files: '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
Thread-38161::INFO::2014-10-31 10:28:53,658::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thanks, Jaicel
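Errno 24 at an os.open() call means the process had no free descriptor slots left. That can happen when descriptors are opened without being closed on every code path, although this thread does not establish whether that was the case here. As a general illustration only, the sketch below reads a file through os.open() and guarantees that the descriptor is released. The function name read_block is hypothetical, and the direct_flag from the traceback (presumably O_DIRECT) is left out because direct I/O needs aligned buffers.

```python
import os
import tempfile


def read_block(path, size=4096):
    """Read up to `size` bytes from path, always closing the descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, size)
    finally:
        # Runs on success and on error, so the descriptor cannot leak.
        os.close(fd)


# Usage: write a small temporary file and read it back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"score=2400\n")
    tmp_path = tmp.name

print(read_block(tmp_path))
os.remove(tmp_path)
```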
----- Original Message -----
From: "Niels de Vos" <ndevos@redhat.com>
To: "Vijay Bellur" <vbellur@redhat.com>
Cc: "Jiri Moskovcak" <jmoskovc@redhat.com>, "Jaicel R. Sabonsolin" <jaicel@asti.dost.gov.ph>, users@ovirt.org, "Gluster Devel" <gluster-devel@gluster.org>
Sent: Friday, October 31, 2014 4:11:25 AM
Subject: Re: [ovirt-users] Hosted-Engine HA problem
On Thu, Oct 30, 2014 at 09:07:24PM +0530, Vijay Bellur wrote:
On 10/30/2014 06:45 PM, Jiri Moskovcak wrote:
On 10/30/2014 09:22 AM, Jaicel R. Sabonsolin wrote:
> Hi Guys,
>
> I need help with my ovirt Hosted-Engine HA setup. I am running on 2
> ovirt hosts and 2 gluster nodes with replicated volumes. i already have
> VMs running on my hosts and they can migrate normally once i for example
> power off the host that they are running on. the problem is that the
> engine can't migrate once i switch off the host that hosts the engine.
>
> oVirt 3.4.3-1.el6
> KVM 0.12.1.2 - 2.415.el6_5.10
> LIBVIRT libvirt-0.10.2-29.el6_5.9
> VDSM vdsm-4.14.17-0.el6
>
> right now, i have this result from hosted-engine --vm-status.
>
>   File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
>     "__main__", fname, loader, pkg_name)
>   File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
>     exec code in run_globals
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
>     if not status_checker.print_status():
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
>     all_host_stats = ha_cli.get_all_host_stats()
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 137, in get_all_host_stats
>     return self.get_all_stats(self.StatModes.HOST)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 86, in get_all_stats
>     constants.SERVICE_TYPE)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
>     result = self._checked_communicate(request)
>   File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
>     .format(message or response))
> ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed:
> <type 'exceptions.OSError'>
>
> restarting ha-broker and ha-agent normalizes the status but eventually
> it would become "false" and then return to the result above. hope you
> guys could help me with this.
>
Hi Jaicel, please attach agent.log and broker.log from the host where you are trying to run hosted-engine --vm-status. I have a feeling that you ran into a known problem on gluster, a stalled file descriptor. In that case the only known solution at this time is to restart the broker & agent, as you have already found out.
Adding Niels and gluster-devel to troubleshoot from the Gluster NFS perspective.
I'd welcome any details on this "stalled file descriptor" problem. Is there a bug filed with some details like logs, sysrq-t and maybe even tcpdumps? If there is an easy way to reproduce this behaviour, I can surely look into it and hopefully come up with some advice or a fix.
Thanks, Niels