[ovirt-users] Hosted engine reboots every hour

Roy Golan rgolan at redhat.com
Wed Jul 1 08:16:37 UTC 2015


On 07/01/2015 06:52 AM, Carles Costa wrote:
> Dear Experts,
>
> I am experiencing a problem with oVirt: every hour the hosted engine 
> shuts down and reboots. The engine-status value moves to 
> "unknown stale-data" every second minute, and then the hosted engine 
> is operational again 14 minutes after that. As far as I can see, the 
> scores remain at 2400 at all times, and it seems a liveliness 
> check is failing, but I am not able to find out why.
>
> Why does this problem happen exactly every hour?
> Why does the liveliness check fail?
>
> I would appreciate it if someone could shed some light on this. I am new 
> to oVirt, but I really like it so far.
>

Hi Carles, and welcome.

The agent will try to reach a servlet running on the engine VM at
http://{ENGINE_IP}/OvirtEngineWeb/HealthStatus
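
If you want to probe that servlet by hand from one of the hosts, here is a
minimal Python sketch. It assumes that an HTTP 200 answer alone means the
engine is healthy; the real agent may apply a stricter check, and the
function name engine_is_live is hypothetical.

```python
import http.client
import urllib.request

# Path quoted in the reply above; pass the engine VM's IP or FQDN as host.
HEALTH_PATH = "/OvirtEngineWeb/HealthStatus"


def engine_is_live(host: str, timeout: float = 10.0) -> bool:
    """Return True if the HealthStatus servlet answers with HTTP 200.

    Assumption: a 200 status alone means "healthy"; the real HA agent may
    apply a stricter check (for example on the response body).
    """
    url = f"http://{host}{HEALTH_PATH}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # urllib.error.URLError/HTTPError, refused connections and timeouts
        # are all OSError subclasses; HTTPException covers malformed replies.
        return False
```

Calling engine_is_live("your-engine-fqdn") in a loop around the top of the
hour should show whether the servlet stops answering before the agent
switches to EngineUpBadHealth.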

Also, a debug log will help if we can't resolve this. To enable it, change
the log level in these configuration files:
/etc/ovirt-hosted-engine-ha/broker-log.conf
/etc/ovirt-hosted-engine-ha/agent-log.conf
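
Switching those files to DEBUG can be scripted. This is a minimal sketch
that assumes both files use Python's logging.config.fileConfig INI layout
([logger_*] and [handler_*] sections); the helper name enable_debug_logging
is hypothetical. configparser drops comments when it rewrites a file, so
back up the originals first, and most likely you will need to restart
ovirt-ha-broker and ovirt-ha-agent for the new level to take effect.

```python
import configparser


def enable_debug_logging(path: str) -> list[str]:
    """Raise a fileConfig-style logging config to DEBUG, in place.

    Assumes the INI layout used by logging.config.fileConfig.
    Returns the names of the sections that were changed.
    """
    # interpolation=None keeps format strings such as %(asctime)s verbatim.
    cp = configparser.ConfigParser(interpolation=None)
    if not cp.read(path):
        raise FileNotFoundError(path)
    changed = []
    for section in cp.sections():
        # Every logger gets DEBUG; a handler is only touched if it already
        # sets its own level, because an INFO handler would filter DEBUG out.
        if section.startswith("logger_") or (
            section.startswith("handler_") and cp.has_option(section, "level")
        ):
            cp.set(section, "level", "DEBUG")
            changed.append(section)
    with open(path, "w") as fh:
        cp.write(fh)
    return changed
```

For example, enable_debug_logging("/etc/ovirt-hosted-engine-ha/agent-log.conf")
returns the list of sections it switched to DEBUG.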

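To check whether the EngineUp to EngineUpBadHealth transition really recurs
every hour, the state lines in agent.log can be filtered. This is a rough
sketch, assuming each record sits on a single line in the actual file (the
excerpt quoted below was wrapped by the mail client) and follows the
"Current state X (score: N)" format shown there; state_changes is a
hypothetical helper name.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# One agent.log record per line, e.g.
# MainThread::INFO::2015-07-01 11:03:43,018::hosted_engine::327::...::(start_monitoring) Current state EngineUpBadHealth (score: 2400)
STATE_RE = re.compile(
    r"::(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}::"
    r".*Current state (?P<state>\w+) \(score: (?P<score>\d+)\)"
)


def state_changes(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (timestamp, state, score) each time the agent's state changes.

    Consecutive records with the same state are collapsed into the first one.
    """
    previous: Optional[str] = None
    for line in lines:
        match = STATE_RE.search(line)
        if match and match.group("state") != previous:
            previous = match.group("state")
            yield match.group("ts"), previous, int(match.group("score"))
```

Passing it an open handle of /var/log/ovirt-hosted-engine-ha/agent.log
yields one tuple per state change, which should make the hourly pattern
easy to spot.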

> While the machine is down, I can see these messages in 
> /var/log/ovirt-hosted-engine-ha/broker.log:
>
> Thread-170803::INFO::2015-07-01 
> 11:06:47,335::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) 
> Connection established
> Thread-170803::INFO::2015-07-01 
> 11:06:47,342::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) 
> Connection closed
> Thread-170804::INFO::2015-07-01 
> 11:06:47,343::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) 
> Connection established
> Thread-170804::INFO::2015-07-01 
> 11:06:47,344::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) 
> Connection closed
> Thread-170805::INFO::2015-07-01 
> 11:06:47,344::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) 
> Connection established
> Thread-170805::INFO::2015-07-01 
> 11:06:47,346::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) 
> Connection closed
> Thread-170806::INFO::2015-07-01 
> 11:06:47,346::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) 
> Connection established
> Thread-170806::INFO::2015-07-01 
> 11:06:47,348::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) 
> Connection closed
> Thread-170807::INFO::2015-07-01 
> 11:06:47,348::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) 
> Connection established
> Thread-170807::INFO::2015-07-01 
> 11:06:47,350::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) 
> Connection closed
> Thread-7::INFO::2015-07-01 
> 11:06:50,394::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) 
> System load total=0.0095, engine=0.0046, non-engine=0.0049
> Thread-8::WARNING::2015-07-01 
> 11:06:50,464::engine_health::116::engine_health.CpuLoadNoEngine::(action) 
> bad health status: Hosted Engine is not up!
>
> and here is /var/log/ovirt-hosted-engine-ha/agent.log:
>
> MainThread::INFO::2015-07-01 
> 11:01:04,216::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUp (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:01:04,217::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
> MainThread::INFO::2015-07-01 
> 11:01:14,682::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUp (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:01:14,682::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
> MainThread::INFO::2015-07-01 
> 11:01:24,724::states::393::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) 
> Engine vm running on localhost
> MainThread::INFO::2015-07-01 
> 11:01:25,174::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUp (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:01:25,174::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
> MainThread::INFO::2015-07-01 
> 11:01:35,234::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) 
> Trying: notify time=1435719695.23 type=state_transition 
> detail=EngineUp-EngineUpBadHealth hostname='mc-place-compute-01-live.mc.mcon.net'
> MainThread::INFO::2015-07-01 
> 11:03:42,536::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) 
> Success, was notification of state_transition 
> (EngineUp-EngineUpBadHealth) sent? ignored
> MainThread::INFO::2015-07-01 
> 11:03:43,018::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUpBadHealth (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:03:43,018::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
> MainThread::INFO::2015-07-01 
> 11:03:53,060::state_machine::160::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) 
> Global metadata: {'maintenance': False}
> MainThread::INFO::2015-07-01 
> 11:03:53,060::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) 
> Host mc-place-compute-02-live.mc.mcon.net (id 2): {'extra': 
> 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=226246 (Wed Jul  1 11:02:11 2015)\nhost-id=2\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 
> 'hostname': 'mc-place-compute-02-live.mc.mcon.net', 'alive': True, 
> 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 
> 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 
> 'maintenance': False, 'host-ts': 226246}
> MainThread::INFO::2015-07-01 
> 11:03:53,060::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) 
> Host mc-place-compute-03-live.mc.mcon.net (id 3): {'extra': 
> 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=226256 (Wed Jul  1 11:02:15 2015)\nhost-id=3\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 
> 'hostname': 'mc-place-compute-03-live.mc.mcon.net', 'alive': True, 
> 'host-id': 3, 'engine-status': {'reason': 'vm not running on this host', 
> 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 
> 'maintenance': False, 'host-ts': 226256}
> MainThread::INFO::2015-07-01 
> 11:03:53,061::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) 
> Host mc-place-compute-04-live.mc.mcon.net (id 4): {'extra': 
> 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=226300 (Wed Jul  1 11:02:11 2015)\nhost-id=4\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 
> 'hostname': 'mc-place-compute-04-live.mc.mcon.net', 'alive': True, 
> 'host-id': 4, 'engine-status': {'reason': 'vm not running on this host', 
> 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 
> 'maintenance': False, 'host-ts': 226300}
> MainThread::INFO::2015-07-01 
> 11:03:53,061::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) 
> Local (id 1): {'engine-health': {'reason': 'failed liveliness check', 
> 'health': 'bad', 'vm': 'up', 'detail': 'up'}, 'bridge': True, 'mem-free': 
> 136637.0, 'maintenance': False, 'cpu-load': 0.0035, 'gateway': True}
> MainThread::ERROR::2015-07-01 
> 11:03:53,061::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) 
> Engine VM has bad health status, timeout in 300 seconds
> MainThread::INFO::2015-07-01 
> 11:03:53,081::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) 
> Timeout set to Wed Jul  1 11:08:53 2015 while transitioning <class 
> 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'> -> <class 
> 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'>
> MainThread::INFO::2015-07-01 
> 11:03:53,530::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUpBadHealth (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:03:53,530::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
> MainThread::ERROR::2015-07-01 
> 11:04:03,559::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) 
> Engine VM has bad health status, timeout in 289 seconds
> MainThread::INFO::2015-07-01 
> 11:04:03,980::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUpBadHealth (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:04:03,980::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
> MainThread::ERROR::2015-07-01 
> 11:04:14,007::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) 
> Engine VM has bad health status, timeout in 279 seconds
> MainThread::INFO::2015-07-01 
> 11:04:14,478::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUpBadHealth (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:04:14,478::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
> MainThread::ERROR::2015-07-01 
> 11:04:24,505::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) 
> Engine VM has bad health status, timeout in 268 seconds
> MainThread::INFO::2015-07-01 
> 11:04:24,994::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Current state EngineUpBadHealth (score: 2400)
> MainThread::INFO::2015-07-01 
> 11:04:24,994::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) 
> Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 240
>
>
> Best Regards
>
> Carles Cortes Costa
>
>
>
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

