[ovirt-users] Hosted engine reboots every hour

Carles Costa Carles.Costa at mcon.net
Wed Jul 1 03:52:01 UTC 2015


Dear Experts,

I am experiencing a problem with oVirt: every hour the hosted engine shuts down and reboots. The engine-status value moves to "unknown stale-data" every second minute, and the hosted engine becomes operational again 14 minutes after that. As far as I can see, the scores remain at 2400 at all times, and it seems that a liveliness check is failing, but I cannot find out why.

Why does this problem occur exactly every hour?
Why does the liveliness check fail?

I would appreciate it if someone could shed some light on this. I am new to oVirt, but I really like it so far.

While the machine is down, I can see these messages in /var/log/ovirt-hosted-engine-ha/broker.log:

Thread-170803::INFO::2015-07-01 11:06:47,335::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-170803::INFO::2015-07-01 11:06:47,342::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-170804::INFO::2015-07-01 11:06:47,343::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-170804::INFO::2015-07-01 11:06:47,344::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-170805::INFO::2015-07-01 11:06:47,344::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-170805::INFO::2015-07-01 11:06:47,346::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-170806::INFO::2015-07-01 11:06:47,346::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-170806::INFO::2015-07-01 11:06:47,348::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-170807::INFO::2015-07-01 11:06:47,348::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
Thread-170807::INFO::2015-07-01 11:06:47,350::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
Thread-7::INFO::2015-07-01 11:06:50,394::cpu_load_no_engine::121::cpu_load_no_engine.EngineHealth::(calculate_load) System load total=0.0095, engine=0.0046, non-engine=0.0049
Thread-8::WARNING::2015-07-01 11:06:50,464::engine_health::116::engine_health.CpuLoadNoEngine::(action) bad health status: Hosted Engine is not up!

And here is /var/log/ovirt-hosted-engine-ha/agent.log:

MainThread::INFO::2015-07-01 11:01:04,216::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-07-01 11:01:04,217::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
MainThread::INFO::2015-07-01 11:01:14,682::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-07-01 11:01:14,682::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
MainThread::INFO::2015-07-01 11:01:24,724::states::393::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
MainThread::INFO::2015-07-01 11:01:25,174::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2015-07-01 11:01:25,174::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
MainThread::INFO::2015-07-01 11:01:35,234::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1435719695.23 type=state_transition detail=EngineUp-EngineUpBadHealth hostname='mc-place-compute-01-live.mc.mcon.net'
MainThread::INFO::2015-07-01 11:03:42,536::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineUpBadHealth) sent? ignored
MainThread::INFO::2015-07-01 11:03:43,018::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 2400)
MainThread::INFO::2015-07-01 11:03:43,018::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
MainThread::INFO::2015-07-01 11:03:53,060::state_machine::160::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2015-07-01 11:03:53,060::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host mc-place-compute-02-live.mc.mcon.net (id 2): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=226246 (Wed Jul  1 11:02:11 2015)\nhost-id=2\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 'hostname': 'mc-place-compute-02-live.mc.mcon.net', 'alive': True, 'host-id': 2, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 'maintenance': False, 'host-ts': 226246}
MainThread::INFO::2015-07-01 11:03:53,060::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host mc-place-compute-03-live.mc.mcon.net (id 3): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=226256 (Wed Jul  1 11:02:15 2015)\nhost-id=3\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 'hostname': 'mc-place-compute-03-live.mc.mcon.net', 'alive': True, 'host-id': 3, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 'maintenance': False, 'host-ts': 226256}
MainThread::INFO::2015-07-01 11:03:53,061::state_machine::165::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host mc-place-compute-04-live.mc.mcon.net (id 4): {'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=226300 (Wed Jul  1 11:02:11 2015)\nhost-id=4\nscore=2400\nmaintenance=False\nstate=EngineDown\n', 'hostname': 'mc-place-compute-04-live.mc.mcon.net', 'alive': True, 'host-id': 4, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}, 'score': 2400, 'maintenance': False, 'host-ts': 226300}
MainThread::INFO::2015-07-01 11:03:53,061::state_machine::168::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 1): {'engine-health': {'reason': 'failed liveliness check', 'health': 'bad', 'vm': 'up', 'detail': 'up'}, 'bridge': True, 'mem-free': 136637.0, 'maintenance': False, 'cpu-load': 0.0035, 'gateway': True}
MainThread::ERROR::2015-07-01 11:03:53,061::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 300 seconds
MainThread::INFO::2015-07-01 11:03:53,081::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Wed Jul  1 11:08:53 2015 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'>
MainThread::INFO::2015-07-01 11:03:53,530::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 2400)
MainThread::INFO::2015-07-01 11:03:53,530::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
MainThread::ERROR::2015-07-01 11:04:03,559::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 289 seconds
MainThread::INFO::2015-07-01 11:04:03,980::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 2400)
MainThread::INFO::2015-07-01 11:04:03,980::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
MainThread::ERROR::2015-07-01 11:04:14,007::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 279 seconds
MainThread::INFO::2015-07-01 11:04:14,478::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 2400)
MainThread::INFO::2015-07-01 11:04:14,478::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 2400)
MainThread::ERROR::2015-07-01 11:04:24,505::states::562::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 268 seconds
MainThread::INFO::2015-07-01 11:04:24,994::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 2400)
MainThread::INFO::2015-07-01 11:04:24,994::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host mc-place-compute-02-live.mc.mcon.net (id: 2, score: 240
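
To make excerpts like the ones above easier to scan, here is a minimal Python sketch that keeps only the WARNING and ERROR entries. It assumes the field layout visible in these excerpts (thread::LEVEL::timestamp::module::line::logger::(function) message); that layout is inferred from the pasted lines, not taken from the ovirt-hosted-engine-ha documentation, and the names parse_line and problems are hypothetical helpers, not part of oVirt.

```python
import re
from datetime import datetime

# Field layout inferred from the log excerpts (an assumption):
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]*)\) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return record


def problems(lines):
    """Yield parsed records whose level is WARNING or ERROR."""
    for line in lines:
        record = parse_line(line)
        if record is not None and record["level"] in ("WARNING", "ERROR"):
            yield record


if __name__ == "__main__":
    # Two lines copied from the excerpts above, used as sample input.
    sample = [
        "MainThread::INFO::2015-07-01 11:01:24,724::states::393::"
        "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
        "(consume) Engine vm running on localhost",
        "MainThread::ERROR::2015-07-01 11:03:53,061::states::562::"
        "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
        "(consume) Engine VM has bad health status, timeout in 300 seconds",
    ]
    for rec in problems(sample):
        print(rec["ts"], rec["level"], rec["message"])
```

To run it against the real file, pass an open file object instead of the sample list, for example problems(open("/var/log/ovirt-hosted-engine-ha/agent.log")). Lines that were wrapped by the mail client will not match until they are rejoined.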


Best Regards

Carles Cortes Costa

