On Mon, Aug 22, 2022 at 9:52 AM <markeczzz(a)gmail.com> wrote:
Hi!
In the last few days I have been having a problem with the Hosted Engine: it keeps restarting.
Sometimes after a few minutes, sometimes after a few hours.
I haven't made any changes to oVirt or the network in that time.
The version is 4.4.10.7-1.el8 (this was also the installation version).
Here are the logs:
Agent.log------------------------------
MainThread::INFO::2022-08-21
09:48:36,200::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 2440)
2440 is pretty low - did you check what lowers it, other than the
single failed network test below?
MainThread::INFO::2022-08-21
09:48:36,200::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Best remote host
node3.ovirt.example.org (id: 3, score: 2440)
MainThread::ERROR::2022-08-21
09:48:46,212::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Host
node3.ovirt.example.org (id 3) score is significantly better than local score,
shutting down VM on this host
MainThread::INFO::2022-08-21
09:48:46,641::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineUp-EngineStop) sent? ignored
MainThread::INFO::2022-08-21
09:48:46,706::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStop (score: 3400)
MainThread::INFO::2022-08-21
09:48:46,706::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Best remote host
node3.ovirt.example.org (id: 3, score: 3400)
MainThread::INFO::2022-08-21
09:48:56,714::hosted_engine::934::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2022-08-21
09:48:56,871::hosted_engine::941::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stdout:
MainThread::INFO::2022-08-21
09:48:56,871::hosted_engine::942::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
stderr:
MainThread::ERROR::2022-08-21
09:48:56,871::hosted_engine::950::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm)
Engine VM stopped on localhost
MainThread::INFO::2022-08-21
09:48:56,880::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
Timeout set to Sun Aug 21 09:53:56 2022 while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class
'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2022-08-21
09:48:56,959::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStop (score: 3400)
MainThread::INFO::2022-08-21
09:49:06,977::states::537::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine vm not running on local host
MainThread::INFO::2022-08-21
09:49:06,983::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
Timeout cleared while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class
'ovirt_hosted_engine_ha.agent.states.EngineDown'>
MainThread::INFO::2022-08-21
09:49:07,173::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineStop-EngineDown) sent? ignored
MainThread::INFO::2022-08-21
09:49:07,795::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineDown (score: 3400)
MainThread::INFO::2022-08-21
09:49:16,811::states::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2022-08-21
09:49:16,998::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineStart) sent? ignored
MainThread::INFO::2022-08-21
09:49:17,179::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStart (score: 3400)
MainThread::INFO::2022-08-21
09:49:17,195::hosted_engine::895::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
Ensuring VDSM state is clear for engine VM
MainThread::INFO::2022-08-21
09:49:17,200::hosted_engine::915::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
Cleaning state for non-running VM
MainThread::INFO::2022-08-21
09:49:18,211::hosted_engine::907::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state)
Vdsm state for VM clean
MainThread::INFO::2022-08-21
09:49:18,212::hosted_engine::853::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2022-08-21
09:49:18,814::hosted_engine::862::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
stdout: VM in WaitForLaunch
MainThread::INFO::2022-08-21
09:49:18,814::hosted_engine::863::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
stderr: Command VM.getStats with args {'vmID':
'cc7931ff-8124-4724-9242-abea2ab5bf42'} failed:
(code=1, message=Virtual machine does not exist: {'vmId':
'cc7931ff-8124-4724-9242-abea2ab5bf42'})
MainThread::INFO::2022-08-21
09:49:18,814::hosted_engine::875::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
Engine VM started on localhost
MainThread::INFO::2022-08-21
09:49:18,999::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineStart-EngineStarting) sent? ignored
MainThread::INFO::2022-08-21
09:49:19,008::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21
09:49:29,027::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
VM is powering up..
MainThread::INFO::2022-08-21
09:49:29,033::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
Timeout set to Sun Aug 21 09:59:29 2022 while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
'ovirt_hosted_engine_ha.agent.states.EngineStarting'>
MainThread::INFO::2022-08-21
09:49:29,109::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21
09:49:38,121::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
VM is powering up..
MainThread::INFO::2022-08-21
09:49:38,195::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21
09:49:48,218::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
Timeout cleared while transitioning <class
'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
'ovirt_hosted_engine_ha.agent.states.EngineUp'>
MainThread::INFO::2022-08-21
09:49:48,403::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored
MainThread::INFO::2022-08-21
09:49:48,713::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 3400)
MainThread::INFO::2022-08-21
09:49:58,725::states::406::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine vm running on localhost
Broker.log------------------------------
Thread-4::INFO::2022-08-21
09:47:59,342::cpu_load_no_engine::142::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
System load total=0.0241, engine=0.0013, non-engine=0.0228
Thread-3::INFO::2022-08-21 09:48:01,311::mem_free::51::mem_free.MemFree::(action)
memFree: 96106
Thread-5::INFO::2022-08-21
09:48:05,612::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is
up on this host with healthy engine
Thread-2::INFO::2022-08-21
09:48:08,591::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in
up state
Thread-1::WARNING::2022-08-21 09:48:10,352::network::121::network.Network::(_dns) DNS
query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5
+tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:10,352::network::92::network.Network::(action) Failed
to verify network status, (4 out of 5)
Thread-3::INFO::2022-08-21 09:48:11,389::mem_free::51::mem_free.MemFree::(action)
memFree: 96089
Thread-5::INFO::2022-08-21
09:48:15,707::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is
up on this host with healthy engine
Thread-2::INFO::2022-08-21
09:48:18,662::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in
up state
Thread-1::WARNING::2022-08-21 09:48:18,879::network::121::network.Network::(_dns) DNS
query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5
+tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-3::INFO::2022-08-21 09:48:21,467::mem_free::51::mem_free.MemFree::(action)
memFree: 96072
Thread-1::WARNING::2022-08-21 09:48:24,904::network::121::network.Network::(_dns) DNS
query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5
+tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-5::INFO::2022-08-21
09:48:25,808::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is
up on this host with healthy engine
Thread-2::INFO::2022-08-21
09:48:28,740::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in
up state
Thread-1::WARNING::2022-08-21 09:48:30,416::network::121::network.Network::(_dns) DNS
query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5
+tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:30,416::network::92::network.Network::(action) Failed
to verify network status, (2 out of 5)
Thread-3::INFO::2022-08-21 09:48:31,545::mem_free::51::mem_free.MemFree::(action)
memFree: 96064
Thread-5::INFO::2022-08-21
09:48:35,909::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is
up on this host with healthy engine
Thread-1::WARNING::2022-08-21 09:48:35,940::network::121::network.Network::(_dns) DNS
query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5
+tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:37,480::network::92::network.Network::(action) Failed
to verify network status, (4 out of 5)
Thread-2::INFO::2022-08-21
09:48:38,809::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in
up state
Thread-3::INFO::2022-08-21 09:48:41,623::mem_free::51::mem_free.MemFree::(action)
memFree: 96014
Thread-1::INFO::2022-08-21 09:48:42,549::network::88::network.Network::(action)
Successfully verified network status
Thread-5::INFO::2022-08-21
09:48:46,011::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is
up on this host with healthy engine
Listener::ERROR::2022-08-21
09:48:46,639::notifications::42::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email)
(530, b'5.7.1 Authentication required', 'alerts(a)example.org.hr')
At first I thought it was related to these bugs:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5M...
https://bugzilla.redhat.com/show_bug.cgi?id=1984356
But that bug should already be fixed in this oVirt version.
Indeed - as you can see in your log snippet above, dig was called with '+tcp'.
I tried monitoring the network, but this error keeps happening even
when the network load is low.
It might be something else.
I tried running continuous dig and ping commands on VMs running on the
same host as the Hosted Engine, and did not see any network problems, not
even a single connection drop.
What exactly do you test?
Before fixing the above bug, we added loops of 'dig' to
ovirt-system-tests, and did see drops - not many, but apparently
enough, and often.
The 'dig' test is not very configurable from ovirt-hosted-engine-ha's
point of view, but you do have control over it from elsewhere -
resolv.conf, your name server, etc. Also, note that it runs 'dig'
without passing a query, and the default query is for '.' - the root
zone - perhaps your name server has some problem with that?
Any solutions or next steps I should try?
You can configure the agent/broker to log at DEBUG level, to see some
more details.
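The agent and broker read Python-logging-style config files for this (on
my hosts they live under /etc/ovirt-hosted-engine-ha/ as agent-log.conf
and broker-log.conf - treat the exact paths and section names as an
assumption and check your own install). Raising the level is a one-line
change, e.g.:

```ini
; Sketch of an agent-log.conf / broker-log.conf fragment (assumed
; section name - adjust to what your file actually contains).
[logger_root]
level=DEBUG
handlers=syslog,logfile
```

Restart the ovirt-ha-agent / ovirt-ha-broker services afterwards for the
change to take effect.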
You can also change the network monitoring method, and/or configure
options for methods that have them - e.g. 'tcp' with its
'tcp_t_address' and 'tcp_t_port' options. See e.g.:
https://www.ovirt.org/documentation/administration_guide/index.html#Admin...
https://www.ovirt.org/develop/release-management/features/sla/hosted-engi...
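To illustrate what the 'tcp' method boils down to - the broker checks
that it can open a TCP connection to tcp_t_address:tcp_t_port - here is
a minimal Python sketch of that kind of check (my approximation, not
the broker's actual code):

```python
import socket


def tcp_network_test(address: str, port: int, timeout: float = 5.0) -> bool:
    """Roughly what a 'tcp' liveness test does: succeed if a TCP
    connection to address:port can be opened within the timeout."""
    try:
        # create_connection() resolves the address and connects; any
        # failure (refused, unreachable, timeout) counts as a failed test.
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Pointing such a test at a local, always-up service (which is what
tcp_t_address/tcp_t_port let you do) makes the liveness check
independent of your DNS servers - useful if the 'dig' test is the only
thing dragging the score down.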
Best regards,
--
Didi