
On Mon, Aug 22, 2022 at 9:52 AM <markeczzz@gmail.com> wrote:
Hi! In the last few days I have been having a problem with the Hosted Engine: it keeps restarting, sometimes after a few minutes, sometimes after a few hours. I haven't made any changes to oVirt or the network in that time. The version is 4.4.10.7-1.el8 (this was also the installation version).
Here are the logs:

Agent.log
------------------------------
MainThread::INFO::2022-08-21 09:48:36,200::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2440)
2440 is pretty low - did you check what lowers it, other than the single failed network test below?
MainThread::INFO::2022-08-21 09:48:36,200::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node3.ovirt.example.org (id: 3, score: 2440)
MainThread::ERROR::2022-08-21 09:48:46,212::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host node3.ovirt.example.org (id 3) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2022-08-21 09:48:46,641::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineStop) sent? ignored
MainThread::INFO::2022-08-21 09:48:46,706::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)
MainThread::INFO::2022-08-21 09:48:46,706::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node3.ovirt.example.org (id: 3, score: 3400)
MainThread::INFO::2022-08-21 09:48:56,714::hosted_engine::934::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2022-08-21 09:48:56,871::hosted_engine::941::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout:
MainThread::INFO::2022-08-21 09:48:56,871::hosted_engine::942::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2022-08-21 09:48:56,871::hosted_engine::950::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
MainThread::INFO::2022-08-21 09:48:56,880::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Sun Aug 21 09:53:56 2022 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2022-08-21 09:48:56,959::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)
MainThread::INFO::2022-08-21 09:49:06,977::states::537::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm not running on local host
MainThread::INFO::2022-08-21 09:49:06,983::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineDown'>
MainThread::INFO::2022-08-21 09:49:07,173::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStop-EngineDown) sent? ignored
MainThread::INFO::2022-08-21 09:49:07,795::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2022-08-21 09:49:16,811::states::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2022-08-21 09:49:16,998::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? ignored
MainThread::INFO::2022-08-21 09:49:17,179::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStart (score: 3400)
MainThread::INFO::2022-08-21 09:49:17,195::hosted_engine::895::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM
MainThread::INFO::2022-08-21 09:49:17,200::hosted_engine::915::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Cleaning state for non-running VM
MainThread::INFO::2022-08-21 09:49:18,211::hosted_engine::907::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean
MainThread::INFO::2022-08-21 09:49:18,212::hosted_engine::853::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2022-08-21 09:49:18,814::hosted_engine::862::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: VM in WaitForLaunch
MainThread::INFO::2022-08-21 09:49:18,814::hosted_engine::863::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Command VM.getStats with args {'vmID': 'cc7931ff-8124-4724-9242-abea2ab5bf42'} failed: (code=1, message=Virtual machine does not exist: {'vmId': 'cc7931ff-8124-4724-9242-abea2ab5bf42'})
MainThread::INFO::2022-08-21 09:49:18,814::hosted_engine::875::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2022-08-21 09:49:18,999::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? ignored
MainThread::INFO::2022-08-21 09:49:19,008::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21 09:49:29,027::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2022-08-21 09:49:29,033::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Sun Aug 21 09:59:29 2022 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'>
MainThread::INFO::2022-08-21 09:49:29,109::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21 09:49:38,121::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2022-08-21 09:49:38,195::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21 09:49:48,218::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
MainThread::INFO::2022-08-21 09:49:48,403::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored
MainThread::INFO::2022-08-21 09:49:48,713::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2022-08-21 09:49:58,725::states::406::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
Broker.log
------------------------------
Thread-4::INFO::2022-08-21 09:47:59,342::cpu_load_no_engine::142::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.0241, engine=0.0013, non-engine=0.0228
Thread-3::INFO::2022-08-21 09:48:01,311::mem_free::51::mem_free.MemFree::(action) memFree: 96106
Thread-5::INFO::2022-08-21 09:48:05,612::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2022-08-21 09:48:08,591::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2022-08-21 09:48:10,352::network::121::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:10,352::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)
Thread-3::INFO::2022-08-21 09:48:11,389::mem_free::51::mem_free.MemFree::(action) memFree: 96089
Thread-5::INFO::2022-08-21 09:48:15,707::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2022-08-21 09:48:18,662::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2022-08-21 09:48:18,879::network::121::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-3::INFO::2022-08-21 09:48:21,467::mem_free::51::mem_free.MemFree::(action) memFree: 96072
Thread-1::WARNING::2022-08-21 09:48:24,904::network::121::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-5::INFO::2022-08-21 09:48:25,808::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2022-08-21 09:48:28,740::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2022-08-21 09:48:30,416::network::121::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:30,416::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
Thread-3::INFO::2022-08-21 09:48:31,545::mem_free::51::mem_free.MemFree::(action) memFree: 96064
Thread-5::INFO::2022-08-21 09:48:35,909::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-1::WARNING::2022-08-21 09:48:35,940::network::121::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:37,480::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)
Thread-2::INFO::2022-08-21 09:48:38,809::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-3::INFO::2022-08-21 09:48:41,623::mem_free::51::mem_free.MemFree::(action) memFree: 96014
Thread-1::INFO::2022-08-21 09:48:42,549::network::88::network.Network::(action) Successfully verified network status
Thread-5::INFO::2022-08-21 09:48:46,011::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Listener::ERROR::2022-08-21 09:48:46,639::notifications::42::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) (530, b'5.7.1 Authentication required', 'alerts@example.org.hr')
At first I thought it was related to these bugs:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5MUTE...
https://bugzilla.redhat.com/show_bug.cgi?id=1984356
But that bug should already be fixed in this oVirt version.
Indeed - as you can see in your log snippet above, dig was called with '+tcp'.
I have tried monitoring the network, but this error keeps happening even when the network load is low.
It might be something else.
I did try running continuous dig and ping commands on VMs running on the same host as the Hosted Engine, and did not see any network problems, not even one connection drop.
What exactly did you test? Before fixing the above bug, we added loops of 'dig' to ovirt-system-tests, and did see drops - not many, but apparently enough, and often. The 'dig' test is not very configurable from -ha's POV, but you do have control over it from elsewhere - resolv.conf, your name server, etc. Also, note that it runs 'dig' without passing a query, and the default query is for '.' - the root - perhaps your name server has some problem with this?
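To compare apples to apples, you can run the broker's own probe in a loop, rather than a generic dig/ping against some other name. This sketch copies the exact dig options visible in your broker.log above (+tries=1 +time=5 +tcp, no query name, so it asks your resolver for the root zone '.'); the iteration count and sleep are arbitrary:

```shell
# Count failures of the same root query the broker issues every cycle.
fails=0
total=5
for i in $(seq 1 "$total"); do
    # +tcp matters: the broker queries over TCP since the fix for bug 1984356.
    dig +tries=1 +time=5 +tcp >/dev/null 2>&1 || fails=$((fails + 1))
    sleep 1
done
echo "root-query failures over TCP: $fails/$total"
```

If this loop shows drops while your plain dig/ping tests do not, the difference is likely the TCP transport or the root query itself, which would point at the name server rather than the network.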
Any solutions or next steps i should try?
You can configure the agent/broker to log at DEBUG level, to see some more details. You can also change the network monitoring method, and/or configure options for methods that do have them - e.g. 'tcp' and 'tcp_t_address', 'tcp_t_port'. See e.g.:
https://www.ovirt.org/documentation/administration_guide/index.html#Administ...
https://www.ovirt.org/develop/release-management/features/sla/hosted-engine-...

Best regards,
--
Didi
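For the DEBUG logging part: the agent and broker read standard Python fileConfig-style logging configs from /etc/ovirt-hosted-engine-ha/ (agent-log.conf and broker-log.conf). A minimal sketch, assuming the stock layout where the root logger's level is the only thing you need to change (keep the existing handlers lines as they are, then restart ovirt-ha-agent / ovirt-ha-broker):

```ini
; /etc/ovirt-hosted-engine-ha/agent-log.conf (same idea for broker-log.conf)
[logger_root]
level=DEBUG
```

For the network test, the Administration Guide linked above describes switching the method via shared config, along the lines of `hosted-engine --set-shared-config network_test tcp --type=he_local` plus `tcp_t_address` / `tcp_t_port` for the 'tcp' method - check the guide for the exact key names and config type on your version.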