
Hi! In the last few days I have been having a problem with Hosted-Engine: it keeps restarting, sometimes after a few minutes, sometimes after a few hours. I haven't made any changes to oVirt or the network in that time. The version is 4.4.10.7-1.el8 (this was also the installation version).

Here are the logs:

Agent.log ------------------------------

MainThread::INFO::2022-08-21 09:48:36,200::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2440)
MainThread::INFO::2022-08-21 09:48:36,200::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node3.ovirt.example.org (id: 3, score: 2440)
MainThread::ERROR::2022-08-21 09:48:46,212::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host node3.ovirt.example.org (id 3) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2022-08-21 09:48:46,641::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineStop) sent? ignored
MainThread::INFO::2022-08-21 09:48:46,706::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)
MainThread::INFO::2022-08-21 09:48:46,706::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host node3.ovirt.example.org (id: 3, score: 3400)
MainThread::INFO::2022-08-21 09:48:56,714::hosted_engine::934::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2022-08-21 09:48:56,871::hosted_engine::941::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout:
MainThread::INFO::2022-08-21 09:48:56,871::hosted_engine::942::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2022-08-21 09:48:56,871::hosted_engine::950::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
MainThread::INFO::2022-08-21 09:48:56,880::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Sun Aug 21 09:53:56 2022 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2022-08-21 09:48:56,959::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)
MainThread::INFO::2022-08-21 09:49:06,977::states::537::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm not running on local host
MainThread::INFO::2022-08-21 09:49:06,983::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineDown'>
MainThread::INFO::2022-08-21 09:49:07,173::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStop-EngineDown) sent? ignored
MainThread::INFO::2022-08-21 09:49:07,795::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2022-08-21 09:49:16,811::states::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2022-08-21 09:49:16,998::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? ignored
MainThread::INFO::2022-08-21 09:49:17,179::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStart (score: 3400)
MainThread::INFO::2022-08-21 09:49:17,195::hosted_engine::895::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM
MainThread::INFO::2022-08-21 09:49:17,200::hosted_engine::915::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Cleaning state for non-running VM
MainThread::INFO::2022-08-21 09:49:18,211::hosted_engine::907::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean
MainThread::INFO::2022-08-21 09:49:18,212::hosted_engine::853::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Starting vm using `/usr/sbin/hosted-engine --vm-start`
MainThread::INFO::2022-08-21 09:49:18,814::hosted_engine::862::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stdout: VM in WaitForLaunch
MainThread::INFO::2022-08-21 09:49:18,814::hosted_engine::863::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) stderr: Command VM.getStats with args {'vmID': 'cc7931ff-8124-4724-9242-abea2ab5bf42'} failed: (code=1, message=Virtual machine does not exist: {'vmId': 'cc7931ff-8124-4724-9242-abea2ab5bf42'})
MainThread::INFO::2022-08-21 09:49:18,814::hosted_engine::875::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
MainThread::INFO::2022-08-21 09:49:18,999::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStart-EngineStarting) sent? ignored
MainThread::INFO::2022-08-21 09:49:19,008::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21 09:49:29,027::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2022-08-21 09:49:29,033::state_decorators::102::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Sun Aug 21 09:59:29 2022 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'>
MainThread::INFO::2022-08-21 09:49:29,109::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21 09:49:38,121::states::741::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2022-08-21 09:49:38,195::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2022-08-21 09:49:48,218::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
MainThread::INFO::2022-08-21 09:49:48,403::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored
MainThread::INFO::2022-08-21 09:49:48,713::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2022-08-21 09:49:58,725::states::406::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost

Broker.log ------------------------------

Thread-4::INFO::2022-08-21 09:47:59,342::cpu_load_no_engine::142::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.0241, engine=0.0013, non-engine=0.0228
Thread-3::INFO::2022-08-21 09:48:01,311::mem_free::51::mem_free.MemFree::(action) memFree: 96106
Thread-5::INFO::2022-08-21 09:48:05,612::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2022-08-21 09:48:08,591::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2022-08-21 09:48:10,352::network::121::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:10,352::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)
Thread-3::INFO::2022-08-21 09:48:11,389::mem_free::51::mem_free.MemFree::(action) memFree: 96089
Thread-5::INFO::2022-08-21 09:48:15,707::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2022-08-21 09:48:18,662::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2022-08-21 09:48:18,879::network::121::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-3::INFO::2022-08-21 09:48:21,467::mem_free::51::mem_free.MemFree::(action) memFree: 96072
Thread-1::WARNING::2022-08-21 09:48:24,904::network::121::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-5::INFO::2022-08-21 09:48:25,808::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2022-08-21 09:48:28,740::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2022-08-21 09:48:30,416::network::121::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:30,416::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
Thread-3::INFO::2022-08-21 09:48:31,545::mem_free::51::mem_free.MemFree::(action) memFree: 96064
Thread-5::INFO::2022-08-21 09:48:35,909::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-1::WARNING::2022-08-21 09:48:35,940::network::121::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> +tries=1 +time=5 +tcp ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-1::WARNING::2022-08-21 09:48:37,480::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)
Thread-2::INFO::2022-08-21 09:48:38,809::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-3::INFO::2022-08-21 09:48:41,623::mem_free::51::mem_free.MemFree::(action) memFree: 96014
Thread-1::INFO::2022-08-21 09:48:42,549::network::88::network.Network::(action) Successfully verified network status
Thread-5::INFO::2022-08-21 09:48:46,011::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Listener::ERROR::2022-08-21 09:48:46,639::notifications::42::ovirt_hosted_engine_ha.broker.notifications.Notifications::(send_email) (530, b'5.7.1 Authentication required', 'alerts@example.org.hr')

At first I thought it was related to these bugs:
https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5MUTE...
https://bugzilla.redhat.com/show_bug.cgi?id=1984356
But that bug should already be fixed in this oVirt version.

I tried monitoring the network, but this error keeps happening even when network load is low. I ran continuous dig and ping commands on VMs running on the same host as the Hosted Engine, and did not see any network problems, not even a single connection drop.

Any solutions or next steps I should try?

On Mon, Aug 22, 2022 at 9:52 AM <markeczzz@gmail.com> wrote:
> Hi! In the last few days I have been having a problem with Hosted-Engine: it keeps restarting, sometimes after a few minutes, sometimes after a few hours. I haven't made any changes to oVirt or the network in that time. The version is 4.4.10.7-1.el8 (this was also the installation version).
>
> Here are the logs:
>
> Agent.log ------------------------------
>
> MainThread::INFO::2022-08-21 09:48:36,200::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2440)
2440 is pretty low - did you check what lowers it, other than the single failed network test below?
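(For readers following the thread, two quick ways to check this - a sketch, assuming the default hosted-engine-ha log location on the host:)

    # Show each HA host's current state and score:
    hosted-engine --vm-status

    # List the penalties the HA agent applied to the local score:
    grep -i penaliz /var/log/ovirt-hosted-engine-ha/agent.log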
> [agent.log and broker.log snipped - quoted in full in the original message above]
> At first I thought it was related to these bugs:
> https://lists.ovirt.org/archives/list/users@ovirt.org/thread/2HTD5WR43M5MUTE...
> https://bugzilla.redhat.com/show_bug.cgi?id=1984356
> But that bug should already be fixed in this oVirt version.
Indeed - as you can see in your log snippet above, dig was called with '+tcp'.
> I tried monitoring the network, but this error keeps happening even when network load is low.
It might be something else.
> I ran continuous dig and ping commands on VMs running on the same host as the Hosted Engine, and did not see any network problems, not even a single connection drop.
What exactly do you test? Before fixing the above bug, we added loops of 'dig' to ovirt-system-tests, and did see drops - not many, but enough, apparently, and often.

The 'dig' test is not very configurable from -ha's POV, but you do have control over it from elsewhere - resolv.conf, your name server, etc. Also, note that it runs 'dig' without passing a query, and the default query is for '.' - the root - perhaps your name server has some problem with this? (A comparable test loop is sketched below.)
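A minimal sketch of such a loop, assuming the broker uses the host's /etc/resolv.conf and the same dig flags that appear in broker.log above:

    # Run this on the host itself (not in a guest), so it exercises the same
    # resolvers the broker sees. Flags match broker.log: default query for '.',
    # one try, 5-second timeout, TCP transport.
    while true; do
        dig +tries=1 +time=5 +tcp > /dev/null || echo "$(date -Is) dig failed"
        sleep 1
    done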
> Any solutions or next steps I should try?
You can configure the agent/broker to log at DEBUG level, to see some more details.

You can also change the network monitoring method, and/or configure options for methods that do have them - e.g. 'tcp' and 'tcp_t_address', 'tcp_t_port'. See e.g.:

https://www.ovirt.org/documentation/administration_guide/index.html#Administ...
https://www.ovirt.org/develop/release-management/features/sla/hosted-engine-...

Best regards,
--
Didi
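For illustration, a hedged sketch of both suggestions: the commands follow the shared-config mechanism described in the linked documentation, but the test address and port are placeholders, and the log-config paths assume a default EL8 install - verify both before use.

    # Sketch: switch the network liveness test from 'dns' to 'tcp'; 192.0.2.1:80
    # is a placeholder for a reliably reachable local endpoint.
    hosted-engine --set-shared-config network_test tcp --type=he_local
    hosted-engine --set-shared-config tcp_t_address 192.0.2.1 --type=he_local
    hosted-engine --set-shared-config tcp_t_port 80 --type=he_local

    # Sketch: raise agent/broker logging to DEBUG, then restart the HA services.
    sed -i 's/level=INFO/level=DEBUG/' \
        /etc/ovirt-hosted-engine-ha/agent-log.conf \
        /etc/ovirt-hosted-engine-ha/broker-log.conf
    systemctl restart ovirt-ha-agent ovirt-ha-broker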

> 2440 is pretty low - did you check what lowers it, other than the
> single failed network test below?

As far as I can see in agent.log, the only thing that lowers the score is "network status". There are a lot of lines like this:

Penalizing score by 319 due to network status
Penalizing score by 640 due to network status
Penalizing score by 1280 due to network status
Penalizing score by 960 due to network status

> What exactly do you test?

On another VM on the same host as the Hosted Engine I tried this:
1. A continuous ping to 8.8.8.8 (0 lost packets)
2. A dig command every second -> dig +tries=1 +time=5 +tcp (no errors, query time between 2 and 15 ms)

> Before fixing the above bug, we added loops of 'dig' to
> ovirt-system-tests, and did see drops - not many, but enough,
> apparently, and often.
>
> The 'dig' test is not very configurable from -ha's POV, but you do
> have control over it from elsewhere - resolv.conf, your name server,
> etc. Also, note that it runs 'dig' without passing a query, and the
> default query is for '.' - the root - perhaps your name server has
> some problem with this?

Given that the dig command from the other VM goes through without problems, I think the name server should be OK.

> You can configure the agent/broker to log at DEBUG level, to see some
> more details.
>
> You can also change the network monitoring method, and/or configure
> options for methods that do have them - e.g. 'tcp' and
> 'tcp_t_address', 'tcp_t_port'. See e.g.:
>
> https://www.ovirt.org/documentation/administration_guide/index.html#Admin...
> https://www.ovirt.org/develop/release-management/features/sla/hosted-engi...

I will try this, thank you.

On Mon, Aug 22, 2022 at 11:58 AM <markeczzz@gmail.com> wrote:
> [snip]
>
> Given that the dig command from the other VM goes through without problems, I think the name server should be OK.

Did you run such a dig loop also on the host? Does it use the same name server?

> I will try this, thank you.

Good luck and best regards,
--
Didi

Found the problem. I did a tcpdump and found that some DNS requests didn't get responses.

One of the upstream DNS servers didn't have TCP enabled for DNS requests, so requests sent to that server went unanswered; that's why I had Engine restarts at random times.

I don't control the upstream DNS servers (my oVirt DNS server is the gateway, which just forwards requests upstream). So I am thinking that maybe ping would be better in my case for the liveness check.
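For anyone reproducing this, a sketch of the kind of capture that exposes the problem (file name is a placeholder; run it on the host where the broker performs the test):

    # Capture all DNS traffic while waiting for a failed broker probe:
    tcpdump -ni any -w dns.pcap port 53

    # Afterwards, review the TCP port 53 conversations; queries sent to the
    # broken upstream appear with no matching response:
    tcpdump -nr dns.pcap 'tcp port 53'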

On Tue, Aug 23, 2022 at 10:41 AM <markeczzz@gmail.com> wrote:
> Found the problem. I did a tcpdump and found that some DNS requests didn't get responses.
>
> One of the upstream DNS servers didn't have TCP enabled for DNS requests, so requests sent to that server went unanswered; that's why I had Engine restarts at random times.
Thanks for the update!
> I don't control the upstream DNS servers (my oVirt DNS server is the gateway, which just forwards requests upstream).
A tcp-level gateway? It might not be that hard to replace it with a real caching DNS server - a minimal sketch of one option follows below. Up to you, of course.
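As an illustration of that suggestion, a sketch using unbound as a local recursive resolver; the package name and include path assume EL8 and are worth double-checking:

    # unbound resolves from the root servers itself, answering both UDP and
    # TCP queries locally, so the broker's +tcp probe no longer depends on
    # the upstream forwarders.
    dnf install -y unbound
    cat > /etc/unbound/conf.d/local.conf <<'EOF'
    server:
        interface: 127.0.0.1
        access-control: 127.0.0.0/8 allow
    EOF
    systemctl enable --now unbound
    # Then point the host's /etc/resolv.conf (or NetworkManager DNS settings)
    # at 127.0.0.1.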
> So I am thinking that maybe ping would be better in my case for the liveness check.
The reason we changed from ping (the only test we had some time ago) to dns, and from DNS over UDP to DNS over TCP, is that many routers/switches try harder to handle tcp under load - they assume that udp is cheaper to lose.

So I'd personally consider the tcp test, unless you have other means to make sure ping is reliable, or you deliberately want the engine to move to some other host if ping fails often enough.

Best regards,
--
Didi