OST HE: Engine VM went down due to cpu load (was: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-master - Build # 2126 - Failure!)

On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ Build Number: 2126 Build Status: Failure Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #2126 [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
----------------- Failed Tests: ----------------- 2 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
This looks very similar to the issue we have with dns/dig failures that cause the engine VM to go down, and it's similar, but different. dig didn't fail (it now uses TCP), but something else caused the agent to stop the engine VM - a combination of high cpu load and low free memory, after restarting the engine VM as part of test_008. https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/... : ===================================================================================== MainThread::INFO::2021-08-03 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-GlobalMaintenance) sent? ignored MainThread::INFO::2021-08-03 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400) MainThread::INFO::2021-08-03 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 814 due to cpu load MainThread::INFO::2021-08-03 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400) MainThread::INFO::2021-08-03 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 
06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 820 due to cpu load MainThread::INFO::2021-08-03 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2580) MainThread::INFO::2021-08-03 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 831 due to cpu load MainThread::INFO::2021-08-03 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2569) MainThread::INFO::2021-08-03 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 839 due to cpu load MainThread::INFO::2021-08-03 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2561) MainThread::INFO::2021-08-03 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 844 due to cpu load MainThread::INFO::2021-08-03 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2556) MainThread::INFO::2021-08-03 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 849 due to cpu load MainThread::INFO::2021-08-03 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2551) MainThread::INFO::2021-08-03 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (GlobalMaintenance-ReinitializeFSM) sent? ignored MainThread::INFO::2021-08-03 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? 
ignored MainThread::INFO::2021-08-03 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 830 due to cpu load MainThread::INFO::2021-08-03 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 1782 being lower than required 3171 MainThread::INFO::2021-08-03 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 2170) MainThread::INFO::2021-08-03 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'> MainThread::INFO::2021-08-03 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored MainThread::INFO::2021-08-03 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2570) MainThread::ERROR::2021-08-03 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ost-he-basic-suite-master-host-1 (id 2) score is significantly better than local score, shutting down VM on this host =====================================================================================
I looked a bit at /var/log/messages of the host, and while there is quite some noise there, I can't tell specifically what might have caused the high CPU load.
I also looked at logs of previous runs, and there is indeed a similar pattern, where the CPU load on the host causes the agent to penalize the score; but according to the agent log the load goes down faster there, to the point of being quite low by the time the engine is up, and only then do we exit global maintenance, allowing the agent to take actions.
I now ran it again, but otherwise I think I'll ignore this for now, unless we see more similar failures. If we do, we might want to check/monitor/log the CPU load on the hosts, and/or change test_008 to wait, after the engine is up, until the CPU load on the host goes down a bit. Before the port to pytest we waited a hard-coded 5 minutes, and I changed that then to only wait until the engine VM is not migrating, which worked more or less OK. We might need to refine that, although I'd rather not introduce another arbitrarily long delay, but wait for some condition instead.
I also noticed that this run was started at "03:04 AM" (UTC), by timer, whereas all previous timer-based runs started at 01:55 AM, somewhat earlier - perhaps there are other things running at that time that cause load.
Best regards,
-- Didi
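For the "wait for some condition" idea above, this is roughly what such a helper could look like in the OST test code - just a sketch; the get_cpu_load callable, the threshold and the timings are placeholders, not anything that exists in OST today:

import time

def wait_for_low_host_cpu_load(get_cpu_load, threshold=0.4, timeout=600, interval=10):
    """Poll until the host CPU load drops below `threshold`, instead of
    sleeping for an arbitrary fixed delay. `get_cpu_load` is a hypothetical
    callable returning the current load as a float in [0, 1], e.g. read from
    /proc/loadavg over SSH or from the host statistics in the engine API."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        load = get_cpu_load()
        if load < threshold:
            return load
        time.sleep(interval)
    raise TimeoutError(f"host CPU load did not drop below {threshold} within {timeout}s")

test_008 could call something like this right after waiting for the engine VM to stop migrating, so that global maintenance is only exited once the load-based score penalty is gone.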

On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ Build Number: 2126 Build Status: Failure Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #2126 [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
----------------- Failed Tests: ----------------- 2 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
This looks very similar to the issue we have with dns/dig failures that cause the engine VM to go down, and it's similar, but different.
dig didn't fail (it now uses TCP), but something else caused the agent to stop the engine VM - a combination of high cpu load and low free memory, after restarting the engine VM as part of test_008.
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/... :
===================================================================================== MainThread::INFO::2021-08-03 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-GlobalMaintenance) sent? ignored MainThread::INFO::2021-08-03 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400) MainThread::INFO::2021-08-03 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 814 due to cpu load MainThread::INFO::2021-08-03 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400) MainThread::INFO::2021-08-03 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 820 due to cpu load MainThread::INFO::2021-08-03 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2580) MainThread::INFO::2021-08-03 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global 
maintenance detected MainThread::INFO::2021-08-03 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 831 due to cpu load MainThread::INFO::2021-08-03 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2569) MainThread::INFO::2021-08-03 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 839 due to cpu load MainThread::INFO::2021-08-03 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2561) MainThread::INFO::2021-08-03 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 844 due to cpu load MainThread::INFO::2021-08-03 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2556) MainThread::INFO::2021-08-03 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 849 due to cpu load MainThread::INFO::2021-08-03 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2551) MainThread::INFO::2021-08-03 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (GlobalMaintenance-ReinitializeFSM) sent? ignored MainThread::INFO::2021-08-03 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? ignored MainThread::INFO::2021-08-03 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 830 due to cpu load MainThread::INFO::2021-08-03 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 1782 being lower than required 3171 MainThread::INFO::2021-08-03 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 2170) MainThread::INFO::2021-08-03 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'> MainThread::INFO::2021-08-03 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? 
ignored MainThread::INFO::2021-08-03 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2570) MainThread::ERROR::2021-08-03 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ost-he-basic-suite-master-host-1 (id 2) score is significantly better than local score, shutting down VM on this host =====================================================================================
I looked a bit at /var/log/messages of the host, and while there is quite some noise there, I can't tell specifically what might have caused the high CPU load.
I also looked at logs of previous runs, and there is indeed a similar pattern, where the CPU load on the host causes the agent to penalize the score; but according to the agent log the load goes down faster there, to the point of being quite low by the time the engine is up, and only then do we exit global maintenance, allowing the agent to take actions.
I now ran it again, but otherwise I think I'll ignore this for now, unless we see more similar failures. If we do, we might want to check/monitor/log the CPU load on the hosts, and/or change test_008 to wait, after the engine is up, until the CPU load on the host goes down a bit. Before the port to pytest we waited a hard-coded 5 minutes, and I changed that then to only wait until the engine VM is not migrating, which worked more or less OK. We might need to refine that, although I'd rather not introduce another arbitrarily long delay, but wait for some condition instead.
I also noticed that this run was started at "03:04 AM" (UTC), by timer, whereas all previous timer-based runs started at 01:55 AM, somewhat earlier - perhaps there are other things running at that time that cause load.
I checked some more and found another difference: the total CPU use was not very different between the failed run and the last successful one, but the load attributed to the engine VM was - in the failed run, the engine VM's load kept reading '0.0' for a longer time than in the successful run. Since the agent tries to ignore the load caused by the engine VM, the final result is significantly different. The load-by-engine-vm is calculated based on the result of a VDSM getStats call on the engine VM.

Indeed, in vdsm.log, cpuUser and cpuSys both start at '0.00' in both runs, but stay at '0.00' longer in the failed one - 45 seconds between VM.create and the first non-zero cpuUser in the successful run, vs. 93 seconds in the failed one. Does this make sense (from VDSM's POV)? If so, perhaps there is a way to get from VDSM the answer to this question: "Is this 0.00 CPU usage a result of still-initial, non-accurate, non-complete monitoring? Or is it really not using the CPU at all?" and use it in the agent somehow.

-- Didi
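To make the calculation described above concrete, here is a rough, illustrative sketch of the idea - not the actual ovirt-hosted-engine-ha code, and the normalization is an assumption: the HA code subtracts the engine VM's share, derived from the cpuUser/cpuSys in VDSM's getStats, from the total host load, so while those stats still read '0.00' nothing is subtracted and the whole load counts against the score:

def non_engine_load(total_load, vm_stats, host_cpu_count):
    """Illustrative only: approximate the "non-engine" CPU load the agent
    penalizes for. While monitoring has not produced real numbers yet,
    cpuUser/cpuSys are '0.00', so nothing is subtracted and the full host
    load is treated as non-engine load."""
    cpu_user = float(vm_stats.get("cpuUser", "0.00"))
    cpu_sys = float(vm_stats.get("cpuSys", "0.00"))
    # Assume cpuUser/cpuSys are percentages; normalize to a 0..1 host-wide load.
    engine_load = (cpu_user + cpu_sys) / 100.0 / host_cpu_count
    return max(total_load - engine_load, 0.0)

# Example: total host load 0.85, engine VM still reported as idle.
print(non_engine_load(0.85, {"cpuUser": "0.00", "cpuSys": "0.00"}, host_cpu_count=4))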

On 3. 8. 2021, at 11:43, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ Build Number: 2126 Build Status: Failure Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #2126 [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
----------------- Failed Tests: ----------------- 2 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
This looks very similar to the issue we have with dns/dig failures that cause the engine VM to go down, and it's similar, but different.
dig didn't fail (it now uses TCP), but something else caused the agent to stop the engine VM - a combination of high cpu load and low free memory, after restarting the engine VM as part of test_008.
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/... :
===================================================================================== MainThread::INFO::2021-08-03 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-GlobalMaintenance) sent? ignored MainThread::INFO::2021-08-03 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400) MainThread::INFO::2021-08-03 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 814 due to cpu load MainThread::INFO::2021-08-03 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400) MainThread::INFO::2021-08-03 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 820 due to cpu load MainThread::INFO::2021-08-03 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2580) MainThread::INFO::2021-08-03 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global 
maintenance detected MainThread::INFO::2021-08-03 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 831 due to cpu load MainThread::INFO::2021-08-03 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2569) MainThread::INFO::2021-08-03 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 839 due to cpu load MainThread::INFO::2021-08-03 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2561) MainThread::INFO::2021-08-03 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 844 due to cpu load MainThread::INFO::2021-08-03 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2556) MainThread::INFO::2021-08-03 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 849 due to cpu load MainThread::INFO::2021-08-03 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2551) MainThread::INFO::2021-08-03 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (GlobalMaintenance-ReinitializeFSM) sent? ignored MainThread::INFO::2021-08-03 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? ignored MainThread::INFO::2021-08-03 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 830 due to cpu load MainThread::INFO::2021-08-03 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 1782 being lower than required 3171 MainThread::INFO::2021-08-03 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 2170) MainThread::INFO::2021-08-03 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'> MainThread::INFO::2021-08-03 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? 
ignored MainThread::INFO::2021-08-03 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2570) MainThread::ERROR::2021-08-03 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ost-he-basic-suite-master-host-1 (id 2) score is significantly better than local score, shutting down VM on this host =====================================================================================
I looked a bit at /var/log/messages of the host, and while there is quite some noise there, I can't tell specifically what might have caused the high CPU load.
I also looked at logs of previous runs, and there is indeed a similar pattern, where the CPU load on the host causes the agent to penalize the score; but according to the agent log the load goes down faster there, to the point of being quite low by the time the engine is up, and only then do we exit global maintenance, allowing the agent to take actions.
I now ran it again, but otherwise I think I'll ignore this for now, unless we see more similar failures. If we do, we might want to check/monitor/log the CPU load on the hosts, and/or change test_008 to wait, after the engine is up, until the CPU load on the host goes down a bit. Before the port to pytest we waited a hard-coded 5 minutes, and I changed that then to only wait until the engine VM is not migrating, which worked more or less OK. We might need to refine that, although I'd rather not introduce another arbitrarily long delay, but wait for some condition instead.
I also noticed that this run was started at "03:04 AM" (UTC), by timer, whereas all previous timer-based runs started at 01:55 AM, somewhat earlier - perhaps there are other things running at that time that cause load.
I checked some more and found another difference: the total CPU use was not very different between the failed run and the last successful one, but the load attributed to the engine VM was - in the failed run, the engine VM's load kept reading '0.0' for a longer time than in the successful run. Since the agent tries to ignore the load caused by the engine VM, the final result is significantly different. The load-by-engine-vm is calculated based on the result of a VDSM getStats call on the engine VM.

Indeed, in vdsm.log, cpuUser and cpuSys both start at '0.00' in both runs, but stay at '0.00' longer in the failed one - 45 seconds between VM.create and the first non-zero cpuUser in the successful run, vs. 93 seconds in the failed one. Does this make sense (from VDSM's POV)? If so, perhaps there is a way to get from VDSM the answer to this question: "Is this 0.00 CPU usage a result of still-initial, non-accurate, non-complete monitoring? Or is it really not using the CPU at all?" and use it in the agent somehow.
I think it's futile to try to understand that on jenkins.ovirt.org runs. It's too unpredictable since it runs in a shared environment; you can't know what's going on on the same physical host. That's one of the reasons we are trying to move away from it to beaker-based runs that run on dedicated hardware. The problem is that you have no metrics for the actual host, so the numbers OST or vdsm see can be very misleading. We do have sar monitoring of the OST VMs, but we do not capture anything from the host (the host running the OST code), and I would guess that even if we did, for the mock env it's probably going to be misleading anyway.

Thanks,
michal
-- Didi

On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 3. 8. 2021, at 11:43, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ Build Number: 2126 Build Status: Failure Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #2126 [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
----------------- Failed Tests: ----------------- 2 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
This looks very similar to the issue we have with dns/dig failures that cause the engine VM to go down, and it's similar, but different.
dig didn't fail (it now uses TCP), but something else caused the agent to stop the engine VM - a combination of high cpu load and low free memory, after restarting the engine VM as part of test_008.
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/... :
===================================================================================== MainThread::INFO::2021-08-03
06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03
06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-GlobalMaintenance) sent? ignored MainThread::INFO::2021-08-03
06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400) MainThread::INFO::2021-08-03
06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 814 due to cpu load MainThread::INFO::2021-08-03
06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03
06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03
06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03
06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03
06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03
06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400) MainThread::INFO::2021-08-03
06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 820 due to cpu load MainThread::INFO::2021-08-03
06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2580) MainThread::INFO::2021-08-03
06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 831 due to cpu load MainThread::INFO::2021-08-03
06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2569) MainThread::INFO::2021-08-03
06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 839 due to cpu load MainThread::INFO::2021-08-03
06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2561) MainThread::INFO::2021-08-03
06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 844 due to cpu load MainThread::INFO::2021-08-03
06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2556) MainThread::INFO::2021-08-03
06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03
06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 849 due to cpu load MainThread::INFO::2021-08-03
06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2551) MainThread::INFO::2021-08-03
06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (GlobalMaintenance-ReinitializeFSM) sent? ignored MainThread::INFO::2021-08-03
06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03
06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? ignored MainThread::INFO::2021-08-03
06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 830 due to cpu load MainThread::INFO::2021-08-03
06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 1782 being lower than required 3171 MainThread::INFO::2021-08-03
06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 2170) MainThread::INFO::2021-08-03
06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'> MainThread::INFO::2021-08-03
06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored MainThread::INFO::2021-08-03
06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2570) MainThread::ERROR::2021-08-03
06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ost-he-basic-suite-master-host-1 (id 2) score is significantly better than local score, shutting down VM on this host
=====================================================================================
I looked a bit at /var/log/messages of the host, and while there is quite some noise there, I can't tell specifically what might have caused the high CPU load.
I also looked at logs of previous runs, and there is indeed a similar pattern, where the CPU load on the host causes the agent to penalize the score; but according to the agent log the load goes down faster there, to the point of being quite low by the time the engine is up, and only then do we exit global maintenance, allowing the agent to take actions.
I now ran it again, but otherwise I think I'll ignore this for now, unless we see more similar failures. If we do, we might want to check/monitor/log the CPU load on the hosts, and/or change test_008 to wait, after the engine is up, until the CPU load on the host goes down a bit. Before the port to pytest we waited a hard-coded 5 minutes, and I changed that then to only wait until the engine VM is not migrating, which worked more or less OK. We might need to refine that, although I'd rather not introduce another arbitrarily long delay, but wait for some condition instead.
I also noticed that this run was started at "03:04 AM" (UTC), by timer, whereas all previous timer-based runs started at 01:55 AM, somewhat earlier - perhaps there are other things running at that time that cause load.
I checked some more and found another difference: the total CPU use was not very different between the failed run and the last successful one, but the load attributed to the engine VM was - in the failed run, the engine VM's load kept reading '0.0' for a longer time than in the successful run. Since the agent tries to ignore the load caused by the engine VM, the final result is significantly different. The load-by-engine-vm is calculated based on the result of a VDSM getStats call on the engine VM.

Indeed, in vdsm.log, cpuUser and cpuSys both start at '0.00' in both runs, but stay at '0.00' longer in the failed one - 45 seconds between VM.create and the first non-zero cpuUser in the successful run, vs. 93 seconds in the failed one. Does this make sense (from VDSM's POV)? If so, perhaps there is a way to get from VDSM the answer to this question: "Is this 0.00 CPU usage a result of still-initial, non-accurate, non-complete monitoring? Or is it really not using the CPU at all?" and use it in the agent somehow.
I think it's futile to try to understand that on jenkins.ovirt.org runs. It's too unpredictable since it runs in a shared environment; you can't know what's going on on the same physical host.
I understand that I can't get/expect stable performance/load there. That's not my question. I asked about VDSM. I don't know the code there well. I assume that when it starts a VM, it initializes something somewhere with 'cpuUser: 0.00', and once it gets _some_ data, whatever that might be, it starts reporting (based on) this data. My question was whether there is a reliable way to know if '0.00' is real data, or initial, not-yet-real data.
That's one of the reasons we are trying to move away from it to beaker-based runs that run on dedicated hardware. The problem is that you have no metrics for the actual host, so the numbers OST or vdsm see can be very misleading. We do have sar monitoring of the OST VMs, but we do not capture anything from the host (the host running the OST code), and I would guess that even if we did, for the mock env it's probably going to be misleading anyway.
Again: perhaps the actual reported load would be misleading, because it's on a VM, which does not own its CPUs. I'd still expect '0.00' to actually mean zero load, or to have a way to know that it really means "I don't know yet".

Best regards,
-- Didi
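One conceivable heuristic - purely an assumption, not an existing VDSM or HA-agent API: use the elapsedTime field that getStats already returns (visible in the vdsm.log excerpt later in this thread) to treat '0.00' as "don't know yet" during a warm-up window:

def engine_cpu_stats_trustworthy(vm_stats, warmup_seconds=90):
    """Hypothetical check: only trust a '0.00' cpuUser/cpuSys reading once the
    VM has been up long enough for monitoring to have produced at least one
    real sample. The 90s default mirrors the ~93s gap seen in the failed run."""
    try:
        elapsed = int(vm_stats.get("elapsedTime", "0"))
    except (TypeError, ValueError):
        return False
    return elapsed >= warmup_seconds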

On 4. 8. 2021, at 7:38, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 3. 8. 2021, at 11:43, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ Build Number: 2126 Build Status: Failure Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #2126 [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
----------------- Failed Tests: ----------------- 2 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
This looks very similar to the issue we have with dns/dig failures that cause the engine VM to go down, and it's similar, but different.
dig didn't fail (it now uses TCP), but something else caused the agent to stop the engine VM - a combination of high cpu load and low free memory, after restarting the engine VM as part of test_008.
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/artifact/exported-artifacts/test_logs/ost-he-basic-suite-master-host-0/var/log/ovirt-hosted-engine-ha/agent.log :
===================================================================================== MainThread::INFO::2021-08-03 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-GlobalMaintenance) sent? ignored MainThread::INFO::2021-08-03 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400) MainThread::INFO::2021-08-03 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 814 due to cpu load MainThread::INFO::2021-08-03 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586) MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400) MainThread::INFO::2021-08-03 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 820 due to cpu load MainThread::INFO::2021-08-03 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2580) MainThread::INFO::2021-08-03 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global 
maintenance detected MainThread::INFO::2021-08-03 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 831 due to cpu load MainThread::INFO::2021-08-03 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2569) MainThread::INFO::2021-08-03 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 839 due to cpu load MainThread::INFO::2021-08-03 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2561) MainThread::INFO::2021-08-03 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 844 due to cpu load MainThread::INFO::2021-08-03 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2556) MainThread::INFO::2021-08-03 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected MainThread::INFO::2021-08-03 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 849 due to cpu load MainThread::INFO::2021-08-03 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2551) MainThread::INFO::2021-08-03 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (GlobalMaintenance-ReinitializeFSM) sent? ignored MainThread::INFO::2021-08-03 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0) MainThread::INFO::2021-08-03 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? ignored MainThread::INFO::2021-08-03 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 830 due to cpu load MainThread::INFO::2021-08-03 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 1782 being lower than required 3171 MainThread::INFO::2021-08-03 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 2170) MainThread::INFO::2021-08-03 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'> MainThread::INFO::2021-08-03 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? 
ignored MainThread::INFO::2021-08-03 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2570) MainThread::ERROR::2021-08-03 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ost-he-basic-suite-master-host-1 (id 2) score is significantly better than local score, shutting down VM on this host =====================================================================================
I looked a bit at /var/log/messages of the host, and while there is quite some noise there, can't tell specifically what might have caused the high cpu load.
Also looked at logs of previous runs, and there is indeed a similar pattern, where the cpuload on the host causes the agent to penalize the score, but according to the agent log the load goes down faster, up to the point of being quite low when the engine is up, and only then we exit global maintenance, allowing the agent to take actions.
I now ran it again, but otherwise I think I'll ignore this for now, unless we see more similar failures. If we do, we might want to check/monitor/log the CPU load on the hosts, and/or change test_008 to wait, after the engine is up, until the CPU load on the host goes down a bit. Before the port to pytest we waited a hard-coded 5 minutes; I then changed that to only wait until the engine VM is not migrating, and this worked more or less ok. We might need to refine that, although I'd rather not introduce yet another arbitrary long delay, but instead wait for some condition.
I also noticed that this run was started at "03:04 AM" (UTC), by timer, whereas all previous timer-based runs started somewhat earlier, at 01:55 AM - perhaps there are other things running at that time that cause load.
I checked some more and found another difference: the total CPU use was not very different between the failed and the last successful runs, but the load attributed to the engine VM was different - in the failed run, the engine's load kept being '0.0' for a longer time than in the successful run. Since the agent tries to ignore the load caused by the engine VM, the final result is significantly different. The load-by-engine-vm is calculated based on the result of a VDSM getStats call on the engine VM. Indeed, in vdsm.log, cpuUser and cpuSys both start at '0.00' for both runs, but the failed one stays at '0.00' for longer - 45 seconds between VM.create and the first non-zero cpuUser in the successful run, vs 93 seconds in the failed one. Does this make sense (from VDSM's POV)? If so, perhaps there is a way to get from VDSM the answer to this question: "Is this 0.00 CPU usage the result of still-initial, non-accurate, non-complete monitoring? Or is the VM really not using the CPU at all?" - and use it in the agent somehow.
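To make that concrete, here is a rough sketch of the arithmetic as I understand it - an illustration only, not the actual broker/agent code; the function name, the normalization by host CPU count, and treating cpuUser/cpuSys as per-CPU percentages are my assumptions:

# Illustrative sketch, not ovirt-hosted-engine-ha code.
def non_engine_load(total_load, engine_cpu_user, engine_cpu_sys, host_cpu_count):
    """Host load with the engine VM's share (from getStats) subtracted.

    While getStats still returns the initial '0.00' for cpuUser/cpuSys,
    nothing is subtracted, so the whole host load counts as "non-engine"
    load - which is what drives the score penalty described above.
    """
    engine_load = (engine_cpu_user + engine_cpu_sys) / (100.0 * host_cpu_count)
    return max(total_load - engine_load, 0.0)

print(non_engine_load(0.85, 0.0, 0.0, 2))     # -> 0.85, engine still looks "idle"
print(non_engine_load(0.85, 120.0, 20.0, 2))  # -> ~0.15, once real data arrives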
I think it’s futile to try to understand that on jenkins.ovirt.org runs. It’s too unpredictable since it runs in a shared environment; you can’t know what’s going on on the same physical host.
I understand that I can't get/expect stable performance/load there. That's not my question. I asked about VDSM. I don't know the code there well. I assume that when it starts a VM, it initializes something somewhere with 'cpuUser: 0.00', and once it gets _some_ data, whatever that might be, it starts reporting (based on) this data. My question was whether there is a reliable way to know if '0.00' is real data, or initial not-yet-real data.
I don’t really know for sure, but AFAICT it should be real data from the start. Maybe not for the first interval, but afterwards it’s always a libvirt-reported value.
That's one of the reasons we try to move away from it to beaker-based runs that run on dedicated hardware. The problem is that you have no metrics for the actual host, so the numbers OST or vdsm see can be very misleading. We do have sar monitoring of the OST VMs, but we do not capture anything from the host (the host running OST code), and I would guess that even if we did, for a mock env it’s probably going to be misleading anyway.
Again: perhaps the actual reported load would be misleading, because it's on a VM, which does not own its CPUs. I'd still expect '0.00' to mean the load really is zero, or to have a way to know that it actually means "I don't know yet".
Best regards, -- Didi

On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
I don’t really know for sure, but AFAICT it should be real data from the start. Maybe for the first interval, but afterwards it’s always a libvirt reported value
Adding Nir. Not sure who else... sorry.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
Console has:
06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the engine VM... (test_008_restart_he_vm:96)
broker.log has (I think it only logs once a minute):
Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8164, engine=0.0000, non-engine=0.8164
Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8480, engine=0.0000, non-engine=0.8480
Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.7572, engine=0.2656, non-engine=0.4916
vdsm.log [1] has:
2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'VM.create' in bridge...
2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats response={'status': {'code': 0, 'message': 'Done'}, 'statsList': [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId': '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine', 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true', 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '', 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
and 17 more such [2] lines. Line 11 is the first one with cpuUser != 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or not), this is also the first line with 'network' in it. There are other differences along the way - e.g. I see status moving from WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is number 7 - 40 seconds before cpuUser>0.
I'd like to clarify that I do not see this mainly as an OST issue, but more as a general HE HA issue - if users start global maint, then restart the engine vm, then exit global maint too quickly, the reported high cpu load might make the machine go down. In OST, I can easily just add another 60 seconds or so delay after the engine is up. Of course we can do the same also in HA, and I'd be for doing this, if we do not get any more information (or find out that this is a recently-introduced bug and fix it).
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/...
[2] grep -i " 05:2[5678].*api. finish getStats.*cpuUser':"
Thanks and best regards, -- Didi
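In case anyone wants to reproduce that measurement from a downloaded vdsm.log, an ad-hoc helper along these lines should work - not part of OST, just matching the timestamp format and markers seen in the snippets above:

import re
from datetime import datetime

TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def seconds_to_first_real_cpu(vdsm_log_path):
    # Seconds between the 'VM.create' call and the first getStats response
    # whose cpuUser is not the initial '0.00'.
    created = None
    with open(vdsm_log_path) as f:
        for line in f:
            m = TS.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if created is None and "Calling 'VM.create'" in line:
                created = ts
            elif created and "getStats" in line and "'cpuUser':" in line \
                    and "'cpuUser': '0.00'" not in line:
                return (ts - created).total_seconds()
    return None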

On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
I don’t really know for sure, but AFAICT it should be real data from the start. Maybe for the first interval, but afterwards it’s always a libvirt reported value
Adding Nir. Not sure who else... sorry.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
Console has:
06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the engine VM... (test_008_restart_he_vm:96)
broker.log has (I think it only logs once a minute):
Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8164, engine=0.0000, non-engine=0.8164 Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8480, engine=0.0000, non-engine=0.8480 Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.7572, engine=0.2656, non-engine=0.4916
vdsm.log [1] has:
2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'VM.create' in bridge...
2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats response={'status': {'code': 0, 'message': 'Done'}, 'statsList': [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId': '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine', 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true', 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '', 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
and 17 more such [2] lines. Line 11 is the first one with cpuUser != 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or not), this is also the first line with 'network' in it. There are other differences along the way - e.g. I see status moving from WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is number 7 - 40 seconds before cpuUser>0.
I'd like to clarify that I do not see this mainly as an OST issue, but more as a general HE HA issue - if users start global maint, then restart the engine vm, then exit global maint too quickly, the reported high cpu load might make the machine go down. In OST, I can easily just add another 60 seconds or so delay after the engine is up. Of course we can do the same also in HA, and I'd be for doing this, if we do not get any more information (or find out that this is a recently-introduced bug and fix it).
Happened again: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2132/
Pushed this, for now: https://gerrit.ovirt.org/c/ovirt-system-tests/+/116134 ("he: test_008_restart_he_vm.py: Wait until host score is good enough").
I'd still like to get more input from vdsm people. I don't mind just opening a bug. -- Didi
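For reference, the kind of wait that patch adds can be sketched like this - a simplified illustration, not the actual OST change; it assumes running 'hosted-engine --vm-status --json' locally on the host and that each host entry in its output carries a numeric "score" field (in OST this would go through the usual ansible/ssh plumbing):

import json
import subprocess
import time

def wait_for_good_he_score(min_score=3000, timeout=600, interval=10):
    # Poll the HA status until every reported host score recovers, so that
    # leaving global maintenance cannot immediately trigger a shutdown of
    # the engine VM because of a transient cpu-load penalty.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.check_output(
            ["hosted-engine", "--vm-status", "--json"], text=True)
        status = json.loads(out)
        scores = [h["score"] for h in status.values()
                  if isinstance(h, dict) and "score" in h]
        if scores and min(scores) >= min_score:
            return
        time.sleep(interval)
    raise TimeoutError("hosted-engine host score did not recover in time")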

On Sun, Aug 8, 2021 at 10:14 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
I don’t really know for sure, but AFAICT it should be real data from the start. Maybe for the first interval, but afterwards it’s always a libvirt reported value
Adding Nir. Not sure who else... sorry.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
Console has:
06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the engine VM... (test_008_restart_he_vm:96)
broker.log has (I think it only logs once a minute):
Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8164, engine=0.0000, non-engine=0.8164 Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8480, engine=0.0000, non-engine=0.8480 Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.7572, engine=0.2656, non-engine=0.4916
vdsm.log [1] has:
2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'VM.create' in bridge...
2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats response={'status': {'code': 0, 'message': 'Done'}, 'statsList': [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId': '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine', 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true', 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '', 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
and 17 more such [2] lines. Line 11 is the first one with cpuUser != 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or not), this is also the first line with 'network' in it. There are other differences along the way - e.g. I see status moving from WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is number 7 - 40 seconds before cpuUser>0.
Milan should be able to help with this. In storage monitoring we avoid this issue by reporting actual=False before we get the first monitoring results, so the engine can wait for the actual results. https://github.com/oVirt/vdsm/blob/4309a39492040300e1b983eb583e8847f5cc7538/...
I'd like to clarify that I do not see this mainly as an OST issue, but more as a general HE HA issue - if users start global maint, then restart the engine vm, then exit global maint too quickly, the reported high cpu load might make the machine go down. In OST, I can easily just add another 60 seconds or so delay after the engine is up. Of course we can do the same also in HA, and I'd be for doing this, if we do not get any more information (or find out that this is a recently-introduced bug and fix it).
If this is a real issue, you should be able to reproduce it on a real system. Nir

On Mon, Aug 9, 2021 at 1:39 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Aug 8, 2021 at 10:14 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
I don’t really know for sure, but AFAICT it should be real data from the start. Maybe for the first interval, but afterwards it’s always a libvirt reported value
Adding Nir. Not sure who else... sorry.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
Console has:
06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the engine VM... (test_008_restart_he_vm:96)
broker.log has (I think it only logs once a minute):
Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8164, engine=0.0000, non-engine=0.8164 Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8480, engine=0.0000, non-engine=0.8480 Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.7572, engine=0.2656, non-engine=0.4916
vdsm.log [1] has:
2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'VM.create' in bridge...
2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats response={'status': {'code': 0, 'message': 'Done'}, 'statsList': [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId': '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine', 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true', 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '', 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
and 17 more such [2] lines. Line 11 is the first one with cpuUser != 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or not), this is also the first line with 'network' in it. There are other differences along the way - e.g. I see status moving from WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is number 7 - 40 seconds before cpuUser>0.
Milan should be able to help with this.
Thanks.
In storage monitoring we avoid this issue by reporting actual=False before we got the first monitoring results, so engine can wait for the actual results. https://github.com/oVirt/vdsm/blob/4309a39492040300e1b983eb583e8847f5cc7538/...
Makes sense. That's indeed what I was looking for, for VM cpu usage.
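As an illustration of that approach applied to VM CPU stats - hypothetical code with a made-up 'cpuActual' field, not something vdsm reports today:

def vm_cpu_stats(cpu_time_samples_ns, interval_s=15):
    # Report zeros but flag them as not-yet-real until a delta can be
    # computed from two samples, so a consumer like the HA agent can wait
    # instead of treating the VM as idle.
    if len(cpu_time_samples_ns) < 2:
        return {"cpuUser": 0.00, "cpuActual": False}
    usage = 100.0 * (cpu_time_samples_ns[-1] - cpu_time_samples_ns[-2]) / (interval_s * 1e9)
    return {"cpuUser": round(usage, 2), "cpuActual": True}

print(vm_cpu_stats([1_000_000_000]))                 # {'cpuUser': 0.0, 'cpuActual': False}
print(vm_cpu_stats([1_000_000_000, 3_400_000_000]))  # {'cpuUser': 16.0, 'cpuActual': True}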
I'd like to clarify that I do not see this mainly as an OST issue, but more as a general HE HA issue - if users start global maint, then restart the engine vm, then exit global maint too quickly, the reported high cpu load might make the machine go down. In OST, I can easily just add another 60 seconds or so delay after the engine is up. Of course we can do the same also in HA, and I'd be for doing this, if we do not get any more information (or find out that this is a recently-introduced bug and fix it).
If this is a real issue you should be able to reproduce this on a real system.
In "real", you might refer to two different things: 1. OST is a different environment - has ridiculously little memory/cpu, etc., or something else that is not expected or not recommended for a real system. 2. The _flow_ is not real. As in, it's unlikely that a real user will exit global maintenance so quickly after starting the engine VM, without looking around a bit more. I agree with both - and even if it's eventually considered a real bug, I'd not consider it severe. But just saying "OST is not a real system" is not something I can completely agree with. We have a balance/tradeoff here between trying to imitate "real systems" as accurately as possible and between doing this efficiently/effectively. I do not think there is a deliberate design choice to make it arbitrarily different from real systems. Best regards, -- Didi

Yedidyah Bar David <didi@redhat.com> writes:
On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek < michal.skrivanek@redhat.com> wrote:
On 3. 8. 2021, at 11:43, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ Build Number: 2126 Build Status: Failure Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #2126 [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
----------------- Failed Tests: ----------------- 2 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
This looks very similar to the issue we have with dns/dig failures that cause the engine VM to go down, and it's similar, but different.
dig didn't fail (it now uses TCP), but something else caused the agent to stop the engine VM - a combination of high cpu load and low free memory, after restarting the engine VM as part of test_008.
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/... :
I looked a bit at /var/log/messages of the host, and while there is quite some noise there, can't tell specifically what might have caused the high cpu load.
Also looked at logs of previous runs, and there is indeed a similar pattern, where the cpuload on the host causes the agent to penalize the score, but according to the agent log the load goes down faster, up to the point of being quite low when the engine is up, and only then we exit global maintenance, allowing the agent to take actions.
I now ran it again, but think that otherwise I'll ignore this for now, unless we see more similar failures. If we do, we might want to check/monitor/log the cpu load on the hosts, and/or change test_008 to wait, after the engine is up, until the cpu load on the host goes down a bit. Before the port to pytest we waited a hard-coded 5 minutes and I changed that then to only wait until the engine VM is not migrating, and this worked more-or-less ok. We might need to refine that, although I'd rather not introduce another arbitrary long delays but wait for some condition.
I also noticed that this ran was started at "03:04 AM" (UTC), by timer, whereas all previous timer-based runs started at 01:55 AM, somewhat earlier - perhaps there are some other things that run at that time that cause load.
Checked some more and found another difference: The total cpu use was not very different between the failed and the last successful runs, but load caused by the engine vm was different between them - in the failed run, engine kept being '0.0' for a longer time, compared with the successful run. Since the agent tries to ignore the load by the engine vm, the final result is significantly different. The load-by -engine-vm is calculated based on the result of a VDSM getStats call on the engine VM. Indeed, in vdsm.log, cpuUser and cpuSys are both starting with '0.00' for both runs, but the failed one is '0.00' for a longer time - 45 seconds between VM.create and first-non-zero cpuUser on the successful run, vs 93 seconds for the failed one. Does this make sense (from VDSM's POV)? If so, perhaps there is a way to get from VDSM the answer to this question: "Is this 0.00 CPU usage a result of still-initial, non-accurate, non-complete monitoring? Or is it really not using the CPU at all?" and use it in the agent somehow.
I think it’s futile to try to understand that on jenkins.ovirt.org runs. It’s too unpredictable since it runs in a shared environment, you can’t know what’s going on on the same physical host.
I understand that I can't get/expect stable performance/load there. That's not my question. I asked about VDSM. I don't know the code there well. I assume that when it starts a VM, it initializes something somewhere with 'cpuUser: 0.00',
Yes.
and once its gets _some_ data, whatever that might be, it starts reporting (based on) this data.
Yes. It retrieves the data in 15-second intervals by default, and it needs 2 retrievals to compute the value (the difference between the 2 samples). It gets the data from libvirt for running VMs. And then it waits until Engine calls getStats, which I believe is also every 15 seconds. So under normal circumstances it should take about 15-45 seconds after the VM starts running in libvirt before the first real data is reported to Engine.
I can think of the following circumstances in which it could be delayed:
- The VM is detected as non-responsive in libvirt. Then the VM is not queried for its data this time and for the following 40 seconds.
- The executor gets overfilled.
- Engine doesn't ask for the data for some reason.
- There is something very wrong in the system.
None of these should happen if everything is all right, but sometimes not everything is all right, especially when there is something stuck in QEMU monitoring (e.g. because of temporary storage problems).
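In other words, as a toy illustration of the two-sample computation (not vdsm code; nanoseconds is the unit libvirt uses for cumulative VM CPU time):

def cpu_percent(prev_cpu_time_ns, cur_cpu_time_ns, sample_interval_s=15):
    # Percent of one CPU used between two consecutive samples; until a
    # second sample exists there is no delta, hence the initial 0.00.
    return 100.0 * (cur_cpu_time_ns - prev_cpu_time_ns) / (sample_interval_s * 1e9)

print(cpu_percent(5_000_000_000, 7_250_000_000))  # -> 15.0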
My question was if there is a reliable way to know if '0.00' is just real data, or initial-not-yet-real data.
From this data itself, no, but from the accompanying data, probably yes. I'd say that if memory stats are present then it's a real value.
One of the reasons we try to move away from it to beaker-based runs that run dedicated. The problem is you have no metrics for the actual host, so the numbers OST or vdsm see can be very misleading We do have sar monitoring of OST VMs, we do not capture anything from the host (the host running OST code), but I would guess that even if we would, then for mock env it’s probably going to be misleading anyway.
Again: Perhaps the actual reported load would be misleading, because it's on a VM, which does not own its CPUs. I'd still expect '0.00' to mean that, or have a way to know that it actually means "I don't know yet".
Best regards,

On Tue, Aug 10, 2021 at 12:05 PM Milan Zamazal <mzamazal@redhat.com> wrote:
Yedidyah Bar David <didi@redhat.com> writes:
On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek < michal.skrivanek@redhat.com> wrote:
On 3. 8. 2021, at 11:43, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ Build Number: 2126 Build Status: Failure Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #2126 [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
----------------- Failed Tests: ----------------- 2 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
This looks very similar to the issue we have with dns/dig failures that cause the engine VM to go down, and it's similar, but different.
dig didn't fail (it now uses TCP), but something else caused the agent to stop the engine VM - a combination of high cpu load and low free memory, after restarting the engine VM as part of test_008.
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/... :
I looked a bit at /var/log/messages of the host, and while there is quite some noise there, can't tell specifically what might have caused the high cpu load.
Also looked at logs of previous runs, and there is indeed a similar pattern, where the cpuload on the host causes the agent to penalize the score, but according to the agent log the load goes down faster, up to the point of being quite low when the engine is up, and only then we exit global maintenance, allowing the agent to take actions.
I now ran it again, but think that otherwise I'll ignore this for now, unless we see more similar failures. If we do, we might want to check/monitor/log the cpu load on the hosts, and/or change test_008 to wait, after the engine is up, until the cpu load on the host goes down a bit. Before the port to pytest we waited a hard-coded 5 minutes and I changed that then to only wait until the engine VM is not migrating, and this worked more-or-less ok. We might need to refine that, although I'd rather not introduce another arbitrary long delays but wait for some condition.
I also noticed that this ran was started at "03:04 AM" (UTC), by timer, whereas all previous timer-based runs started at 01:55 AM, somewhat earlier - perhaps there are some other things that run at that time that cause load.
Checked some more and found another difference: The total cpu use was not very different between the failed and the last successful runs, but load caused by the engine vm was different between them - in the failed run, engine kept being '0.0' for a longer time, compared with the successful run. Since the agent tries to ignore the load by the engine vm, the final result is significantly different. The load-by -engine-vm is calculated based on the result of a VDSM getStats call on the engine VM. Indeed, in vdsm.log, cpuUser and cpuSys are both starting with '0.00' for both runs, but the failed one is '0.00' for a longer time - 45 seconds between VM.create and first-non-zero cpuUser on the successful run, vs 93 seconds for the failed one. Does this make sense (from VDSM's POV)? If so, perhaps there is a way to get from VDSM the answer to this question: "Is this 0.00 CPU usage a result of still-initial, non-accurate, non-complete monitoring? Or is it really not using the CPU at all?" and use it in the agent somehow.
I think it’s futile to try to understand that on jenkins.ovirt.org runs. It’s too unpredictable since it runs in a shared environment, you can’t know what’s going on on the same physical host.
I understand that I can't get/expect stable performance/load there. That's not my question. I asked about VDSM. I don't know the code there well. I assume that when it starts a VM, it initializes something somewhere with 'cpuUser: 0.00',
Yes.
and once its gets _some_ data, whatever that might be, it starts reporting (based on) this data.
Yes. It retrieves the data in 15 seconds intervals by default and it needs 2 retrievals to compute the data (the difference between the 2 samples). It gets the data from libvirt for running VMs. And then it waits until Engine calls getStats, which I believe is also every 15 seconds. So under normal circumstances it should take about 15-45 seconds after the VM starts being running in libvirt before first real data is reported to Engine.
I can think about the following circumstances when it could be delayed:
- The VM is detected as non-responsive in libvirt. Then the VM is not queried for its data this time and the following 40 seconds.
- The executor gets overfilled.
- Engine doesn't ask for the data for some reason.
- There is something very wrong in the system.
None of these should happen if everything is all right but sometimes not everything is all right, especially when there is something stuck in QEMU monitoring (e.g. because of temporary storage problems).
My question was if there is a reliable way to know if '0.00' is just real data, or initial-not-yet-real data.
From this data itself not but from the accompanied data probably yes. I'd say that if memory stats are present then it's a real value.
That's indeed what I see also in the referenced vdsm.log.
Now pushed: https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/116145
Please review. Thanks! -- Didi
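The heuristic itself is simple - the following is only a sketch of the idea, not the contents of that patch; using 'memUsage' as the "real data has arrived" marker is an assumption based on Milan's hint above:

def engine_vm_cpu_load(vm_stats):
    # Return the engine VM's CPU load, or None while the getStats result
    # still looks like the initial placeholder (no memory stats yet).
    if "memUsage" not in vm_stats:
        return None
    return float(vm_stats.get("cpuUser", 0.0)) + float(vm_stats.get("cpuSys", 0.0))

print(engine_vm_cpu_load({"cpuUser": "0.00", "cpuSys": "0.00"}))                     # None
print(engine_vm_cpu_load({"cpuUser": "12.40", "cpuSys": "3.10", "memUsage": "41"}))  # 15.5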
participants (4):
- Michal Skrivanek
- Milan Zamazal
- Nir Soffer
- Yedidyah Bar David