On 4. 8. 2021, at 7:38, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
> On 3. 8. 2021, at 11:43, Yedidyah Bar David <didi@redhat.com> wrote:
>
> On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
>>
>> On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
>>>
>>> Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
>>> Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/
>>> Build Number: 2126
>>> Build Status: Failure
>>> Triggered By: Started by timer
>>>
>>> -------------------------------------
>>> Changes Since Last Success:
>>> -------------------------------------
>>> Changes for Build #2126
>>> [Michal Skrivanek] basic: skipping just the VNC console part of test_virtual_machines
>>>
>>>
>>>
>>>
>>> -----------------
>>> Failed Tests:
>>> -----------------
>>> 2 tests failed.
>>> FAILED: he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
>>>
>>> Error Message:
>>> ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
>>
>> This looks very similar to the issue we have with dns/dig failures
>> that cause the engine VM to go down, but it turns out to be a different one.
>>
>> dig didn't fail (it now uses TCP), but something else caused the agent
>> to stop the engine VM - a combination of high cpu load and low free
>> memory, after restarting the engine VM as part of test_008.
>>
>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/21... :
>>
>> =====================================================================================
>> MainThread::INFO::2021-08-03 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
>> MainThread::INFO::2021-08-03 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-GlobalMaintenance) sent? ignored
>> MainThread::INFO::2021-08-03 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400)
>> MainThread::INFO::2021-08-03 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 814 due to cpu load
>> MainThread::INFO::2021-08-03 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
>> MainThread::INFO::2021-08-03 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
>> MainThread::INFO::2021-08-03 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
>> MainThread::INFO::2021-08-03 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
>> MainThread::INFO::2021-08-03 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
>> MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400)
>> MainThread::INFO::2021-08-03 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 820 due to cpu load
>> MainThread::INFO::2021-08-03 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2580)
>> MainThread::INFO::2021-08-03 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 831 due to cpu load
>> MainThread::INFO::2021-08-03 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2569)
>> MainThread::INFO::2021-08-03 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 839 due to cpu load
>> MainThread::INFO::2021-08-03 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2561)
>> MainThread::INFO::2021-08-03 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 844 due to cpu load
>> MainThread::INFO::2021-08-03 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2556)
>> MainThread::INFO::2021-08-03 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
>> MainThread::INFO::2021-08-03 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 849 due to cpu load
>> MainThread::INFO::2021-08-03 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2551)
>> MainThread::INFO::2021-08-03 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (GlobalMaintenance-ReinitializeFSM) sent? ignored
>> MainThread::INFO::2021-08-03 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
>> MainThread::INFO::2021-08-03 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? ignored
>> MainThread::INFO::2021-08-03 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 830 due to cpu load
>> MainThread::INFO::2021-08-03 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 1782 being lower than required 3171
>> MainThread::INFO::2021-08-03 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 2170)
>> MainThread::INFO::2021-08-03 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
>> MainThread::INFO::2021-08-03 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored
>> MainThread::INFO::2021-08-03 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2570)
>> MainThread::ERROR::2021-08-03 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ost-he-basic-suite-master-host-1 (id 2) score is significantly better than local score, shutting down VM on this host
>> =====================================================================================
>>
>> I looked a bit at /var/log/messages on the host, and while there is
>> quite some noise there, I can't tell specifically what might have
>> caused the high CPU load.
>>
>> I also looked at logs of previous runs, and there is indeed a similar
>> pattern, where the CPU load on the host causes the agent to penalize
>> the score, but according to the agent log the load drops faster - to
>> the point of being quite low by the time the engine is up, and only
>> then do we exit global maintenance, allowing the agent to take action.
>>
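>> Just to illustrate the pattern (this is only a sketch, not the actual
>> ovirt-hosted-engine-ha scoring code; the 3400 base score and the flat
>> 400 memory penalty match the log above, the CPU penalty factor is made
>> up):
>>
>> BASE_SCORE = 3400         # "score: 3400" when nothing is penalized
>> MEM_PENALTY = 400         # "Penalizing score by 400 due to free memory"
>> CPU_PENALTY_PER_PCT = 10  # assumed scaling, just for readable numbers
>>
>> def host_score(host_load_pct, engine_vm_load_pct, free_mb, required_mb):
>>     # Penalize only the CPU load that is not attributable to the
>>     # engine VM itself, plus a flat penalty for low free memory.
>>     score = BASE_SCORE
>>     other_load = max(0.0, host_load_pct - engine_vm_load_pct)
>>     score -= int(other_load * CPU_PENALTY_PER_PCT)
>>     if free_mb < required_mb:
>>         score -= MEM_PENALTY
>>     return max(score, 0)
>>
>> # With the engine VM reported as 0.0 (as in the failed run), the whole
>> # host load is penalized: host_score(83, 0.0, 1782, 3171) == 2170,
>> # whereas host_score(83, 60.0, 1782, 3171) == 2770.
>>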
>> I now ran it again, but otherwise I think I'll ignore this for now,
>> unless we see more similar failures. If we do, we might want to
>> check/monitor/log the CPU load on the hosts, and/or change test_008 to
>> wait, after the engine is up, until the CPU load on the host goes down
>> a bit. Before the port to pytest we waited a hard-coded 5 minutes, and I
>> changed that to only wait until the engine VM is no longer migrating,
>> which worked more or less ok. We might need to refine that, although
>> I'd rather not introduce another arbitrarily long delay, but rather
>> wait for some condition.
>>
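>> Something like this, roughly (only a sketch; wait_for, host.ssh and
>> host.cpu_count are hypothetical helpers, not existing OST code):
>>
>> import time
>>
>> def wait_for(condition, timeout=600, interval=10):
>>     # Poll for a condition instead of sleeping a fixed amount of time.
>>     deadline = time.monotonic() + timeout
>>     while time.monotonic() < deadline:
>>         if condition():
>>             return
>>         time.sleep(interval)
>>     raise TimeoutError(f"condition not met within {timeout}s")
>>
>> def host_load_is_low(host, threshold=0.3):
>>     # 1-minute load average from /proc/loadavg, normalized by CPU count.
>>     load1 = float(host.ssh(["cat", "/proc/loadavg"]).out.split()[0])
>>     return load1 / host.cpu_count < threshold
>>
>> # in test_008, after the engine VM is up and no longer migrating:
>> # wait_for(lambda: host_load_is_low(he_host))
>>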
>> I also noticed that this run was started at "03:04 AM" (UTC) by the
>> timer, whereas all previous timer-based runs started somewhat earlier,
>> at 01:55 AM - perhaps there are some other things running at that time
>> that cause load.
>
> Checked some more and found another difference: the total CPU use was
> not very different between the failed run and the last successful one,
> but the load caused by the engine VM was - in the failed run, the
> engine's load stayed at '0.0' for a longer time than in the successful
> run. Since the agent tries to ignore the load caused by the engine VM,
> the final result is significantly different. The load-by-engine-vm is
> calculated based on the result of a VDSM getStats call on the engine
> VM. Indeed, in vdsm.log, cpuUser and cpuSys both start at '0.00' in
> both runs, but in the failed one they stay at '0.00' for a longer
> time - 45 seconds between VM.create and the first non-zero cpuUser in
> the successful run, vs 93 seconds in the failed one.
> Does this make sense (from VDSM's POV)? If so, perhaps there is
> a way to get from VDSM the answer to this question: "Is this 0.00
> CPU usage a result of still-initial, non-accurate, non-complete
> monitoring? Or is it really not using the CPU at all?" and use it
> in the agent somehow.
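> Lacking such an indication from VDSM, a consumer-side workaround could
> look roughly like this (only a sketch; the 120-second warm-up window
> and the class are made up, only the cpuUser/cpuSys field names come
> from getStats):
>
> import time
>
> WARMUP_SECONDS = 120  # assumed grace period after VM.create
>
> class EngineVmCpuLoad:
>     def __init__(self, vm_start_time):
>         self.vm_start_time = vm_start_time
>         self.seen_nonzero = False
>
>     def effective_load(self, stats):
>         # Return None ("don't know yet") while cpuUser/cpuSys are still
>         # at their initial 0.00 and the VM only just started; otherwise
>         # return the reported load.
>         cpu = float(stats.get("cpuUser", 0.0)) + float(stats.get("cpuSys", 0.0))
>         if cpu > 0.0:
>             self.seen_nonzero = True
>         warming_up = time.time() - self.vm_start_time < WARMUP_SECONDS
>         if cpu == 0.0 and not self.seen_nonzero and warming_up:
>             return None
>         return cpu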
I think it’s futile to try to understand that on jenkins.ovirt.org runs. It’s too
unpredictable since it runs in a shared environment; you can’t know what’s going on
on the same physical host.
I understand that I can't get/expect stable performance/load there. That's not my
question. I asked about VDSM. I don't know the code there well. I assume that when it
starts a VM, it initializes something somewhere with 'cpuUser: 0.00', and once it
gets _some_ data, whatever that might be, it starts reporting (based on) this data. My
question was whether there is a reliable way to know if '0.00' is real data, or
initial-not-yet-real data.
I don’t really know for sure, but AFAICT it should be real data from the start.
Maybe not for the first interval, but afterwards it’s always a libvirt-reported value.
One of the reasons we are trying to move away from it to beaker-based runs
that run on dedicated hardware.
The problem is that you have no metrics for the actual host, so the numbers OST or vdsm
see can be very misleading.
We do have sar monitoring of the OST VMs; we do not capture anything from the host (the
host running the OST code), but I would guess that even if we did, for a mock env it’s
probably going to be misleading anyway.
Again: perhaps the actual reported load would be misleading, because it's on a VM,
which does not own its CPUs. I'd still expect '0.00' to really mean zero, or to have a
way to know that it actually means "I don't know yet".
Best regards,
--
Didi