[ovirt-devel] Re: OST HE: Engine VM went down due to cpu load (was: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-master - Build # 2126 - Failure!)

Monday, 9 August 2021

On Sun, Aug 8, 2021 at 10:14 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
...

 On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
 >
 > On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek
 > <michal.skrivanek(a)redhat.com> wrote:
 > > I don’t really know for sure, but AFAICT it should be real data from the start.
 > > Maybe for the first interval, but afterwards it’s always a libvirt reported value
 >
 > Adding Nir. Not sure who else... sorry.
 >
 > This now happened again:
 >
 > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
 >
 > Console has:
 >
 > 06:25:25 2021-08-05 03:25:25+0000,873 INFO    [root] Starting the engine VM... (test_008_restart_he_vm:96)
 >
 > broker.log has (I think it only logs once a minute):
 >
 > Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
 > System load total=0.8164, engine=0.0000, non-engine=0.8164
 > Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
 > System load total=0.8480, engine=0.0000, non-engine=0.8480
 > Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
 > System load total=0.7572, engine=0.2656, non-engine=0.4916
 >
 > vdsm.log [1] has:
 >
 > 2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer]
 > Calling 'VM.create' in bridge...
 >
 > 2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats
 > response={'status': {'code': 0, 'message': 'Done'}, 'statsList':
 > [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId':
 > '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine',
 > 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true',
 > 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '',
 > 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
 >
 > and 17 more such [2] lines. Line 11 is the first one with cpuUser !=
 > 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or
 > not), this is also the first line with 'network' in it. There are
 > other differences along the way - e.g. I see status moving from
 > WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is
 > number 7 - 40 seconds before cpuUser>0.

Milan should be able to help with this.

In storage monitoring we avoid this issue by reporting actual=False
until we get the first monitoring results, so engine can wait for the
actual results.
https://github.com/oVirt/vdsm/blob/4309a39492040300e1b983eb583e8847f5cc75...
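A minimal sketch of how the same idea could look for VM cpu stats, assuming
nothing about the real vdsm or ovirt-hosted-engine-ha code. The class
CpuStatsMonitor, the helper engine_load, and the 'actual' key in the stats
dict are hypothetical names used only for illustration. The monitor reports
actual=False until the first sample arrives, and the consumer skips the load
calculation instead of treating the placeholder 0.00 values as real data.

```python
import threading


class CpuStatsMonitor:
    """Hypothetical monitor; not the actual vdsm class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sample = None  # no monitoring result received yet

    def update(self, cpu_user, cpu_sys):
        # Called by the sampling thread when a real sample is available.
        with self._lock:
            self._sample = {"cpuUser": cpu_user, "cpuSys": cpu_sys}

    def get_stats(self):
        # Until the first sample, return placeholders marked as not actual.
        with self._lock:
            if self._sample is None:
                return {"actual": False, "cpuUser": 0.0, "cpuSys": 0.0}
            return dict(self._sample, actual=True)


def engine_load(stats):
    """Return the engine cpu load, or None while stats are not actual."""
    if not stats["actual"]:
        return None
    return stats["cpuUser"] + stats["cpuSys"]
```

With something like this, a caller that gets None can keep its previous
state and retry on the next interval rather than computing a score from
zeros.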

...
 > I'd like to clarify that I do not see this mainly as an OST issue, but
 > more as a general HE HA issue - if users start global maint, then
 > restart the engine vm, then exit global maint too quickly, the
 > reported high cpu load might make the machine go down. In OST, I can
 > easily just add another 60 seconds or so delay after the engine is up.
 > Of course we can do the same also in HA, and I'd be for doing this, if
 > we do not get any more information (or find out that this is a
 > recently-introduced bug and fix it).

If this is a real issue you should be able to reproduce this on a real system.

Nir
