[ovirt-devel] Re: OST HE: Engine VM went down due to cpu load (was: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-master - Build # 2126 - Failure!)

9 Aug 2021

      On Mon, Aug 9, 2021 at 1:39 PM Nir Soffer <nsoffer@redhat.com> wrote:
...
On Sun, Aug 8, 2021 at 10:14 AM Yedidyah Bar David <didi@redhat.com> wrote:
...
On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi@redhat.com> wrote:
...
On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek
<michal.skrivanek@redhat.com> wrote:
...
I don’t really know for sure, but AFAICT it should be real data from the start.
Maybe for the first interval, but afterwards it’s always a libvirt reported value
Adding Nir. Not sure who else... sorry.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
Console has:
06:25:25 2021-08-05 03:25:25+0000,873 INFO    [root] Starting the engine VM... (test_008_restart_he_vm:96)
broker.log has (I think it only logs once a minute):
Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8164, engine=0.0000, non-engine=0.8164
Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8480, engine=0.0000, non-engine=0.8480
Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.7572, engine=0.2656, non-engine=0.4916
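To compare these samples more easily, here is a minimal sketch that pulls the three load figures out of such broker.log lines. The regular expression is based only on the message format visible above. The names `LOAD_RE`, `parse_loads` and `SAMPLE` are made up for illustration and are not part of ovirt-hosted-engine-ha. The numbers quoted above are consistent with non-engine = total - engine, which the sketch checks.

```python
import re

# Matches the load summary as it appears in the broker.log lines above.
LOAD_RE = re.compile(
    r"System load total=(?P<total>\d+\.\d+), "
    r"engine=(?P<engine>\d+\.\d+), "
    r"non-engine=(?P<non_engine>\d+\.\d+)"
)


def parse_loads(text):
    """Return a list of (total, engine, non_engine) floats found in text."""
    return [
        (float(m["total"]), float(m["engine"]), float(m["non_engine"]))
        for m in LOAD_RE.finditer(text)
    ]


# The three messages from the broker.log excerpt, without the log prefix.
SAMPLE = """\
System load total=0.8164, engine=0.0000, non-engine=0.8164
System load total=0.8480, engine=0.0000, non-engine=0.8480
System load total=0.7572, engine=0.2656, non-engine=0.4916
"""

loads = parse_loads(SAMPLE)
for total, engine, non_engine in loads:
    # In these samples the non-engine load equals total minus engine load.
    assert abs(total - engine - non_engine) < 1e-4
    print(f"total={total:.4f} engine={engine:.4f} non-engine={non_engine:.4f}")
```

With engine=0.0000 in the first two samples, the whole host load is attributed to non-engine work, which is what makes the zero cpu values reported right after VM start matter.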
vdsm.log [1] has:
2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'VM.create' in bridge...
2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats
response={'status': {'code': 0, 'message': 'Done'}, 'statsList':
[{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId':
'230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine',
'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true',
'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '',
'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
and 17 more such [2] lines. Line 11 is the first one with cpuUser !=
0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or
not), this is also the first line with 'network' in it. There are
other differences along the way - e.g. I see status moving from
WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is
number 7 - 40 seconds before cpuUser>0.
Milan should be able to help with this.
Thanks.
...
In storage monitoring we avoid this issue by reporting actual=False
before we got the first monitoring results, so engine can wait for the actual
results.
https://github.com/oVirt/vdsm/blob/4309a39492040300e1b983eb583e8847f5cc7538/...
Makes sense. That's indeed what I was looking for, for VM cpu usage.
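A rough sketch of what the same idea could look like for VM cpu usage, to make the suggestion concrete. This is not vdsm code: the `CpuStats` class, the `engine_cpu_or_none` helper and an `actual` key in the VM stats are all hypothetical, and only `cpuUser` and `cpuSys` are taken from the getStats output quoted above.

```python
class CpuStats:
    """Hypothetical holder for a VM's CPU usage sample (not a vdsm class)."""

    def __init__(self):
        self._sample = None  # no monitoring result has arrived yet

    def update(self, cpu_user, cpu_sys):
        """Store the first or a newer real measurement."""
        self._sample = (cpu_user, cpu_sys)

    def report(self):
        """Return stats, flagging placeholder values with actual=False."""
        if self._sample is None:
            # Placeholder zeros, marked so a consumer can tell them
            # apart from a genuinely idle VM.
            return {"cpuUser": "0.00", "cpuSys": "0.00", "actual": False}
        cpu_user, cpu_sys = self._sample
        return {
            "cpuUser": f"{cpu_user:.2f}",
            "cpuSys": f"{cpu_sys:.2f}",
            "actual": True,
        }


def engine_cpu_or_none(stats):
    """Consumer side: ignore samples that are not real measurements."""
    if not stats.get("actual", False):
        return None
    return float(stats["cpuUser"]) + float(stats["cpuSys"])


stats = CpuStats()
print(engine_cpu_or_none(stats.report()))  # no real data yet
stats.update(1.5, 0.5)
print(engine_cpu_or_none(stats.report()))  # real data available
```

The point is only that the consumer can distinguish "no data yet" from "0.00", instead of treating the placeholder as a measurement.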
...
I'd like to clarify that I do not see this mainly as an OST issue, but
more as a general HE HA issue - if users start global maint, then
restart the engine vm, then exit global maint too quickly, the
reported high cpu load might make the machine go down. In OST, I can
easily just add another 60 seconds or so delay after the engine is up.
Of course we can do the same also in HA, and I'd be for doing this, if
we do not get any more information (or find out that this is a
recently-introduced bug and fix it).
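For illustration, a minimal sketch of such a delay on the HA side, assuming a grace period during which a zero engine load is not trusted. This is not how ovirt-hosted-engine-ha is implemented. `GRACE_SECONDS`, `non_engine_load` and the `vm_elapsed` parameter are hypothetical, and the 120 second value is just a guess that covers the roughly 92 second gap seen in vdsm.log above.

```python
# Assumed grace period in seconds; chosen only to exceed the ~92 s gap
# observed above before cpuUser became non-zero.
GRACE_SECONDS = 120


def non_engine_load(total, engine, vm_elapsed, grace=GRACE_SECONDS):
    """Return the non-engine load, or None while it cannot be trusted.

    total      -- total host cpu load (0.0 to 1.0)
    engine     -- load attributed to the engine VM (0.0 to 1.0)
    vm_elapsed -- seconds since the engine VM was started
    """
    if vm_elapsed < grace and engine == 0.0:
        # The VM was started recently and still reports zero cpu usage,
        # so the zero is likely a placeholder rather than a measurement.
        return None
    return max(total - engine, 0.0)


# Values from the broker.log excerpt; the elapsed times are illustrative.
print(non_engine_load(0.8164, 0.0, vm_elapsed=2))
print(non_engine_load(0.7572, 0.2656, vm_elapsed=123))
```

A caller would skip the cpu load penalty while the function returns None, and resume normal scoring once real values arrive or the grace period ends.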
If this is a real issue you should be able to reproduce this on a real system.
In "real", you might refer to two different things:

1. OST is a different environment - has ridiculously little memory/cpu, etc.,
or something else that is not expected or not recommended for a real system.

2. The _flow_ is not real. As in, it's unlikely that a real user will exit
global maintenance so quickly after starting the engine VM, without looking
around a bit more.

I agree with both - and even if it's eventually considered a real bug, I'd
not consider it severe. But just saying "OST is not a real system" is not
something I can completely agree with. We have a balance/tradeoff here between
trying to imitate "real systems" as accurately as possible and doing
this efficiently/effectively. I do not think there is a deliberate design
choice to make it arbitrarily different from real systems.

Best regards,
-- 
Didi