On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek
<michal.skrivanek(a)redhat.com> wrote:
I don’t really know for sure, but AFAICT it should be real data from
the start.
Maybe for the first interval, but afterwards it’s always a libvirt-reported value.
Adding Nir. Not sure who else... sorry.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
Console has:
06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the engine VM... (test_008_restart_he_vm:96)
broker.log has (I think it only logs once a minute):
Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8164, engine=0.0000, non-engine=0.8164
Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8480, engine=0.0000, non-engine=0.8480
Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.7572, engine=0.2656, non-engine=0.4916
vdsm.log [1] has:
2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'VM.create' in bridge...
2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats
    response={'status': {'code': 0, 'message': 'Done'},
    'statsList': [{'statusTime': '2152587436', 'status': 'WaitForLaunch',
    'vmId': '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine',
    'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true',
    'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '',
    'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
and 17 more such [2] lines. Line 11 is the first one with cpuUser !=
0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or
not), this is also the first line with 'network' in it. There are
other differences along the way, e.g. I see status moving from
'WaitForLaunch' to 'Powering up' and to 'Up', but the first 'Up' line
is number 7, 40 seconds before cpuUser > 0.
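
For anyone who wants to repeat this count without eyeballing grep
output, here is a rough Python sketch. It assumes each getStats entry
sits on a single line in vdsm.log (the excerpt above is wrapped for
mail) and that the fields look as quoted. The function names are mine
and purely illustrative, they are not part of vdsm.

```python
import re

# Patterns are based only on the vdsm.log excerpt quoted in this mail.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
# The RPC-level 'status' is a dict, so this only matches the VM status string.
STATUS_RE = re.compile(r"'status': '([^']+)'")
CPU_USER_RE = re.compile(r"'cpuUser': '([0-9.]+)'")


def getstats_samples(lines):
    """Yield (timestamp, vm_status, cpu_user) for each FINISH getStats line."""
    for line in lines:
        if "FINISH getStats" not in line:
            continue
        ts = TS_RE.match(line)
        cpu = CPU_USER_RE.search(line)
        if not ts or not cpu:
            continue  # not a VM stats line we can parse
        status = STATUS_RE.search(line)
        vm_status = status.group(1) if status else None
        yield ts.group(1), vm_status, float(cpu.group(1))


def first_nonzero_cpu(lines):
    """Return (1-based sample number, timestamp) of the first cpuUser > 0."""
    for n, (ts, _status, cpu) in enumerate(getstats_samples(lines), start=1):
        if cpu > 0:
            return n, ts
    return None
```

Usage would be something like `first_nonzero_cpu(open("vdsm.log"))`,
ideally on a log already narrowed down to the relevant minutes.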
I'd like to clarify that I do not see this mainly as an OST issue, but
more as a general HE HA issue: if users start global maintenance, then
restart the engine VM, then exit global maintenance too quickly, the
reported high (non-engine) CPU load might make the machine go down. In
OST, I can easily just add another 60 seconds or so of delay after the
engine is up. Of course we can do the same in HA as well, and I'd be in
favor of doing this if we do not get any more information (or find out
that this is a recently-introduced bug and fix it).
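
As an alternative to a fixed sleep, the wait could in principle poll
until the VM reports real CPU data. The following is only a sketch of
that idea in Python, not what OST or the HA agent do today. `get_stats`
is a hypothetical callable returning a dict shaped like one 'statsList'
entry from the vdsm.log excerpt, and the timeout and interval defaults
are arbitrary. It also assumes that a booting engine VM will eventually
report cpuUser > 0, which I have not verified.

```python
import time


def wait_for_real_cpu_stats(get_stats, timeout=180, interval=5,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll until the VM is 'Up' and reports a non-zero cpuUser.

    get_stats is a hypothetical callable returning a dict with string
    values, like one 'statsList' entry in the vdsm.log excerpt.
    Returns True once real data shows up, False if the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        stats = get_stats()
        cpu_user = float(stats.get("cpuUser", "0"))
        if stats.get("status") == "Up" and cpu_user > 0:
            return True
        if clock() >= deadline:
            return False  # caller decides whether to proceed anyway
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be
exercised without really waiting. On timeout the caller would fall back
to the current behaviour.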
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/21...
[2] grep -i " 05:2[5678].*api. finish getStats.*cpuUser':"
Thanks and best regards,
--
Didi