On Sun, Aug 8, 2021 at 10:14 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek
> <michal.skrivanek(a)redhat.com> wrote:
> > I don’t really know for sure, but AFAICT it should be real data from
> > the start.
> > Maybe for the first interval, but afterwards it’s always a libvirt
> > reported value
>
> Adding Nir. Not sure who else... sorry.
>
> This now happened again:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
>
> Console has:
>
> 06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the
> engine VM... (test_008_restart_he_vm:96)
>
> broker.log has (I think it only logs once a minute):
>
> Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
> System load total=0.8164, engine=0.0000, non-engine=0.8164
> Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
> System load total=0.8480, engine=0.0000, non-engine=0.8480
> Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
> System load total=0.7572, engine=0.2656, non-engine=0.4916
>
> vdsm.log [1] has:
>
> 2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer]
> Calling 'VM.create' in bridge...
>
> 2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats
> response={'status': {'code': 0, 'message': 'Done'}, 'statsList':
> [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId':
> '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine',
> 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true',
> 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '',
> 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
>
> and 17 more such [2] lines. Line 11 is the first one with cpuUser !=
> 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or
> not), this is also the first line with 'network' in it. There are
> other differences along the way - e.g. I see status moving from
> WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is
> number 7 - 40 seconds before cpuUser>0.

Milan should be able to help with this.

In storage monitoring we avoid this issue by reporting actual=False
until we get the first monitoring results, so the engine can wait for
the actual results.
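
As a rough illustration of that pattern (this is not vdsm's actual code;
CpuSample, CpuMonitor and engine_load are hypothetical names made up for
this sketch), a monitor can flag placeholder values so that a consumer
skips them instead of treating them as real load:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CpuSample:
    """One reported CPU sample (hypothetical structure for this sketch)."""
    cpu_user: float
    cpu_sys: float
    actual: bool  # False until the first real measurement has arrived


class CpuMonitor:
    """Holds the latest sample; reports a flagged placeholder before that."""

    def __init__(self) -> None:
        self._last: Optional[CpuSample] = None

    def update(self, cpu_user: float, cpu_sys: float) -> None:
        # Called once a real measurement is available.
        self._last = CpuSample(cpu_user, cpu_sys, actual=True)

    def report(self) -> CpuSample:
        if self._last is None:
            # No measurement yet: report zeros, but mark them as not actual.
            return CpuSample(0.0, 0.0, actual=False)
        return self._last


def engine_load(sample: CpuSample) -> Optional[float]:
    """Return the load to use, or None if the caller should keep waiting."""
    if not sample.actual:
        return None
    return sample.cpu_user + sample.cpu_sys
```

With something like this, the 0.00 cpuUser/cpuSys values reported before
the first sampling interval would carry actual=False, and a consumer
could hold off on using them until real data arrives.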

> I'd like to clarify that I do not see this mainly as an OST issue, but
> more as a general HE HA issue - if users start global maint, then
> restart the engine vm, then exit global maint too quickly, the
> reported high cpu load might make the machine go down. In OST, I can
> easily just add another 60 seconds or so delay after the engine is up.
> Of course we can do the same also in HA, and I'd be for doing this, if
> we do not get any more information (or find out that this is a
> recently-introduced bug and fix it).

If this is a real issue, you should be able to reproduce it on a real system.

Nir