On Mon, Aug 9, 2021 at 1:39 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
On Sun, Aug 8, 2021 at 10:14 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >
> > On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek
> > <michal.skrivanek(a)redhat.com> wrote:
> > > I don’t really know for sure, but AFAICT it should be real data from the
> > > start.
> > > Maybe for the first interval, but afterwards it’s always a libvirt
> > > reported value
> >
> > Adding Nir. Not sure who else... sorry.
> >
> > This now happened again:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
> >
> > Console has:
> >
> > 06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the
> > engine VM... (test_008_restart_he_vm:96)
> >
> > broker.log has (I think it only logs once a minute):
> >
> > Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
> > System load total=0.8164, engine=0.0000, non-engine=0.8164
> > Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
> > System load total=0.8480, engine=0.0000, non-engine=0.8480
> > Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
> > System load total=0.7572, engine=0.2656, non-engine=0.4916
> >
> > vdsm.log [1] has:
> >
> > 2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer]
> > Calling 'VM.create' in bridge...
> >
> > 2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats
> > response={'status': {'code': 0, 'message': 'Done'}, 'statsList':
> > [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId':
> > '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine',
> > 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true',
> > 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '',
> > 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
> >
> > and 17 more such [2] lines. Line 11 is the first one with cpuUser !=
> > 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or
> > not), this is also the first line with 'network' in it. There are
> > other differences along the way - e.g. I see status moving from
> > WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is
> > number 7 - 40 seconds before cpuUser>0.

Milan should be able to help with this.

Thanks.

In storage monitoring we avoid this issue by reporting actual=False
before we get the first monitoring results, so engine can wait for the actual
results.
https://github.com/oVirt/vdsm/blob/4309a39492040300e1b983eb583e8847f5cc75...

Makes sense. That's indeed what I was looking for, for VM cpu usage.

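A minimal sketch of what the same idea could look like for VM CPU stats.
This is an illustration only, not vdsm's code: the class name CpuStatsCache,
its methods, and an 'actual' key in the VM stats are all assumptions made for
the example. The point is that the placeholder 0.00 values are explicitly
marked as "not measured yet" until the first real sample arrives.

```python
import threading


class CpuStatsCache:
    """Hypothetical holder for one VM's CPU stats (not vdsm's actual class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sample = None  # None means: no real sample collected yet

    def update(self, cpu_user, cpu_sys):
        # Called by the (assumed) sampling thread once real values exist.
        with self._lock:
            self._sample = (cpu_user, cpu_sys)

    def report(self):
        # Before the first sample, return placeholders flagged actual=False
        # so that a consumer can wait instead of trusting the zeros.
        with self._lock:
            if self._sample is None:
                return {"cpuUser": "0.00", "cpuSys": "0.00", "actual": False}
            cpu_user, cpu_sys = self._sample
            return {
                "cpuUser": "%.2f" % cpu_user,
                "cpuSys": "%.2f" % cpu_sys,
                "actual": True,
            }
```

A consumer would then check the flag first and skip any report where it is
False, rather than treating cpuUser '0.00' as a measurement.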
> > I'd like to clarify that I do not see this mainly as an OST issue, but
> > more as a general HE HA issue - if users start global maint, then
> > restart the engine vm, then exit global maint too quickly, the
> > reported high cpu load might make the machine go down. In OST, I can
> > easily just add another 60 seconds or so delay after the engine is up.
> > Of course we can do the same also in HA, and I'd be for doing this, if
> > we do not get any more information (or find out that this is a
> > recently-introduced bug and fix it).
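
On the HA side, a rough sketch of how the load calculation could refuse to
produce a non-engine value while the engine VM's CPU stats are not yet real.
This is not the ovirt-hosted-engine-ha code: the function name and the
engine_stats_actual parameter are hypothetical. The subtraction itself matches
the broker.log lines above (total=0.7572, engine=0.2656, non-engine=0.4916).

```python
def non_engine_load(total_load, engine_load, engine_stats_actual):
    """Return the host load not caused by the engine VM, or None if unknown.

    engine_stats_actual is an assumed flag saying whether engine_load comes
    from a real sample rather than from an initial placeholder value.
    """
    if not engine_stats_actual:
        # The engine VM's share is unknown, so attributing the whole load
        # to "non-engine" would be misleading. Let the caller wait instead.
        return None
    # Clamp at zero in case the two values were sampled at different times.
    return max(total_load - engine_load, 0.0)
```

With such a check, the caller could keep the previous score, or skip the CPU
load penalty, until a value other than None is returned.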

If this is a real issue you should be able to reproduce this on a real system.

In "real", you might refer to two different things:

1. OST is a different environment - has ridiculously little memory/cpu, etc.,
or something else that is not expected or not recommended for a real system.

2. The _flow_ is not real. As in, it's unlikely that a real user will exit
global maintenance so quickly after starting the engine VM, without looking
around a bit more.

I agree with both - and even if it's eventually considered a real bug, I'd
not consider it severe. But just saying "OST is not a real system" is not
something I can completely agree with. We have a balance/tradeoff here between
trying to imitate "real systems" as accurately as possible and doing
this efficiently/effectively. I do not think there is a deliberate design
choice to make it arbitrarily different from real systems.

Best regards,
--
Didi