
On Mon, Aug 9, 2021 at 1:39 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Aug 8, 2021 at 10:14 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Thu, Aug 5, 2021 at 9:31 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Aug 4, 2021 at 1:56 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
I don’t really know for sure, but AFAICT it should be real data from the start. The first interval may be an exception, but afterwards it’s always a libvirt-reported value.
Adding Nir. Not sure who else... sorry.
This now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2129/
Console has:
06:25:25 2021-08-05 03:25:25+0000,873 INFO [root] Starting the engine VM... (test_008_restart_he_vm:96)
broker.log has (I think it only logs once a minute):
Thread-4::INFO::2021-08-05 05:25:31,995::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8164, engine=0.0000, non-engine=0.8164
Thread-4::INFO::2021-08-05 05:26:32,072::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.8480, engine=0.0000, non-engine=0.8480
Thread-4::INFO::2021-08-05 05:27:32,175::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.7572, engine=0.2656, non-engine=0.4916
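To see how long the broker kept reporting engine=0.0000, the calculate_load lines above can be parsed mechanically. This is only a minimal sketch: it assumes nothing beyond the line format quoted above, and the function names are made up for illustration; they are not part of ovirt-hosted-engine-ha.

```python
import re
from datetime import datetime

# Matches the calculate_load lines as they appear in the broker.log excerpt.
LOAD_RE = re.compile(
    r"::INFO::(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+::"
    r".*?System load total=(?P<total>\d+\.\d+), "
    r"engine=(?P<engine>\d+\.\d+), non-engine=(?P<non_engine>\d+\.\d+)"
)


def parse_load_samples(text):
    """Return a list of (timestamp, total, engine, non_engine) tuples."""
    samples = []
    for m in LOAD_RE.finditer(text):
        # Milliseconds are dropped; second resolution is enough here.
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        samples.append((
            ts,
            float(m.group("total")),
            float(m.group("engine")),
            float(m.group("non_engine")),
        ))
    return samples


def first_nonzero_engine(samples):
    """Return the first sample whose engine load is above zero, or None."""
    for sample in samples:
        if sample[2] > 0.0:
            return sample
    return None
```

Run on the three lines quoted above, the first sample with a non-zero engine load is the one logged at 05:27:32, about two minutes after the first sample.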
vdsm.log [1] has:
2021-08-05 05:25:29,017+0200 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer] Calling 'VM.create' in bridge...
2021-08-05 05:25:31,991+0200 DEBUG (jsonrpc/7) [api] FINISH getStats response={'status': {'code': 0, 'message': 'Done'}, 'statsList': [{'statusTime': '2152587436', 'status': 'WaitForLaunch', 'vmId': '230ea8e8-e365-46cd-98fa-e9d6a653306f', 'vmName': 'HostedEngine', 'vmType': 'kvm', 'kvmEnable': 'true', 'acpiEnable': 'true', 'elapsedTime': '2', 'monitorResponse': '0', 'clientIp': '', 'timeOffset': '0', 'cpuUser': '0.00', 'cpuSys': '0.00',...
and 17 more such [2] lines. Line 11 is the first one with cpuUser != 0.00, at '2021-08-05 05:27:02', 92 seconds later. Incidentally (or not), this is also the first line with 'network' in it. There are other differences along the way, e.g. I see status moving from WaitForLaunch to 'Powering up' and to 'Up', but the first 'Up' line is number 7, which is 40 seconds before cpuUser>0.
Milan should be able to help with this.
Thanks.
In storage monitoring we avoid this issue by reporting actual=False until we get the first monitoring results, so engine can wait for the actual results:
https://github.com/oVirt/vdsm/blob/4309a39492040300e1b983eb583e8847f5cc7538/...
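The idea of flagging results as not yet actual could look roughly like this if applied to VM CPU stats. This is a hypothetical sketch, not vdsm code: the class name, the update/report methods and the use of an "actual" key in VM stats are all assumptions made for illustration.

```python
import threading


class CpuStatsMonitor:
    """Hypothetical holder for the latest CPU sample of one VM."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sample = None  # No sample collected yet.

    def update(self, cpu_user, cpu_sys):
        """Store a freshly collected sample (called by the sampling thread)."""
        with self._lock:
            self._sample = (cpu_user, cpu_sys)

    def report(self):
        """Return stats, marking them as not actual until a sample exists."""
        with self._lock:
            if self._sample is None:
                # Placeholder values; the consumer should ignore them.
                return {"actual": False, "cpuUser": "0.00", "cpuSys": "0.00"}
            cpu_user, cpu_sys = self._sample
            return {
                "actual": True,
                "cpuUser": "%.2f" % cpu_user,
                "cpuSys": "%.2f" % cpu_sys,
            }
```

A consumer such as the HA broker could then skip any report with actual set to False instead of treating 0.00 as a real measurement.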
Makes sense. That's indeed what I was looking for, for VM cpu usage.
I'd like to clarify that I do not see this mainly as an OST issue, but more as a general HE HA issue - if users start global maintenance, then restart the engine VM, then exit global maintenance too quickly, the reported high CPU load might make the machine go down.

In OST, I can easily just add another 60 seconds or so of delay after the engine is up. Of course we can do the same in HA as well, and I'd be for doing this, if we do not get any more information (or find out that this is a recently introduced bug and fix it).
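As an alternative to a fixed delay, a test could poll until the stats look real. This is a sketch under assumptions: get_stats is a hypothetical callable returning one VM's stats dict with the 'status' and 'cpuUser' keys seen in the vdsm.log excerpt, and treating cpuUser > 0 as "real data" is a heuristic that would never succeed for a VM that really is idle.

```python
import time


def wait_for_real_cpu_stats(get_stats, timeout=120.0, interval=5.0,
                            clock=time.monotonic, sleep=time.sleep):
    """Poll get_stats() until the VM is Up and reports non-zero cpuUser.

    Returns True if that happened before the timeout, False otherwise.
    clock and sleep are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        stats = get_stats()
        is_up = stats.get("status") == "Up"
        has_cpu = float(stats.get("cpuUser", "0.00")) > 0.0
        if is_up and has_cpu:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Unlike a fixed 60-second sleep, this returns as soon as the condition holds, and it fails explicitly when the condition is never met.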
If this is a real issue you should be able to reproduce this on a real system.
In "real", you might refer to two different things:

1. OST is a different environment - has ridiculously little memory/cpu, etc., or something else that is not expected or not recommended for a real system.
2. The _flow_ is not real. As in, it's unlikely that a real user will exit global maintenance so quickly after starting the engine VM, without looking around a bit more.

I agree with both - and even if it's eventually considered a real bug, I'd not consider it severe. But just saying "OST is not a real system" is not something I can completely agree with. We have a balance/tradeoff here between trying to imitate "real systems" as accurately as possible and doing this efficiently/effectively. I do not think there is a deliberate design choice to make it arbitrarily different from real systems.

Best regards,
--
Didi