[ OST Failure Report ] [ oVirt Master (log-collector/engine/vdsm) ] [ 23/25/26-03-2018 ] [ 004_basic_sanity.hotplug_cpu ]

Hi,

We have a failure that seems to be random and is happening in several projects. From what I can see, we are failing due to a timing issue in the test itself, because we are querying the VM after it has been destroyed in the engine. Looking at the engine log, I can see that the VM was actually shut down. I would like to disable this test until we can fix the issue, since it has already failed about 7 different patches from different projects.

Link to job:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/6521/

Link to all logs:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/6521/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-004_basic_sanity.py/

(Relevant) error snippet from the log:

<error>

engine:

2018-03-23 14:44:14,156-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-5) [] Failed to destroy VM '7020944f-cd26-4c31-b381-0ce6be733fd7' because VM does not exist, ignoring
2018-03-23 14:44:14,156-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-5) [] FINISH, DestroyVDSCommand, log id: 646766a
2018-03-23 14:44:14,156-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (ForkJoinPool-1-worker-5) [] method: runVdsCommand, params: [Destroy, DestroyVmVDSCommandParameters:{hostId='c40b11df-7053-437e-8acc-c5e91c1694bc', vmId='7020944f-cd26-4c31-b381-0ce6be733fd7', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}], timeElapsed: 14ms
2018-03-23 14:44:14,156-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (ForkJoinPool-1-worker-5) [] method: getVmManager, params: [7020944f-cd26-4c31-b381-0ce6be733fd7], timeElapsed: 0ms
2018-03-23 14:44:14,156-04 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-5) [] VM '7020944f-cd26-4c31-b381-0ce6be733fd7'(vm2) moved from 'Up' --> 'Down'
2018-03-23 14:44:14,156-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (ForkJoinPool-1-worker-5) [] method: getVmManager, params: [7020944f-cd26-4c31-b381-0ce6be733fd7], timeElapsed: 0ms
2018-03-23 14:44:14,177-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-4) [a463501f-f01a-4048-93ca-02647308394b] EVENT_ID: USER_STOP_VM(33), VM vm2 powered off by admin@internal-authz (Host: lago-basic-suite-master-host-0).

vdsm:

2018-03-23 14:44:14,024-0400 INFO  (libvirt/events) [virt.vm] (vmId='7020944f-cd26-4c31-b381-0ce6be733fd7') Changed state to Down: Admin shut down from the engine (code=6) (vm:1684)
2018-03-23 14:44:14,028-0400 INFO  (jsonrpc/1) [api.virt] FINISH destroy return={'status': {'message': 'Machine destroyed', 'code': 0}} from=::ffff:192.168.201.4,53454, flow_id=a463501f-f01a-4048-93ca-02647308394b, vmId=7020944f-cd26-4c31-b381-0ce6be733fd7 (api:52)
2018-03-23 14:44:14,028-0400 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call VM.destroy succeeded in 1.72 seconds (__init__:311)
2018-03-23 14:44:14,031-0400 INFO  (periodic/2) [Executor] Worker was discarded (executor:305)
2018-03-23 14:44:14,033-0400 INFO  (libvirt/events) [virt.vm] (vmId='7020944f-cd26-4c31-b381-0ce6be733fd7') Stopping connection (guestagent:440)
2018-03-23 14:44:14,033-0400 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:666)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 629, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
2018-03-23 14:44:14,078-0400 INFO  (jsonrpc/4) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.201.4,53454, vmId=7020944f-cd26-4c31-b381-0ce6be733fd7 (api:46)
2018-03-23 14:44:14,078-0400 INFO  (jsonrpc/4) [api] FINISH destroy error=Virtual machine does not exist: {'vmId': u'7020944f-cd26-4c31-b381-0ce6be733fd7'} (api:127)
2018-03-23 14:44:14,079-0400 INFO  (jsonrpc/4) [api.virt] FINISH destroy return={'status': {'message': "Virtual machine does not exist: {'vmId': u'7020944f-cd26-4c31-b381-0ce6be733fd7'}", 'code': 1}} from=::ffff:192.168.201.4,53454, vmId=7020944f-cd26-4c31-b381-0ce6be733fd7 (api:52)

</error>

Dafna Ron <dron@redhat.com> writes:
We have a failure that seems to be random and happening in several projects.
Does this failure occur only recently or has it been present for ages?
From what I can see, we are failing due to a timing issue in the test itself, because we are querying the VM after it has been destroyed in the engine. Looking at the engine, I can see that the VM was actually shut down.
No, the VM was shut down in another test. It's already running again in hotplug_cpu.
I would like to disable this test until we can fix the issue, since it has already failed about 7 different patches from different projects.
I can see that Gal has already increased the timeout. I think the test could be split to reduce the waiting delay; I'll post a patch for that.

BTW, I think the primary causes of the trouble are the infamous Cirros networking recovery problems.

Thanks Milan,

Can you post the fix that you added on the mail?

Thanks,
Dafna

Dafna Ron <dron@redhat.com> writes:
Can you post the fix that you added on the mail?
I split the hotplug_cpu test in https://gerrit.ovirt.org/89551. We'll see whether it is actually useful once the disk_operations failure is fixed. If it still causes problems, then hotplug_cpu_guest_check, introduced by the patch, should be disabled until we can log into the guest OS reliably.
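For readers who don't have the patch open, here is a rough sketch of the splitting idea. This is not the actual content of the Gerrit change (OST uses its own test helpers and decorators); the engine URL, credentials, VM name, Cirros login and the use of paramiko below are assumptions made up for the example.

# Illustrative sketch only -- not the code from https://gerrit.ovirt.org/89551.
# Engine URL, credentials, VM name and guest login are hypothetical.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(url='https://engine.example.com/ovirt-engine/api',
                            username='admin@internal', password='secret',
                            insecure=True)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm2')[0]
vm_service = vms_service.vm_service(vm.id)


def hotplug_cpu():
    # Engine-side step: bump the socket count and wait until the engine
    # reports the new topology. No guest login is involved here.
    vm_service.update(types.Vm(cpu=types.Cpu(
        topology=types.CpuTopology(sockets=2, cores=1, threads=1))))
    deadline = time.time() + 180
    while vm_service.get().cpu.topology.sockets != 2:
        assert time.time() < deadline, 'engine never reported 2 sockets'
        time.sleep(3)


def hotplug_cpu_guest_check(guest_ip):
    # Guest-side step: log into the guest and verify it actually sees the
    # hotplugged CPU. This is the part that depends on flaky Cirros
    # networking, so it can be disabled on its own if it keeps failing.
    import paramiko  # assumed available; Cirros credentials vary by version
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(guest_ip, username='cirros', password='gocubsgo', timeout=30)
    try:
        _, stdout, _ = client.exec_command('nproc')
        assert stdout.read().decode().strip() == '2'
    finally:
        client.close()

With such a split, the engine-side check keeps running even if the guest-side check has to be skipped until guest logins are reliable.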

Milan Zamazal <mzamazal@redhat.com> writes:
Dafna Ron <dron@redhat.com> writes:
Can you post the fix that you added on the mail?
I split hotplug_cpu test in https://gerrit.ovirt.org/89551
After some thought, I have a different fix: https://gerrit.ovirt.org/89589. This one checks whether SSH (and not only the network) is available after resume, and if not, it reboots the VM as if the network weren't available.
We'll see whether it is actually useful once the disk_operations failure is fixed. If it still causes problems, then hotplug_cpu_guest_check, introduced by the patch, should be disabled until we can log into the guest OS reliably.
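For illustration, a minimal sketch of the SSH-availability approach described above, not the content of the Gerrit change itself (OST wires this into its own helpers); the guest credentials, timeouts and the use of paramiko are assumptions.

# Illustrative sketch only -- not the code from https://gerrit.ovirt.org/89589.
# Idea: after resume, wait until a real SSH login succeeds (not merely until
# the network answers); if it never does, reboot the VM through the engine,
# the same recovery path already used when the network does not come back.
import socket
import time

import paramiko


def ssh_alive(host, username='cirros', password='gocubsgo', timeout=10):
    # True only if we can open an actual SSH session, not just reach port 22.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username=username, password=password,
                       timeout=timeout, banner_timeout=timeout)
        return True
    except (paramiko.SSHException, socket.error):
        return False
    finally:
        client.close()


def ensure_guest_reachable(vm_service, guest_ip, total_timeout=300):
    # vm_service is an ovirtsdk4 VmService; reboot the VM if SSH never comes up.
    deadline = time.time() + total_timeout
    while time.time() < deadline:
        if ssh_alive(guest_ip):
            return
        time.sleep(10)
    vm_service.reboot()  # fall back to the same recovery as for a dead network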
Participants (2):
- Dafna Ron
- Milan Zamazal