[ovirt-devel] [ OST Failure Report ] [ oVirt Master ] [ 05-12-2017 ] [ 006_migrations.migrate_vm ]

Milan Zamazal mzamazal at redhat.com
Mon Dec 11 16:44:40 UTC 2017


Michal Skrivanek <michal.skrivanek at redhat.com> writes:

> Milan,
>
> log [1], VM b3962e5c-08e3-444e-910e-ea675fa1a5c7
> migration away finished at 2017-12-05 06:26:24,515-0500
> incoming migration of the same VM back, at 2017-12-05 06:26:46,614-0500
>
> seems to me the migration away didn’t properly clean up the VM. Milane,
> can you check whether the logs match? Perhaps an orphaned libvirt XML?

It seems that, for unknown reasons, Engine didn’t call Destroy on the
source after the first migration.  So while the VM was no longer running
there, it was still present in Vdsm (in Down status), and thus the
subsequent create call was rejected by the API.
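
For context on why the second attempt is refused: the create call for the
return migration is rejected simply because Vdsm on the original source
still tracks the VM object, even though it is only in Down status.  Below
is a minimal, self-contained sketch of that kind of guard; the class and
method names are purely illustrative, the real check lives in vdsm/API.py
(see the traceback quoted further down):

  class VMExists(Exception):
      """Stands in for vdsm.common.exception.VMExists."""

  class VmRegistry(object):
      """Stands in for Vdsm's per-host container of known VMs."""

      def __init__(self):
          self._vms = {}                      # vmId -> status

      def create(self, vm_id):
          # A VM that migrated away but was never destroyed is still
          # here, only in 'Down' status, and that is enough to reject
          # the incoming create.
          if vm_id in self._vms:
              raise VMExists()
          self._vms[vm_id] = 'WaitForLaunch'

      def on_migration_done(self, vm_id):
          self._vms[vm_id] = 'Down'           # left behind after migrating away

      def destroy(self, vm_id):
          self._vms.pop(vm_id, None)          # the call Engine apparently skipped

  registry = VmRegistry()
  registry.create('b3962e5c-08e3-444e-910e-ea675fa1a5c7')      # VM runs here
  registry.on_migration_done('b3962e5c-08e3-444e-910e-ea675fa1a5c7')
  # Engine never calls registry.destroy(...), so the return migration fails:
  try:
      registry.create('b3962e5c-08e3-444e-910e-ea675fa1a5c7')  # migrationCreate
  except VMExists:
      print('Virtual machine already exists')                  # code 4 in the log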

There are weird messages in engine.log during the first migration, such
as the following ones; I don’t know whether they are harmless or not:

  2017-12-05 06:26:19,799-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7'(vm0) was unexpectedly detected as 'MigratingTo' on VDS '56bf5bdd-8a95-4b7a-9c72-7206d6b59e38'(lago-basic-suite-master-host-1) (expected on '104bd4da-24bb-4368-a5ca-21a465aca70e')
  2017-12-05 06:26:19,799-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7' is migrating to VDS '56bf5bdd-8a95-4b7a-9c72-7206d6b59e38'(lago-basic-suite-master-host-1) ignoring it in the refresh until migration is done
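
To judge whether such messages are benign, it can help to interleave every
mention of the VM from engine.log and both hosts' vdsm.log.  A rough helper
along these lines could do it; the file names are placeholders for locally
downloaded copies of the logs from the linked job, not the actual Jenkins
layout:

  VM_ID = 'b3962e5c-08e3-444e-910e-ea675fa1a5c7'
  LOGS = ['engine.log', 'vdsm.log.host-0', 'vdsm.log.host-1']   # local copies

  entries = []
  for path in LOGS:
      with open(path) as f:
          for line in f:
              if VM_ID in line:
                  # Both logs start lines with 'YYYY-MM-DD HH:MM:SS,mmm',
                  # so the first 23 characters work as a sort key.
                  entries.append((line[:23], path, line.rstrip()))

  for _stamp, path, line in sorted(entries):
      print('%-16s %s' % (path, line))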

The migration completion event was successfully obtained by Engine:

  2017-12-05 06:26:24,542-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: vm0, Source: lago-basic-suite-master-host-0, Destination: lago-basic-suite-master-host-1, Duration: 6 seconds, Total: 6 seconds, Actual downtime: 73ms)
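
If someone wants to confirm this on a live reproduction, something like the
sketch below could be run on the first migration's source host right after
that migration completes, to see whether the VM is still known to Vdsm in
Down status.  This is only a sketch; it assumes the vdsm.client jsonrpc
module and the Host.getVMList/VM.getStats verbs, and the exact signatures
and return shapes may differ between versions:

  from vdsm import client

  VM_ID = 'b3962e5c-08e3-444e-910e-ea675fa1a5c7'

  # Connect to the local Vdsm on the default jsonrpc port; the exact
  # connect() arguments are an assumption, adjust to the installed version.
  cli = client.connect('localhost', 54321)
  vm_ids = cli.Host.getVMList()
  if VM_ID in vm_ids:
      stats = cli.VM.getStats(vmID=VM_ID)
      print('VM still known to Vdsm, status: %s' % stats[0].get('status'))
      # cli.VM.destroy(vmID=VM_ID)   # the cleanup Engine should have requested
  else:
      print('VM was cleaned up as expected')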

> Otherwise it would need reproduction in CI and some more logs….
>
> [1]
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>
>> On 5 Dec 2017, at 13:26, Dafna Ron <dron at redhat.com> wrote:
>> 
>> Hi, 
>> We had a failure for test 006_migrations.migrate_vm on master. 
>> 
>> There was a libvirt disruption on the migration source, and after that vdsm reported the migration as failed because the VM already exists, which makes me suspect a split-brain case.
>> The reported patch has nothing to do with this issue; I think we simply stumbled on a race condition which can cause a split brain.
>> Please note that there are several metrics-related issues reported in the vdsm logs as well.
>> 
>> Link and headline of suspected patches: 
>> 
>> Not related 
>> 
>> Link to Job:
>> 
>> 
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/
>> 
>> Link to all logs:
>> 
>> 
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/
>> 
>> (Relevant) error snippet from the log: 
>> <error>
>> 
>> Engine: 
>> 2017-12-05 06:26:48,546-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-385) [] EVENT_ID: VM_MIGRATION_TO_SERVER_FAILED(120), Migration failed  (VM: vm0, Source: lago-basic-suite-master-host-1, Destination: lago-basic-suite-master-host-0).
>> 
>> 
>> dst:
>> 
>> 2017-12-05 06:26:46,615-0500 WARN  (jsonrpc/6) [vds] vm b3962e5c-08e3-444e-910e-ea675fa1a5c7 already exists (API:179)
>> 2017-12-05 06:26:46,615-0500 ERROR (jsonrpc/6) [api] FINISH create error=Virtual machine already exists (api:124)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 117, in method
>>     ret = func(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 180, in create
>>     raise exception.VMExists()
>> VMExists: Virtual machine already exists
>> 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [api.virt] FINISH create return={'status': {'message': 'Virtual machine already exists', 'code': 4}} from=::ffff:192.168.201.3,50394 (api:52)
>> 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [api.virt] FINISH migrationCreate return={'status': {'message': 'Virtual machine already exists', 'code': 4}} from=::ffff:192.168.201.3,50394 (api:52)
>> 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.migrationCreate failed (error 4) in 0.03 seconds (__init__:573)
>> 2017-12-05 06:26:46,624-0500 INFO  (jsonrpc/2) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.201.3,50394 (api:46)
>> 2017-12-05 06:26:46,624-0500 INFO  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Release VM resources (vm:4967)
>> 2017-12-05 06:26:46,625-0500 WARN  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') trying to set state to Powering down when already Down (vm:575)
>> 2017-12-05 06:26:46,625-0500 INFO  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Stopping connection (guestagent:435)
>> 2017-12-05 06:26:46,625-0500 INFO  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') _destroyVmGraceful attempt #0 (vm:5004)
>> 2017-12-05 06:26:46,626-0500 WARN  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:5025)
>> 2017-12-05 06:26:46,627-0500 INFO  (jsonrpc/2) [vdsm.api] START teardownImage(sdUUID='952bb427-b88c-4fbe-99ef-49970d3aaf70', spUUID='9dcfeaaf-96b7-4e26-a327-5570e0e39261', imgUUID='e6eadbae-ec7a-48f4-8832-64a622a12bef', volUUID=None) from=::ffff:192.168.201.3,50394, task_id=2da93725-5533-4354-a369-751eb44f9ef2 (api:46)
>> 
>> src:
>> 
>> 2017-12-05 06:26:46,623-0500 ERROR (migsrc/b3962e5c) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') migration destination error: Virtual machine already exists (migration:290)
>> 
>> disruption on dst: 
>> 
>> 2017-12-05 06:20:04,662-0500 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.shutdown succeeded in 0.00 seconds (__init__:573)
>> 2017-12-05 06:20:04,676-0500 ERROR (Thread-1) [root] Shutdown by QEMU Guest Agent failed (vm:5097)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5088, in qemuGuestAgentShutdown
>>     self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)
>>   File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
>>     ret = attr(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
>>     ret = f(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
>>     return func(inst, *args, **kwargs)
>>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2403, in shutdownFlags
>>     if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
>> libvirtError: Guest agent is not responding: QEMU guest agent is not connected
>> 2017-12-05 06:20:04,697-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') block threshold 1 exceeded on 'vda' (/rhev/data-center/mnt/blockSD/952bb427-b88c-4fbe-99ef-49970d3aaf70/images/e6eadbae-ec7a-48f4-8832-64a622a12bef/1f063c73-f5cd-44e8-a84f-7810857f82df) (drivemonitor:162)
>> 2017-12-05 06:20:04,698-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') drive 'vda' threshold exceeded (storage:872)
>> 2017-12-05 06:20:05,889-0500 INFO  (jsonrpc/7) [api.host] START getAllVmStats() from=::1,41118 (api:46)
>> 2017-12-05 06:20:05,891-0500 INFO  (jsonrpc/7) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,41118 (api:52)
>> 2017-12-05 06:20:05,892-0500 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:573)
>> 2017-12-05 06:20:07,466-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') underlying process disconnected (vm:1024)
>> 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Release VM resources (vm:4967)
>> 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Stopping connection (guestagent:435)
>> 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [vdsm.api] START teardownImage(sdUUID='952bb427-b88c-4fbe-99ef-49970d3aaf70', spUUID='9dcfeaaf-96b7-4e26-a327-5570e0e39261', imgUUID='e6eadbae-ec7a-48f4-8832-64a622a12bef', volUUID=None) from=internal, task_id=9efedf46-d3be-4e41-b7f9-a074ed6344f6 (api:46)
>> 
>> 
>> </error>
>> 
>> 
>> 
>> 
>> _______________________________________________
>> Devel mailing list
>> Devel at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel

