On Mon, Dec 11, 2017 at 6:44 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
Michal Skrivanek <michal.skrivanek@redhat.com> writes:

> Milan,
>
> log [1], VM b3962e5c-08e3-444e-910e-ea675fa1a5c7
> migration away finished at 2017-12-05 06:26:24,515-0500
> incoming migration of the same VM back, at 2017-12-05 06:26:46,614-0500
>
> seems to me the migration away didn't really clean up the VM
> properly. Milan, can you check whether the logs match? Perhaps an
> orphaned libvirt XML?

It seems that, for unknown reasons, Engine didn't call Destroy on the
source host after the first migration.  So while the VM was no longer
running there, it was still present in Vdsm (in Down status), and thus
the subsequent create call was rejected by the API.
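
To make this concrete, here is a minimal, self-contained Python sketch
(not the actual Vdsm code; the class and method names are made up) of
that failure mode: a VM that has migrated away stays in Vdsm's VM
container in Down status until Engine calls destroy, so another create
for the same vmId raises VMExists:

  class VMExists(Exception):
      """Stand-in for vdsm.common.exception.VMExists."""

  class VmContainer(object):
      def __init__(self):
          self._vms = {}                    # vmId -> VM state

      def create(self, vm_id):
          if vm_id in self._vms:            # still present, even if Down
              raise VMExists('Virtual machine already exists')
          self._vms[vm_id] = {'status': 'Up'}

      def migration_finished(self, vm_id):
          # Migration away completed: the VM stops running but is NOT removed.
          self._vms[vm_id]['status'] = 'Down'

      def destroy(self, vm_id):
          # Engine is expected to call this after the migration; in this run it didn't.
          self._vms.pop(vm_id, None)

  container = VmContainer()
  container.create('b3962e5c')
  container.migration_finished('b3962e5c')
  # No destroy() in between, so migrating the VM back fails:
  try:
      container.create('b3962e5c')
  except VMExists as e:
      print(e)                              # Virtual machine already exists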

There are weird messages in engine.log during the first migration,
such as the following ones; I don't know whether they are harmless or
not:

  2017-12-05 06:26:19,799-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7'(vm0) was unexpectedly detected as 'MigratingTo' on VDS '56bf5bdd-8a95-4b7a-9c72-7206d6b59e38'(lago-basic-suite-master-host-1) (expected on '104bd4da-24bb-4368-a5ca-21a465aca70e')
  2017-12-05 06:26:19,799-05 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7' is migrating to VDS '56bf5bdd-8a95-4b7a-9c72-7206d6b59e38'(lago-basic-suite-master-host-1) ignoring it in the refresh until migration is done

These messages are completely harmless: they only indicate that the VM's run_on_vds still points to the source host and the hand-over has not happened yet.
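
Just to illustrate the idea (VmAnalyzer itself is Java code in Engine;
the field and function names below are made up), the monitoring refresh
roughly behaves like this: while run_on_vds still points to the source
host, a status report for the VM coming from the destination host
produces those two INFO messages and is then skipped until the
hand-over:

  def should_process_report(vm, reporting_host):
      # Hypothetical field names, for illustration only.
      migrating_to = vm.get('migrating_to_vds')
      if migrating_to is not None and reporting_host == migrating_to:
          print("VM '%s' was unexpectedly detected as 'MigratingTo' on '%s' "
                "(expected on '%s')" % (vm['id'], reporting_host, vm['run_on_vds']))
          print("VM '%s' is migrating to '%s', ignoring it in the refresh "
                "until migration is done" % (vm['id'], migrating_to))
          return False                      # skip this report for now
      return True                           # normal monitoring path

  vm = {'id': 'b3962e5c', 'run_on_vds': 'host-0', 'migrating_to_vds': 'host-1'}
  should_process_report(vm, 'host-1')       # destination report -> ignored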
 

The migration completion event was successfully obtained by Engine:

  2017-12-05 06:26:24,542-05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: vm0, Source: lago-basic-suite-master-host-0, Destination: lago-basic-suite-master-host-1, Duration: 6 seconds, Total: 6 seconds, Actual downtime: 73ms)

> Otherwise it would need reproduction in CI and some more logs….
>
> [1]
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>
>> On 5 Dec 2017, at 13:26, Dafna Ron <dron@redhat.com> wrote:
>>
>> Hi,
>> We had a failure for test 006_migrations.migrate_vm on master.
>>
>> There was a libvirt disruption on the migration source, and after that vdsm reported the migration as failed because the VM already exists, which makes me suspect a split-brain case.
>> The reported patch has nothing to do with this issue; I think we simply stumbled on a race condition which can cause a split brain.
>> Please note that there are several metrics-related issues reported in the vdsm logs as well.
>>
>> Link and headline of suspected patches:
>>
>> Not related
>>
>> Link to Job:
>>
>>
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/
>>
>> Link to all logs:
>>
>>
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/
>>
>> (Relevant) error snippet from the log:
>> <error>
>>
>> Engine:
>> 2017-12-05 06:26:48,546-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-385) [] EVENT_ID: VM_MIGRATION_TO_SERVER_FAILED(120), Migration failed  (VM: vm0, Source: lago-
>> basic-suite-master-host-1, Destination: lago-basic-suite-master-host-0).
>>
>>
>> dst:
>>
>> 2017-12-05 06:26:46,615-0500 WARN  (jsonrpc/6) [vds] vm b3962e5c-08e3-444e-910e-ea675fa1a5c7 already exists (API:179)
>> 2017-12-05 06:26:46,615-0500 ERROR (jsonrpc/6) [api] FINISH create error=Virtual machine already exists (api:124)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 117, in method
>>     ret = func(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 180, in create
>>     raise exception.VMExists()
>> VMExists: Virtual machine already exists
>> 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [api.virt] FINISH create return={'status': {'message': 'Virtual machine already exists', 'code': 4}} from=::ffff:192.168.201.3,50394 (api:52)
>> 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [api.virt] FINISH migrationCreate return={'status': {'message': 'Virtual machine already exists', 'code': 4}} from=::ffff:192.168.201.3,50394 (api:52)
>> 2017-12-05 06:26:46,620-0500 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.migrationCreate failed (error 4) in 0.03 seconds (__init__:573)
>> 2017-12-05 06:26:46,624-0500 INFO  (jsonrpc/2) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.201.3,50394 (api:46)
>> 2017-12-05 06:26:46,624-0500 INFO  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Release VM resources (vm:4967)
>> 2017-12-05 06:26:46,625-0500 WARN  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') trying to set state to Powering down when already Down (vm:575)
>> 2017-12-05 06:26:46,625-0500 INFO  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Stopping connection (guestagent:435)
>> 2017-12-05 06:26:46,625-0500 INFO  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') _destroyVmGraceful attempt #0 (vm:5004)
>> 2017-12-05 06:26:46,626-0500 WARN  (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm
>> :5025)
>> 2017-12-05 06:26:46,627-0500 INFO  (jsonrpc/2) [vdsm.api] START teardownImage(sdUUID='952bb427-b88c-4fbe-99ef-49970d3aaf70', spUUID='9dcfeaaf-96b7-4e26-a327-5570e0e39261', imgUUID='e6eadbae-ec7a-48f4-8832-64a622a12bef', volUUID=None) from
>> =::ffff:192.168.201.3,50394, task_id=2da93725-5533-4354-a369-751eb44f9ef2 (api:46)
>>
>> src:
>>
>> 2017-12-05 06:26:46,623-0500 ERROR (migsrc/b3962e5c) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') migration destination error: Virtual machine already exists (migration:290)
>>
>> disruption on dst:
>>
>> 2017-12-05 06:20:04,662-0500 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.shutdown succeeded in 0.00 seconds (__init__:573)
>> 2017-12-05 06:20:04,676-0500 ERROR (Thread-1) [root] Shutdown by QEMU Guest Agent failed (vm:5097)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5088, in qemuGuestAgentShutdown
>>     self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)
>>   File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
>>     ret = attr(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
>>     ret = f(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
>>     return func(inst, *args, **kwargs)
>>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2403, in shutdownFlags
>>     if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
>> libvirtError: Guest agent is not responding: QEMU guest agent is not connected
>> 2017-12-05 06:20:04,697-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') block threshold 1 exceeded on 'vda' (/rhev/data-center/mnt/blockSD/952bb427-b88c-4fbe-99ef-49970d3aaf70/images/e6eadbae-ec7a-48f4-
>> 8832-64a622a12bef/1f063c73-f5cd-44e8-a84f-7810857f82df) (drivemonitor:162)
>> 2017-12-05 06:20:04,698-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') drive 'vda' threshold exceeded (storage:872)
>> 2017-12-05 06:20:05,889-0500 INFO  (jsonrpc/7) [api.host] START getAllVmStats() from=::1,41118 (api:46)
>> 2017-12-05 06:20:05,891-0500 INFO  (jsonrpc/7) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,41118 (api:52)
>> 2017-12-05 06:20:05,892-0500 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:573)
>> 2017-12-05 06:20:07,466-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') underlying process disconnected (vm:1024)
>> 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Release VM resources (vm:4967)
>> 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Stopping connection (guestagent:435)
>> 2017-12-05 06:20:07,469-0500 INFO  (libvirt/events) [vdsm.api] START teardownImage(sdUUID='952bb427-b88c-4fbe-99ef-49970d3aaf70', spUUID='9dcfeaaf-96b7-4e26-a327-5570e0e39261', imgUUID='e6eadbae-ec7a-48f4-8832-64a622a12bef', volUUID=None)
>>  from=internal, task_id=9efedf46-d3be-4e41-b7f9-a074ed6344f6 (api:46)
>>
>>
>> </error>
>>
>>
>>
>>
>> _______________________________________________
>> Devel mailing list
>> Devel@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/devel
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel