On Mon, Dec 11, 2017 at 6:44 PM, Milan Zamazal <mzamazal(a)redhat.com> wrote:
Michal Skrivanek <michal.skrivanek(a)redhat.com> writes:
> Milan,
>
> log [1], VM b3962e5c-08e3-444e-910e-ea675fa1a5c7
> migration away finished at 2017-12-05 06:26:24,515-0500
> incoming migration of the same VM back, at 2017-12-05 06:26:46,614-0500
>
> seems to me the migration away didn’t really properly clean up the VM.
> Milane, can you check that if logs matches? Perhaps orphaned libvirt’s xml?
It seems that, for unknown reasons, Engine didn’t call Destroy on the
source after the first migration. So while the VM was no longer running
there, it was still present in Vdsm (in Down status) and thus the
following create call was rejected in API.
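For context, the rejection is the standard guard in Vdsm's create path: the VM UUID is still present in the host's VM container (in Down status, left over from the migration away), so the next migrationCreate of the same VM is refused. Below is a minimal sketch of that guard, assuming illustrative names for the container and helpers (this is not the actual Vdsm code):

    # Simplified sketch of the check behind "Virtual machine already exists".
    # Container and helper names are illustrative, not Vdsm's real ones.
    class VMExists(Exception):
        """Surfaces to the caller as {'code': 4, 'message': 'Virtual machine already exists'}."""

    vm_container = {}  # vmId -> Vm object; Down VMs stay here until Destroy is called

    def _start_vm(vm_params):
        # Placeholder for the real VM startup; returns a token object here.
        return object()

    def create(vm_params):
        vm_id = vm_params['vmId']
        if vm_id in vm_container:
            # A Down VM that Engine never destroyed still occupies the slot,
            # so migrating the same VM back to this host is rejected.
            raise VMExists()
        vm_container[vm_id] = _start_vm(vm_params)
        return vm_container[vm_id]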
There are weird messages in engine.log during the first migration, such as the following ones; I don’t know whether they are harmless or not:
2017-12-05 06:26:19,799-05 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7'(vm0) was unexpectedly detected as 'MigratingTo' on VDS '56bf5bdd-8a95-4b7a-9c72-7206d6b59e38'(lago-basic-suite-master-host-1) (expected on '104bd4da-24bb-4368-a5ca-21a465aca70e')
2017-12-05 06:26:19,799-05 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [] VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7' is migrating to VDS '56bf5bdd-8a95-4b7a-9c72-7206d6b59e38'(lago-basic-suite-master-host-1) ignoring it in the refresh until migration is done
These logs are completely harmless - they only indicate that the VM's
run_on_vds still points to the source host and the hand-over was not done
yet.
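Schematically, the check that emits those two INFO lines looks like this (a Python sketch of the logic only; the real code is VmAnalyzer in the Java engine and the names below are illustrative):

    # Sketch: how the monitoring refresh treats a report from a host that is
    # not the VM's current run_on_vds.  Illustrative, not the real VmAnalyzer.
    def analyze_vm_report(vm, reporting_host, log):
        # Normal case: the report comes from the host the VM is known to run on.
        if reporting_host == vm.run_on_vds:
            return True   # proceed with the regular refresh
        # Otherwise the VM showed up on an unexpected host.
        log.info("VM '%s'(%s) was unexpectedly detected as 'MigratingTo' on VDS '%s' "
                 "(expected on '%s')", vm.id, vm.name, reporting_host, vm.run_on_vds)
        if vm.migrating:
            # run_on_vds still points at the source; hand-over happens only
            # after the migration completes, so skip this report for now.
            log.info("VM '%s' is migrating to VDS '%s' ignoring it in the refresh "
                     "until migration is done", vm.id, reporting_host)
        return False      # ignore this report in the refresh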
The migration completion event was successfully obtained by Engine:
2017-12-05 06:26:24,542-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_MIGRATION_DONE(63), Migration completed (VM: vm0, Source: lago-basic-suite-master-host-0, Destination: lago-basic-suite-master-host-1, Duration: 6 seconds, Total: 6 seconds, Actual downtime: 73ms)
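So the completion event did arrive; what seems to be missing is the follow-up Destroy to the source host, which is what normally frees the Down VM there. The expected hand-over is roughly the following (a hedged sketch of the flow, not Engine code; send_jsonrpc is a stand-in for Engine's host command infrastructure, while 'VM.destroy' is the verb visible in the vdsm logs):

    # Rough sketch of the post-migration hand-over as this thread describes it.
    # Illustrative only; the real flow is driven by the Java engine over jsonrpc.
    def handle_migration_done(vm, source_host, dest_host, send_jsonrpc):
        # 1. The destination reported the VM as Up; point run_on_vds at it.
        vm.run_on_vds = dest_host
        # 2. Tell the source to drop its leftover Down VM, freeing its slot
        #    in the VM container.  In this run that call apparently never
        #    happened, so migrating the VM back to the source later failed
        #    with "Virtual machine already exists".
        send_jsonrpc(source_host, 'VM.destroy', vmID=vm.id)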
> Otherwise it would need reproduction in CI and some more logs….
>
> [1]
>
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>
>> On 5 Dec 2017, at 13:26, Dafna Ron <dron(a)redhat.com> wrote:
>>
>> Hi,
>> We had a failure for test 006_migrations.migrate_vm on master.
>>
>> There was a libvirt disruption in the migration src, and after that vdsm reported the migration as failed because the VM already exists, which makes me suspect a split-brain case.
>> The patch reported has nothing to do with this issue and I think we simply stumbled on a race condition which can cause a split brain.
>> Please note that there are several metrics-related issues reported in the vdsm logs as well.
>>
>> Link and headline of suspected patches:
>>
>> Not related
>>
>> Link to Job:
>>
>>
>>
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/
>>
>> Link to all logs:
>>
>>
>>
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4278/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/
>>
>> (Relevant) error snippet from the log:
>> <error>
>>
>> Engine:
>> 2017-12-05 06:26:48,546-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-385) [] EVENT_ID: VM_MIGRATION_TO_SERVER_FAILED(120), Migration failed (VM: vm0, Source: lago-basic-suite-master-host-1, Destination: lago-basic-suite-master-host-0).
>>
>>
>> dst:
>>
>> 2017-12-05 06:26:46,615-0500 WARN (jsonrpc/6) [vds] vm b3962e5c-08e3-444e-910e-ea675fa1a5c7 already exists (API:179)
>> 2017-12-05 06:26:46,615-0500 ERROR (jsonrpc/6) [api] FINISH create error=Virtual machine already exists (api:124)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 117, in method
>>     ret = func(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 180, in create
>>     raise exception.VMExists()
>> VMExists: Virtual machine already exists
>> 2017-12-05 06:26:46,620-0500 INFO (jsonrpc/6) [api.virt] FINISH create return={'status': {'message': 'Virtual machine already exists', 'code': 4}} from=::ffff:192.168.201.3,50394 (api:52)
>> 2017-12-05 06:26:46,620-0500 INFO (jsonrpc/6) [api.virt] FINISH migrationCreate return={'status': {'message': 'Virtual machine already exists', 'code': 4}} from=::ffff:192.168.201.3,50394 (api:52)
>> 2017-12-05 06:26:46,620-0500 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.migrationCreate failed (error 4) in 0.03 seconds (__init__:573)
>> 2017-12-05 06:26:46,624-0500 INFO (jsonrpc/2) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.201.3,50394 (api:46)
>> 2017-12-05 06:26:46,624-0500 INFO (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Release VM resources (vm:4967)
>> 2017-12-05 06:26:46,625-0500 WARN (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') trying to set state to Powering down when already Down (vm:575)
>> 2017-12-05 06:26:46,625-0500 INFO (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Stopping connection (guestagent:435)
>> 2017-12-05 06:26:46,625-0500 INFO (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') _destroyVmGraceful attempt #0 (vm:5004)
>> 2017-12-05 06:26:46,626-0500 WARN (jsonrpc/2) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') VM 'b3962e5c-08e3-444e-910e-ea675fa1a5c7' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:5025)
>> 2017-12-05 06:26:46,627-0500 INFO (jsonrpc/2) [vdsm.api] START teardownImage(sdUUID='952bb427-b88c-4fbe-99ef-49970d3aaf70', spUUID='9dcfeaaf-96b7-4e26-a327-5570e0e39261', imgUUID='e6eadbae-ec7a-48f4-8832-64a622a12bef', volUUID=None) from=::ffff:192.168.201.3,50394, task_id=2da93725-5533-4354-a369-751eb44f9ef2 (api:46)
>>
>> src:
>>
>> 2017-12-05 06:26:46,623-0500 ERROR (migsrc/b3962e5c) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') migration destination error: Virtual machine already exists (migration:290)
>>
>> disruption on dst:
>>
>> 2017-12-05 06:20:04,662-0500 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.shutdown succeeded in 0.00 seconds (__init__:573)
>> 2017-12-05 06:20:04,676-0500 ERROR (Thread-1) [root] Shutdown by QEMU Guest Agent failed (vm:5097)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5088, in qemuGuestAgentShutdown
>>     self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_GUEST_AGENT)
>>   File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
>>     ret = attr(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
>>     ret = f(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 512, in wrapper
>>     return func(inst, *args, **kwargs)
>>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2403, in shutdownFlags
>>     if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
>> libvirtError: Guest agent is not responding: QEMU guest agent is not connected
>> 2017-12-05 06:20:04,697-0500 INFO (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') block threshold 1 exceeded on 'vda' (/rhev/data-center/mnt/blockSD/952bb427-b88c-4fbe-99ef-49970d3aaf70/images/e6eadbae-ec7a-48f4-8832-64a622a12bef/1f063c73-f5cd-44e8-a84f-7810857f82df) (drivemonitor:162)
>> 2017-12-05 06:20:04,698-0500 INFO (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') drive 'vda' threshold exceeded (storage:872)
>> 2017-12-05 06:20:05,889-0500 INFO (jsonrpc/7) [api.host] START getAllVmStats() from=::1,41118 (api:46)
>> 2017-12-05 06:20:05,891-0500 INFO (jsonrpc/7) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::1,41118 (api:52)
>> 2017-12-05 06:20:05,892-0500 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:573)
>> 2017-12-05 06:20:07,466-0500 INFO (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') underlying process disconnected (vm:1024)
>> 2017-12-05 06:20:07,469-0500 INFO (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Release VM resources (vm:4967)
>> 2017-12-05 06:20:07,469-0500 INFO (libvirt/events) [virt.vm] (vmId='b3962e5c-08e3-444e-910e-ea675fa1a5c7') Stopping connection (guestagent:435)
>> 2017-12-05 06:20:07,469-0500 INFO (libvirt/events) [vdsm.api] START teardownImage(sdUUID='952bb427-b88c-4fbe-99ef-49970d3aaf70', spUUID='9dcfeaaf-96b7-4e26-a327-5570e0e39261', imgUUID='e6eadbae-ec7a-48f4-8832-64a622a12bef', volUUID=None) from=internal, task_id=9efedf46-d3be-4e41-b7f9-a074ed6344f6 (api:46)
>>
>>
>> </error>
>>
>>
>>
>>
_______________________________________________
Devel mailing list
Devel(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel