Subject: [ OST Failure Report ] [ oVirt Master ] [ Jan 15th 2018 ] [ 006_migrations.migrate_vm ]

Arik Hadas ahadas at redhat.com
Thu Jan 18 16:36:57 UTC 2018


On Wed, Jan 17, 2018 at 9:41 PM, Milan Zamazal <mzamazal at redhat.com> wrote:

> Dafna Ron <dron at redhat.com> writes:
>
> > We had a failure in test 006_migrations.migrate_vm
> > <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
> >
> > The migration failed with reason "VMExists"
>
> There are two migrations in 006_migrations.migrate_vm.  The first one
> succeeded, but if I'm reading the logs correctly, Engine didn't
> send Destroy to the source host after the migration had finished.  Then
> the second migration gets rejected by Vdsm, because Vdsm still keeps the
> former Vm object instance in Down status.
>
> Since the test succeeds most of the time, it looks like a timing issue
> or an edge case.  Arik, is it a known problem?  If not, would you like
> to look into the logs to see what's happening?


Your analysis is correct. That's a nice one actually!

The statistics monitoring cycles of both hosts, host-0 and host-1, were
scheduled in a way that makes them run almost at the same time [1].

Now, at 6:46:34 the VM was migrated from host-1 to host-0.
At 6:46:42 the migration succeeded - we got events from both hosts, but
only processed the one from the destination, so the VM switched to Up.
The next statistics monitoring cycle was triggered at 6:46:44 - again, the
report of that VM from the source host was skipped because we processed the
one from the destination.
At 6:46:59, in the next statistics monitoring cycle, it happened again -
the report of the VM from the source host was skipped.
The next migration was triggered at 6:47:05 - the engine didn't manage to
process any report from the source host, so the VM remained Down there.
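
To make the mechanics concrete, here is a minimal, hypothetical sketch of
that skip logic (the names are illustrative, not the actual ovirt-engine
code): each host's monitoring cycle tries to take a per-VM lock before
analyzing that VM's report, and a report that loses the race is simply
dropped.

    import java.util.UUID;
    import java.util.concurrent.locks.ReentrantLock;

    class MonitoringRaceSketch {

        enum VmStatus { UP, DOWN }

        // One entry of a host's GetAllVmStats report.
        record VmReport(UUID vmId, UUID reportingHostId, VmStatus status) {}

        private final ReentrantLock vmLock = new ReentrantLock();

        // Invoked once per host per statistics monitoring cycle.
        void analyzeReport(VmReport report) {
            if (!vmLock.tryLock()) {
                // Another host's cycle is processing this VM right now.
                // After the migration that is always the destination, so
                // the source's Down report - the one that would trigger
                // Destroy - is dropped cycle after cycle.
                return;
            }
            try {
                process(report);
            } finally {
                vmLock.unlock();
            }
        }

        private void process(VmReport report) {
            // status handling, sending Destroy on Down, etc.
        }
    }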

The probability of this happening is extremely low.
However, I think we can make a little tweak to the monitoring code to avoid
this:
"If we get the VM as Down on an unexpected host (that is, not the host we
expect the VM to run on), do not lock the VM"
It should be safe since we don't update anything in this scenario.
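
A minimal sketch of what that tweak could look like, again with
hypothetical names rather than the actual monitoring code: the
Down-on-an-unexpected-host case takes a read-only path that bypasses the
per-VM lock, so the source's report can no longer be starved by the
destination's monitoring.

    import java.util.UUID;
    import java.util.concurrent.locks.ReentrantLock;

    class VmAnalyzerSketch {

        enum VmStatus { UP, DOWN }

        record VmReport(UUID vmId, UUID reportingHostId, VmStatus status) {}

        private final ReentrantLock vmLock = new ReentrantLock();

        void analyze(VmReport report, UUID expectedRunOnHost) {
            boolean downOnUnexpectedHost =
                    report.status() == VmStatus.DOWN
                    && !report.reportingHostId().equals(expectedRunOnHost);

            if (downOnUnexpectedHost) {
                // Nothing is updated on this path, so it is safe to
                // proceed without the lock and finally send Destroy to
                // clear the stale Vm object on the source host.
                handleDownOnUnexpectedHost(report);
                return;
            }

            if (!vmLock.tryLock()) {
                return; // lock held by another cycle; skip this VM for now
            }
            try {
                handleRegularReport(report);
            } finally {
                vmLock.unlock();
            }
        }

        private void handleDownOnUnexpectedHost(VmReport report) { /* ... */ }
        private void handleRegularReport(VmReport report) { /* ... */ }
    }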

[1] For instance:

2018-01-15 06:46:44,905-05 ... GetAllVmStatsVDSCommand ...
VdsIdVDSCommandParametersBase:{hostId='873a4d36-55fe-4be1-acb7-8de9c9123eb2'})

2018-01-15 06:46:44,932-05 ... GetAllVmStatsVDSCommand ...
VdsIdVDSCommandParametersBase:{hostId='31f09289-ec6c-42ff-a745-e82e8ac8e6b9'})