[ovirt-devel] [ OST Failure Report ] [ oVirt Master ] [ Jan 15th 2018 ] [ 006_migrations.migrate_vm ]
Michal Skrivanek
michal.skrivanek at redhat.com
Fri Jan 19 10:46:41 UTC 2018
> On 18 Jan 2018, at 17:36, Arik Hadas <ahadas at redhat.com> wrote:
>
>
>
> On Wed, Jan 17, 2018 at 9:41 PM, Milan Zamazal <mzamazal at redhat.com> wrote:
> Dafna Ron <dron at redhat.com> writes:
>
> > We had a failure in test 006_migrations.migrate_vm
> > <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
> >
> > the migration failed with reason "VMExists"
>
> > There are two migrations in 006_migrations.migrate_vm. The first one
> > succeeded, but if I'm reading the logs correctly, Engine didn't send
> > Destroy to the source host after the migration had finished. The second
> > migration was then rejected by Vdsm, because Vdsm still kept the former
> > Vm object instance in Down status.
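> >
> > (To make the failure mode concrete, here is a rough, purely illustrative
> > sketch of the kind of check that produces "VMExists". Vdsm itself is
> > Python and the names below are made up - this is not its code, just the
> > shape of the rejection:
> >
> >     import java.util.Map;
> >     import java.util.concurrent.ConcurrentHashMap;
> >
> >     public class VmExistsSketch {
> >
> >         // Stand-in for the per-host container of known VM objects.
> >         private final Map<String, String> vmContainer = new ConcurrentHashMap<>();
> >
> >         /** Called when an incoming migration asks this host to create the VM. */
> >         void createIncomingVm(String vmId) {
> >             // A stale VM object left in Down status still occupies the id,
> >             // so creating the incoming VM is rejected.
> >             if (vmContainer.putIfAbsent(vmId, "WaitForLaunch") != null) {
> >                 throw new IllegalStateException("VMExists: " + vmId);
> >             }
> >             // ... proceed with setting up the incoming migration ...
> >         }
> >     }
> >
> > Nothing on that host removes the stale Down object until Engine sends
> > Destroy for it.)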
>
> > Since the test succeeds most of the time, it looks like a timing issue
> > or a corner case. Arik, is this a known problem? If not, would you like
> > to look into the logs and see whether you can tell what's happening?
>
> Your analysis is correct. That's a nice one actually!
>
> The statistics monitoring cycles of the two hosts, host-0 and host-1, were scheduled in a way that they run almost at the same time [1].
>
> Now, at 6:46:34 the VM was migrated from host-1 to host-0.
> At 6:46:42 the migration succeeded - we got events from both hosts, but only processed the one from the destination, so the VM switched to Up.
> The next statistics monitoring cycle was triggered at 6:46:44 - again, the report of that VM from the source host was skipped because we processed the one from the destination.
> At 6:46:59, in the next statistics monitoring cycle, it happened again - the report of the VM from the source host was skipped.
> The next migration was triggered at 6:47:05 - by then the engine had not managed to process any report of the VM from the source host, so the VM remained Down there.
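>
> To illustrate the starvation, a minimal sketch of the skip-if-locked
> behaviour described above (hypothetical names, not the actual monitoring
> classes):
>
>     import java.util.concurrent.locks.ReentrantLock;
>
>     public class VmMonitoringSketch {
>
>         private final ReentrantLock vmLock = new ReentrantLock();
>
>         /**
>          * Each statistics cycle analyzes the VM per reporting host. If the
>          * VM is already locked - e.g. its report or event from the
>          * destination host is being processed - the report from the other
>          * host is simply skipped, and if that happens cycle after cycle,
>          * the source host's Down report is never processed.
>          */
>         void analyzeVmReport(String reportingHost) {
>             if (!vmLock.tryLock()) {
>                 System.out.println("Skipping report from " + reportingHost
>                         + ": VM is locked");
>                 return;
>             }
>             try {
>                 System.out.println("Processing report from " + reportingHost);
>                 // ... update the VM, clear it from a source host that
>                 // reports it as Down, etc.
>             } finally {
>                 vmLock.unlock();
>             }
>         }
>     }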
>
> The probability of this happening is extremely low.
Why wasn't the migration rerun?
> However, I think we can make a little tweak to the monitoring code to avoid this:
> "If we get the VM as Down on an unexpected host (that is, not the host we expect the VM to run on), do not lock the VM"
> It should be safe since we don't update anything in this scenario.
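>
> A rough sketch of that tweak, with the same caveat that the names are
> hypothetical and this is not the actual engine code:
>
>     import java.util.UUID;
>     import java.util.concurrent.locks.ReentrantLock;
>
>     public class UnexpectedDownTweakSketch {
>
>         private final ReentrantLock vmLock = new ReentrantLock();
>
>         void analyzeVmReport(UUID reportingHost, UUID expectedRunOnHost,
>                 boolean reportedDown) {
>             boolean downOnUnexpectedHost =
>                     reportedDown && !reportingHost.equals(expectedRunOnHost);
>
>             if (downOnUnexpectedHost) {
>                 // Nothing in the engine's VM state is updated here, so it is
>                 // safe to act without taking the per-VM lock - this report
>                 // can no longer be starved by the lock being held for the
>                 // destination host.
>                 cleanUpOnSource(reportingHost);
>                 return;
>             }
>
>             if (!vmLock.tryLock()) {
>                 return; // regular path: skip while the VM is locked elsewhere
>             }
>             try {
>                 // ... regular monitoring of the VM on its run-on host ...
>             } finally {
>                 vmLock.unlock();
>             }
>         }
>
>         private void cleanUpOnSource(UUID hostId) {
>             System.out.println("Destroy leftover VM object on host " + hostId);
>         }
>     }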
>
> [1] For instance:
> 2018-01-15 06:46:44,905-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='873a4d36-55fe-4be1-acb7-8de9c9123eb2'})
> 2018-01-15 06:46:44,932-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='31f09289-ec6c-42ff-a745-e82e8ac8e6b9'})