[ovirt-devel] [ OST Failure Report ] [ oVirt Master ] [ Jan 15th 2018 ] [ 006_migrations.migrate_vm ]
Michal Skrivanek
michal.skrivanek at redhat.com
Fri Jan 19 10:46:41 UTC 2018
> On 18 Jan 2018, at 17:36, Arik Hadas <ahadas at redhat.com> wrote:
>
>
>
> On Wed, Jan 17, 2018 at 9:41 PM, Milan Zamazal <mzamazal at redhat.com> wrote:
> Dafna Ron <dron at redhat.com> writes:
>
> > We had a failure in test 006_migrations.migrate_vm
> > <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
> >
> > The migration failed with reason "VMExists".
>
> There are two migrations in 006_migrations.migrate_vm. The first one
> succeeded, but if I'm reading the logs correctly, Engine didn't
> send Destroy to the source host after the migration had finished. The
> second migration is then rejected by Vdsm, because Vdsm still keeps the
> former Vm object instance in Down status.
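>
> For reference, the rejection on the Vdsm side boils down to a check like the
> following (a simplified Python sketch with illustrative names, not the actual
> vdsm code):
>
>     class VMExistsError(Exception):
>         pass
>
>     class VmContainer:
>         def __init__(self):
>             # vmId -> Vm object; entries stay here even when the VM is Down,
>             # until Destroy removes them
>             self._vms = {}
>
>         def create(self, vm_params):
>             vm_id = vm_params['vmId']
>             # An incoming migration is rejected as long as a previous Vm
>             # object with the same id is still present, even if it is only
>             # sitting there in Down status waiting for Engine's Destroy.
>             if vm_id in self._vms:
>                 raise VMExistsError("Virtual machine already exists: %s" % vm_id)
>             self._vms[vm_id] = vm_params
>             return self._vms[vm_id]
>
>         def destroy(self, vm_id):
>             # This is what Engine's Destroy normally triggers; without it the
>             # Down Vm object lingers and blocks the next incoming migration.
>             self._vms.pop(vm_id, None)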
>
> Since the test succeeds most of the time, it looks like some timing
> issue or corner case. Arik, is it a known problem? If not, would you
> like to look into the logs and see whether you can tell what's happening?
>
> Your analysis is correct. That's a nice one actually!
>
> The statistics monitoring cycles of the two hosts, host-0 and host-1, happened to be scheduled such that they execute almost at the same time [1].
>
> Now, at 6:46:34 the VM was migrated from host-1 to host-0.
> At 6:46:42 the migration succeeded - we got events from both hosts, but only processed the one from the destination, so the VM switched to Up.
> The next statistics monitoring cycle was triggered at 6:46:44 - again, the report of that VM from the source host was skipped because we processed the one from the destination.
> At 6:46:59, in the next statistics monitoring cycle, it happened again - the report of the VM from the source host was skipped.
> The next migration was triggered at 6:47:05 - by then the engine had not managed to process any report from the source host, so the VM remained Down there.
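>
> To make the skip concrete, here is a minimal Python sketch of the mechanism,
> assuming the per-VM locking that the tweak below refers to (names and
> structure are illustrative, not the actual engine code):
>
>     import threading
>
>     vm_lock = threading.Lock()
>
>     def try_process(host, status):
>         # Non-blocking acquire: if the VM is already locked on behalf of
>         # another host's report, this report is simply skipped.
>         if not vm_lock.acquire(blocking=False):
>             print("report from %s (%s) skipped, VM is locked" % (host, status))
>             return
>         try:
>             print("report from %s (%s) processed" % (host, status))
>         finally:
>             vm_lock.release()
>
>     # The destination host's cycle is in the middle of processing the VM...
>     vm_lock.acquire()
>     # ...so the source host's Down report, arriving a few milliseconds later,
>     # is skipped; with both cycles aligned, this repeats every cycle.
>     try_process('host-1 (source)', 'Down')
>     vm_lock.release()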
>
> The probability of this happening is extremely low.
Why wasn't the migration rerun?
> However, I think we can make a little tweak to the monitoring code to avoid this:
> "If we get the VM as Down on an unexpected host (that is, not the host we expect the VM to run on), do not lock the VM"
> It should be safe since we don't update anything in this scenario.
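>
> Roughly, the tweak would amount to something like this (a Python sketch with
> illustrative names; the engine's monitoring code is Java and structured
> differently):
>
>     import threading
>
>     class VmMonitor:
>         def __init__(self, vm_id, expected_run_host):
>             self.vm_id = vm_id
>             self.expected_run_host = expected_run_host
>             self.lock = threading.Lock()
>
>         def process_report(self, reporting_host, reported_status):
>             if (reported_status == 'Down'
>                     and reporting_host != self.expected_run_host):
>                 # Engine's view of the VM is not updated in this case, so it
>                 # is safe to handle the report without taking the lock, and
>                 # it can no longer be starved by reports from the expected
>                 # (destination) host; the cleanup of the lingering Vm object
>                 # on the source host would be initiated here.
>                 print("cleaning up Down VM %s on %s"
>                       % (self.vm_id, reporting_host))
>                 return
>             with self.lock:
>                 print("processing %s report for VM %s from %s"
>                       % (reported_status, self.vm_id, reporting_host))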
>
> [1] For instance:
> 2018-01-15 06:46:44,905-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='873a4d36-55fe-4be1-acb7-8de9c9123eb2'})
> 2018-01-15 06:46:44,932-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='31f09289-ec6c-42ff-a745-e82e8ac8e6b9'})
> _______________________________________________
> Devel mailing list
> Devel at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel