[ovirt-devel] [ OST Failure Report ] [ oVirt Master ] [ Jan 15th 2018 ] [ 006_migrations.migrate_vm ]

Arik Hadas ahadas at redhat.com
Fri Jan 19 14:23:38 UTC 2018


On Fri, Jan 19, 2018 at 12:46 PM, Michal Skrivanek <
michal.skrivanek at redhat.com> wrote:

>
>
> On 18 Jan 2018, at 17:36, Arik Hadas <ahadas at redhat.com> wrote:
>
>
>
> On Wed, Jan 17, 2018 at 9:41 PM, Milan Zamazal <mzamazal at redhat.com>
> wrote:
>
>> Dafna Ron <dron at redhat.com> writes:
>>
>> > We had a failure in test 006_migrations.migrate_vm
>> > <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
>> >
>> > The migration failed with reason "VMExists".
>>
>> There are two migrations in 006_migrations.migrate_vm.  The first one
>> succeeded, but if I'm reading the logs correctly, Engine didn't
>> send Destroy to the source host after the migration had finished.  Then
>> the second migration gets rejected by Vdsm, because Vdsm still keeps the
>> former Vm object instance in Down status.
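>>
>> As a purely illustrative model of that rejection (Vdsm itself is
>> Python, and all names here are made up), the guard boils down to an
>> "id still tracked" check:
>>
>> import java.util.Map;
>> import java.util.concurrent.ConcurrentHashMap;
>>
>> // Hypothetical model of the Vdsm-side guard: an incoming migration is
>> // rejected while a Vm object with the same id is still tracked, even
>> // if that object is already Down and merely awaiting Destroy.
>> class VmTrackerModel {
>>     enum Status { UP, DOWN }
>>
>>     private final Map<String, Status> vms = new ConcurrentHashMap<>();
>>
>>     void onMigrationFinished(String vmId) {
>>         // Source side: the Vm stays tracked as Down until Engine sends Destroy.
>>         vms.put(vmId, Status.DOWN);
>>     }
>>
>>     void onDestroy(String vmId) {
>>         vms.remove(vmId);
>>     }
>>
>>     String migrationCreate(String vmId) {
>>         if (vms.containsKey(vmId)) {
>>             return "VMExists"; // the failure seen in the test
>>         }
>>         vms.put(vmId, Status.UP);
>>         return "OK";
>>     }
>> }
>>
>> Since Destroy never arrived after the first migration, the first
>> migration's source host still tracks the Vm, so migrationCreate for
>> the second migration (back to that host) returns "VMExists".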
>>
>> Since the test succeeds most of the time, it looks like a timing
>> issue or an edge case.  Arik, is it a known problem?  If not, would you
>> like to look into the logs and see what's happening?
>
>
> Your analysis is correct. That's a nice one actually!
>
> The statistics monitoring cycles of the two hosts, host-0 and host-1, were
> scheduled in such a way that they executed almost at the same time [1].
>
> Now, at 6:46:34 the VM was migrated from host-1 to host-0.
> At 6:46:42 the migration succeeded - we got events from both hosts, but
> only processed the one from the destination, so the VM switched to Up.
> The next statistics monitoring cycle was triggered at 6:46:44 - again, the
> report of that VM from the source host was skipped because we processed the
> one from the destination.
> At 6:46:59, in the next statistics monitoring cycle, it happened again -
> the report of the VM from the source host was skipped.
> The next migration was triggered at 6:47:05 - the engine didn't manage to
> process any report from the source host, so the VM remained Down there.
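>
> To make the skipping concrete, here is a rough sketch (made-up names,
> not the actual engine code) of per-VM try-lock monitoring in which a
> report that loses the race is dropped rather than queued:
>
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.locks.ReentrantLock;
>
> // Rough model: each cycle tries to lock the VM before analyzing a
> // host's report; if the destination's report holds the lock, the
> // source's Down report is skipped and only retried next cycle.
> class VmMonitorModel {
>     private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
>
>     void analyzeVmReport(String vmId, Runnable analysis) {
>         ReentrantLock lock = locks.computeIfAbsent(vmId, k -> new ReentrantLock());
>         if (!lock.tryLock()) {
>             return; // lost the race: the report is silently dropped
>         }
>         try {
>             analysis.run();
>         } finally {
>             lock.unlock();
>         }
>     }
> }
>
> With the two hosts' cycles aligned to within roughly 30ms (see [1]
> below), the source host can keep losing this race on every cycle until
> the next migration starts.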
>
> The probability of this happening is extremely low.
>
>
> Why wasn't the migration rerun?
>

Good question: it's because a migration to a particular host
(MigrateVmToServer) was requested.
In this particular case it seems that only two hosts are defined, so
changing it to MigrateVm wouldn't make any difference anyway.
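
Schematically (hypothetical names, not the real engine command classes),
the difference is only in whether a rerun may pick another destination:

import java.util.List;
import java.util.Optional;

// Schematic rerun policy: MigrateVmToServer pins the destination, so a
// "VMExists" rejection cannot be retried elsewhere; plain MigrateVm
// could retry on any other candidate host.
class MigrationRerunModel {
    Optional<String> nextDestination(boolean toSpecificServer,
                                     String failedHost,
                                     List<String> candidateHosts) {
        if (toSpecificServer) {
            return Optional.empty(); // no rerun: the requested host failed
        }
        return candidateHosts.stream()
                .filter(h -> !h.equals(failedHost))
                .findFirst();
    }
}

With only two hosts in the OST environment, the candidate list minus the
failed destination is empty either way, which is why switching the test
to MigrateVm wouldn't help here.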


>
> However, I think we can make a little tweak to the monitoring code to
> avoid this:
> "If we get the VM as Down on an unexpected host (that is, not the host we
> expect the VM to run on), do not lock the VM"
> It should be safe since we don't update anything in this scenario.
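>
> As a sketch of that tweak (again with made-up names), the lock becomes
> conditional on the report being something we actually act upon:
>
> // Sketch of the proposed tweak: a Down report from a host other than
> // the one the VM is expected to run on updates nothing, so it could be
> // analyzed without taking the per-VM lock.
> class LockPolicyModel {
>     boolean needsLock(String reportedStatus, String reportingHost,
>                       String expectedRunHost) {
>         boolean downOnUnexpectedHost = "Down".equals(reportedStatus)
>                 && !reportingHost.equals(expectedRunHost);
>         return !downOnUnexpectedHost;
>     }
> }
>
> That way the source host's Down report would be processed even when it
> collides with the destination's cycle, and the cleanup (Destroy) could
> presumably be issued in time for the next migration.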
>
> [1] For instance:
> 2018-01-15 06:46:44,905-05 ... GetAllVmStatsVDSCommand ...
> VdsIdVDSCommandParametersBase:{hostId='873a4d36-55fe-4be1-acb7-8de9c9123eb2'})
> 2018-01-15 06:46:44,932-05 ... GetAllVmStatsVDSCommand ...
> VdsIdVDSCommandParametersBase:{hostId='31f09289-ec6c-42ff-a745-e82e8ac8e6b9'})

