On Wed, Jan 17, 2018 at 9:41 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
Dafna Ron <dron@redhat.com> writes:

> We had a failure in test 006_migrations.migrate_vm
> <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
>
> the migration failed with reason "VMExists"

There are two migrations in 006_migrations.migrate_vm.  The first one
succeeded, but if I'm looking correctly into the logs, Engine didn't
send Destroy to the source host after the migration had finished.  Then
the second migration gets rejected by Vdsm, because Vdsm still keeps the
former Vm object instance in Down status.

Since the test succeeds most of the time, it looks like some timing
issue or border case.  Arik, is it a known problem?  If not, would you
like to look into the logs, whether you can see what's happening?

Your analysis is correct. That's a nice one actually!

The statistics monitoring cycles of both hosts host-0 and host-1 were scheduled in a way that they are executed almost at the same time [1].

Now, at 6:46:34 the VM was migrated from host-1 to host-0.
At 6:46:42 the migration succeeded - we got events from both hosts, but only processed the one from the destination so the VM switched to Up.
The next statistics monitoring cycle was triggered at 6:46:44 - again, the report of that VM from the source host was skipped because we processed the one from the destination.
At 6:46:59, in the next statistics monitoring cycle, it happened again - the report of the VM from the source host was skipped.
The next migration was triggered at 6:47:05 - the engine didn't manage to process any report from the source host, so the VM remained Down there. 

The probability of this to happen is extremely low.
However, I think we can make a little tweak to the monitoring code to avoid this:
"If we get the VM as Down on an unexpected host (that is, not the host we expect the VM to run on), do not lock the VM"
It should be safe since we don't update anything in this scenario.
 
[1] For instance:

2018-01-15 06:46:44,905-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='873a4d36-55fe-4be1-acb7-8de9c9123eb2'})

2018-01-15 06:46:44,932-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='31f09289-ec6c-42ff-a745-e82e8ac8e6b9'})