Subject: [ OST Failure Report ] [ oVirt Master ] [ Jan 15th 2018 ] [ 006_migrations.migrate_vm ]

Hi,
We had a failure in test 006_migrations.migrate_vm <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
The migration failed with reason "VMExists". It seems to be an issue caused by connectivity between the engine and the hosts. I remember this issue happening a few weeks ago - is there a solution/bug for this issue?

Link and headline of suspected patches:
https://gerrit.ovirt.org/#/c/86114/4 - net tests: Fix vlan creation name length in nettestlib
Link to Job:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/
Link to all logs:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/artifact/

(Relevant) error snippet from the log:
<error>
vdsm dst:
2018-01-15 06:47:03,355-0500 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:124)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 117, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 180, in create
    raise exception.VMExists()
VMExists: Virtual machine already exists

vdsm src:
2018-01-15 06:47:03,359-0500 ERROR (migsrc/d17a2482) [virt.vm] (vmId='d17a2482-4904-4cbc-8d13-3a3b7840782d') migration destination error: Virtual machine already exists (migration:290)

Engine:
2018-01-15 06:45:30,169-05 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] Failure to refresh host 'lago-basic-suite-master-host-0' runtime info: java.net.ConnectException: Connection refused
2018-01-15 06:45:30,169-05 DEBUG [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:159) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:122) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:] at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:387) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand$$super(Unknown Source) [vdsbroker.jar:] at sun.reflect.GeneratedMethodAccessor234.invoke(Unknown Source) [:1.8.0_151] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_151] at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_151] at org.jboss.weld.interceptor.proxy.TerminalAroundInvokeInvocationContext.proceedInternal(TerminalAroundInvokeInvocationContext.java:49) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at org.jboss.weld.interceptor.proxy.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:77) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at
org.ovirt.engine.core.common.di.interceptor.LoggingInterceptor.apply(LoggingInterceptor.java:12) [common.jar:] at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source) [:1.8.0_151] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_151] at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_151] at org.jboss.weld.interceptor.reader.SimpleInterceptorInvocation$SimpleMethodInvocation.invoke(SimpleInterceptorInvocation.java:73) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeAroundInvoke(InterceptorMethodHandler.java:84) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.executeInterception(InterceptorMethodHandler.java:72) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at org.jboss.weld.interceptor.proxy.InterceptorMethodHandler.invoke(InterceptorMethodHandler.java:56) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:79) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at org.jboss.weld.bean.proxy.CombinedInterceptorAndDecoratorStackMethodHandler.invoke(CombinedInterceptorAndDecoratorStackMethodHandler.java:68) [weld-core-impl-2.4.3.Final.jar:2.4.3.Final] at org.ovirt.engine.core.vdsbroker.ResourceManager$Proxy$_$$_WeldSubclass.runVdsCommand(Unknown Source) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:647) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:118) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:85) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VdsManager.refresh(VdsManager.java:267) [vdsbroker.jar:] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_151] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_151] at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383) [javax.enterprise.concurrent-1.0.jar:] at org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534) [javax.enterprise.concurrent-1.0.jar:] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_151] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_151] at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_151] at org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250) [javax.enterprise.concurrent-1.0.jar:] at org.jboss.as.ee.concurrent.service.ElytronManagedThreadFactory$ElytronManagedThread.run(ElytronManagedThreadFactory.java:78)Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [rt.jar:1.8.0_151] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) [rt.jar:1.8.0_151] at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient.connect(ReactorClient.java:118) [vdsm-jsonrpc-java-client.jar:] at org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.getClient(JsonRpcClient.java:160) [vdsm-jsonrpc-java-client.jar:] at 
org.ovirt.vdsm.jsonrpc.client.JsonRpcClient.call(JsonRpcClient.java:93) [vdsm-jsonrpc-java-client.jar:] at org.ovirt.engine.core.vdsbroker.jsonrpc.FutureMap.<init>(FutureMap.java:70) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.jsonrpc.JsonRpcVdsServer.getCapabilities(JsonRpcVdsServer.java:314) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:22) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:112) [vdsbroker.jar:] ... 34 more
2018-01-15 06:45:30,170-05 DEBUG [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] Failed to refresh VDS, network error, continuing, vds='lago-basic-suite-master-host-0'(31f09289-ec6c-42ff-a745-e82e8ac8e6b9): java.net.ConnectException: Connection refused
</error>
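For context, the test drives the migration through the engine's REST API. Below is a minimal sketch of that flow with the Python SDK, assuming made-up connection details, VM name and host name - this is not the actual OST code:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Hypothetical connection details - not the OST/lago environment.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm0')[0]
vm_service = vms_service.vm_service(vm.id)

# Migrate to a specific host; as noted later in the thread, the test pins
# the destination, which is why the engine runs MigrateVmToServer rather
# than plain MigrateVm.
vm_service.migrate(host=types.Host(name='host-0'))

# Wait until the VM is reported Up again on the new host.
for _ in range(120):
    if vm_service.get().status == types.VmStatus.UP:
        break
    time.sleep(1)

connection.close()

As discussed later in the thread, the test then performs a second, reverse migration, and that second migration is the step that fails here with "VMExists".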

On Mon, Jan 15, 2018 at 5:13 PM, Dafna Ron <dron@redhat.com> wrote:
Hi,
We had a failure in test 006_migrations.migrate_vm <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
the migration failed with reason "VMExists"
Seems to be an issue which is caused by connectivity between engine and hosts. I remember this issue happening before a few weeks ago - is there a solution/bug for this issue?
Link and headline of suspected patches: https://gerrit.ovirt.org/#/c/86114/4 - net tests: Fix vlan creation name length in nettestlib
This touched tests, not production code, so I do not think it is relevant.

Dafna Ron <dron@redhat.com> writes:
We had a failure in test 006_migrations.migrate_vm <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
the migration failed with reason "VMExists"
There are two migrations in 006_migrations.migrate_vm. The first one succeeded, but if I'm looking correctly into the logs, Engine didn't send Destroy to the source host after the migration had finished. Then the second migration gets rejected by Vdsm, because Vdsm still keeps the former Vm object instance in Down status.
Since the test succeeds most of the time, it looks like some timing issue or border case. Arik, is it a known problem? If not, would you like to look into the logs, whether you can see what's happening?
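To make the failure mode concrete, here is a small self-contained sketch in plain Python - not actual Vdsm code, the container class and names are illustrative only. The host that is the destination of the second migration still holds a Vm object with the same id, left in Down state because Destroy was never sent, so the incoming create is rejected, matching the vdsm dst traceback in the report:

class VMExists(Exception):
    pass

class FakeHostVmContainer:
    """Toy stand-in for the per-host VM container kept by Vdsm."""
    def __init__(self):
        self.vms = {}  # vmId -> vm state

    def create(self, vm_id):
        if vm_id in self.vms:
            # Mirrors the "Virtual machine already exists" error in the
            # vdsm dst log: the stale Vm object was never destroyed.
            raise VMExists('Virtual machine already exists')
        self.vms[vm_id] = 'WaitForLaunch'

vm_id = 'd17a2482-4904-4cbc-8d13-3a3b7840782d'
host1 = FakeHostVmContainer()

host1.create(vm_id)          # VM originally running on this host
host1.vms[vm_id] = 'Down'    # first migration finished, but no Destroy arrived

try:
    host1.create(vm_id)      # incoming second migration back to this host
except VMExists as exc:
    print(exc)               # -> Virtual machine already exists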
Seems to be an issue which is caused by connectivity between engine and hosts. I remember this issue happening before a few weeks ago - is there a solution/bug for this issue?
None I'm aware of.
Link and headline of suspected patches: https://gerrit.ovirt.org/#/c/86114/4 - net tests: Fix vlan creation name length in nettestlib
It's just coincidence that it failed on that patch, so I'm excluding Edward from the discussion, he is innocent :-).

On Wed, Jan 17, 2018 at 9:41 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
Dafna Ron <dron@redhat.com> writes:
We had a failure in test 006_migrations.migrate_vm <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
the migration failed with reason "VMExists"
There are two migrations in 006_migrations.migrate_vm. The first one succeeded, but if I'm looking correctly into the logs, Engine didn't send Destroy to the source host after the migration had finished. Then the second migration gets rejected by Vdsm, because Vdsm still keeps the former Vm object instance in Down status.
Since the test succeeds most of the time, it looks like some timing issue or border case. Arik, is it a known problem? If not, would you like to look into the logs, whether you can see what's happening?
Your analysis is correct. That's a nice one actually!
The statistics monitoring cycles of both hosts host-0 and host-1 were scheduled in a way that they are executed almost at the same time [1].
Now, at 6:46:34 the VM was migrated from host-1 to host-0.
At 6:46:42 the migration succeeded - we got events from both hosts, but only processed the one from the destination so the VM switched to Up.
The next statistics monitoring cycle was triggered at 6:46:44 - again, the report of that VM from the source host was skipped because we processed the one from the destination.
At 6:46:59, in the next statistics monitoring cycle, it happened again - the report of the VM from the source host was skipped.
The next migration was triggered at 6:47:05 - the engine didn't manage to process any report from the source host, so the VM remained Down there.
The probability of this to happen is extremely low. However, I think we can make a little tweak to the monitoring code to avoid this:
"If we get the VM as Down on an unexpected host (that is, not the host we expect the VM to run on), do not lock the VM"
It should be safe since we don't update anything in this scenario.
[1] For instance:
2018-01-15 06:46:44,905-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='873a4d36-55fe-4be1-acb7-8de9c9123eb2'})
2018-01-15 06:46:44,932-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='31f09289-ec6c-42ff-a745-e82e8ac8e6b9'})
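To make the proposed tweak concrete, here is a small illustrative sketch. The real monitoring code lives in the Java engine; this only restates the rule in executable form, and the field names (run_on_host, the report fields) are assumptions, not actual engine APIs:

def should_lock_vm(db_vm, reported_status, reporting_host):
    """Decide whether a VM report needs the per-VM monitoring lock."""
    if reported_status == 'Down' and reporting_host != db_vm['run_on_host']:
        # The VM is reported Down by a host it is not expected to run on
        # (e.g. the source host of an already-completed migration).  Nothing
        # is updated in this case, so the lock can be skipped and the report
        # from that host no longer loses the race with the destination's report.
        return False
    return True

# After the first migration the engine expects the VM on host-0,
# while host-1 (the old source) still reports it as Down.
db_vm = {'id': 'd17a2482-4904-4cbc-8d13-3a3b7840782d', 'run_on_host': 'host-0'}
print(should_lock_vm(db_vm, 'Down', 'host-1'))  # False - skip the lock
print(should_lock_vm(db_vm, 'Up', 'host-0'))    # True  - normal processing

As stated above, skipping the lock should be safe here because nothing is updated for a Down report coming from a host the VM is not expected to run on.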

On 18 Jan 2018, at 17:36, Arik Hadas <ahadas@redhat.com> wrote:
The probability of this to happen is extremely low.
Why wasn't the migration rerun?

On Fri, Jan 19, 2018 at 12:46 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 18 Jan 2018, at 17:36, Arik Hadas <ahadas@redhat.com> wrote:
On Wed, Jan 17, 2018 at 9:41 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
Dafna Ron <dron@redhat.com> writes:
We had a failure in test 006_migrations.migrate_vm <http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4842/testReport/junit/%28root%29/006_migrations/migrate_vm/>.
the migration failed with reason "VMExists"
There are two migrations in 006_migrations.migrate_vm. The first one succeeded, but if I'm looking correctly into the logs, Engine didn't send Destroy to the source host after the migration had finished. Then the second migration gets rejected by Vdsm, because Vdsm still keeps the former Vm object instance in Down status.
Since the test succeeds most of the time, it looks like some timing issue or border case. Arik, is it a known problem? If not, would you like to look into the logs, whether you can see what's happening?
Your analysis is correct. That's a nice one actually!
The statistics monitoring cycles of both hosts host-0 and host-1 were scheduled in a way that they are executed almost at the same time [1].
Now, at 6:46:34 the VM was migrated from host-1 to host-0. At 6:46:42 the migration succeeded - we got events from both hosts, but only processed the one from the destination so the VM switched to Up. The next statistics monitoring cycle was triggered at 6:46:44 - again, the report of that VM from the source host was skipped because we processed the one from the destination. At 6:46:59, in the next statistics monitoring cycle, it happened again - the report of the VM from the source host was skipped. The next migration was triggered at 6:47:05 - the engine didn't manage to process any report from the source host, so the VM remained Down there.
The probability of this to happen is extremely low.
Why wasn't the migration rerun?
Good question, because a migration to a particular host (MigrateVmToServer) was requested. In this particular case, it seems that there are only two hosts defined so changing it to MigrateVm wouldn't make any difference though.
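A toy illustration of that point (the helper and host names are made up, not engine code): with a fixed destination there is no alternative host to rerun on, and with only two hosts a scheduler-chosen rerun has no other candidate either:

def candidate_hosts(all_hosts, source_host, requested_host=None, failed_hosts=()):
    """Hosts a (re)run of the migration could still target."""
    if requested_host is not None:
        candidates = [requested_host]          # MigrateVmToServer: destination is fixed
    else:
        candidates = [h for h in all_hosts if h != source_host]  # MigrateVm: scheduler picks
    return [h for h in candidates if h not in failed_hosts]

all_hosts = ['host-0', 'host-1']
# Second migration: host-0 -> host-1, and host-1 just rejected the VM.
print(candidate_hosts(all_hosts, 'host-0', requested_host='host-1',
                      failed_hosts=['host-1']))   # [] - nowhere to rerun
print(candidate_hosts(all_hosts, 'host-0',
                      failed_hosts=['host-1']))   # [] - still nowhere with only two hosts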
However, I think we can make a little tweak to the monitoring code to avoid this: "If we get the VM as Down on an unexpected host (that is, not the host we expect the VM to run on), do not lock the VM" It should be safe since we don't update anything in this scenario.
[1] For instance:
2018-01-15 06:46:44,905-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='873a4d36-55fe-4be1-acb7-8de9c9123eb2'})
2018-01-15 06:46:44,932-05 ... GetAllVmStatsVDSCommand ... VdsIdVDSCommandParametersBase:{hostId='31f09289-ec6c-42ff-a745-e82e8ac8e6b9'})
participants (5)
- Arik Hadas
- Dafna Ron
- Edward Haas
- Michal Skrivanek
- Milan Zamazal