--Apple-Mail=_45C10755-3D4E-480D-978D-C839E4C9C674
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
charset=utf-8
On 19 Dec 2017, at 10:14, Arik Hadas <ahadas(a)redhat.com>
wrote:
=20
=20
=20
On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek =
<michal.skrivanek(a)redhat.com <mailto:michal.skrivanek@redhat.com>> =
wrote:
=20
> On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal(a)redhat.com =
<mailto:mzamazal@redhat.com>> wrote:
>
> Yedidyah Bar David <didi(a)redhat.com <mailto:didi@redhat.com>> =
writes:
>
>> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit(a)ovirt.org =
<mailto:gerrit@ovirt.org>> wrote:
>>> Jenkins CI posted comments on this change.
>>>
>>
>>> View Change
>>>
>>> Patch set 3:Continuous-Integration -1
>>>
>>> Build Failed
>>>
>>> =
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86=
_64/2882/ =
<
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8=
6_64/2882/>
>>> : FAILURE
>>
>> Console output of above job says:
>>
>> 08:13:34 # migrate_vm:
>> 08:16:37 * Collect artifacts:
>> 08:16:40 * Collect artifacts: Success (in 0:00:03)
>> 08:16:40 # migrate_vm: Success (in 0:03:06)
>> 08:16:40 # Results located at
>> =
/dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit=
.xml
>> 08:16:40 @ Run test: 006_migrations.py: Success (in
0:03:50)
>> 08:16:40 Error occured, aborting
>>
>> The file 006_migrations.py.junit.xml [1] says:
>>
>> <failure type=3D"exceptions.AssertionError" message=3D"False
!=3D =
True after
>> 180 seconds">
>
> Reading the logs, I can see the VM migrates normally and seems to be
> reported to Engine correctly. When Engine receives end-of-migration
> event, it sends Destroy to the source (which is correct), calls =
dumpxmls
> on the destination in the meantime (looks fine to me) and then
calls
=20
looks like a race between getallvmstats reporting VM as Down =
(statusTime:
4296271980) being processed, while there is a =
Down/MigrationSucceeded event arriving (with notify_time 4296273170) at =
about the same time
Unfortunately the vdsm.log is not in DEBUG level so there=E2=80=99s =
very little information as to why and what exactly did it send out.
@infra - can you enable debug log level for vdsm by default?=20
=20
It does look like a race to me - does it reproduce?=20
=20
> Destroy on the destination, which is weird and I don't understand =
why
> the Destroy is invoked.
>
> Arik, would you like to take a look? Maybe I overlooked something =
or
> maybe there's a bug. The logs are at
> =
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86=
_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/ba=
sic-suite-master/post-006_migrations.py/ =
<
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8=
6_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/b=
asic-suite-master/post-006_migrations.py/>
> and the interesting things happen around 2017-12-18
03:13:43,758-05.
=20
So it looks like that:
1. the engine polls the VMs from the source host
2. right after #1 we get the down event with proper exit reason (=3D =
migration
succeeded) but the engine doesn't process it since the VM is =
being locked by the monitoring as part of processing that polling (to =
prevent two analysis of the same VM simultaneously).
3. the result of the polling is a VM in status Down and must probably
=
exit_status=3DNormal
4. the engine decides to abort the migration and thus the monitoring
=
thread of the source host destroys the VM on the destination host.
=20
Unfortunately we don't have the exit_reason that is returned by the =
polling.
However, the only option I can think of is that it is different than
=
MigrationSucceeded, because otherwise we would have hand-over the VM to =
the destination host rather than aborting the migration [1].
That part of the code recently changed as part of [2] - we used to =
hand-over the VM when we get from the source host:
status =3D Down + exit_status =3D Normal=20
And in the database: previous_status =3D MigrationFrom
But after that change we require:
status =3D Down + exit_status =3D Normal ** + exit_reason =3D =
MigrationSucceeded
**
And in the database: previous_status =3D MigrationFrom
=20
Long story short, is it possible that VDSM had set the status of the =
VM to Down
and exit_status to Normal but the exit_reason was not updated =
(yet?) to MigrationSucceeded?
ok, so there might be a plausible explanation
the guest drive mapping introduced a significant delay into the =
VM.getStats call since it tries to update the mapping when it detects a =
change. That is likely to happen on lifecycle changes. In the OST case =
it took 1.2s to finish the whole call, and in the meantime the migration =
has finished. The getStats() call is not written with possible state =
change in mind, so if it so happens and the state moves from anything to =
Down in the middle of it it returns a Down state without exitCode and =
exitReason which confuses engine. We started to use the exitReason code =
to differentiate the various flavors of Down in engine in ~4.1 and in =
this case it results in misleading =E2=80=9CVM powered off by admin=E2=80=9D=
case
we need to fix the VM.getStats() to handle VM state changes in the =
middle
we need to fix the guest drive mapping updates to handle cleanly =
situations when the VM is either not ready yet or already gone
Thanks,
michal
=20
[1] =
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/=
vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnaly=
zer.java#L291 =
<
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules=
/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnal=
yzer.java#L291>
<
https://gerrit.ovirt.org/#/c/84387/>
=20
>
>> Can someone please have a look? Thanks.
>>
>> As a side note, if indeed this is the cause for the failure for the
>> job, it's misleading to say "migrate_vm: Success".
>>
>> [1]
>> =
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86=
_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migratio=
ns.py.junit.xml =
<
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8=
6_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrati=
ons.py.junit.xml>
>>
>>>
>>> To view, visit change 85177. To unsubscribe, visit settings.
>>>
>>> Gerrit-Project: ovirt-system-tests
>>> Gerrit-Branch: master
>>> Gerrit-MessageType: comment
>>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5
>>> Gerrit-Change-Number: 85177
>>> Gerrit-PatchSet: 3
>>> Gerrit-Owner: Yedidyah Bar David <didi(a)redhat.com =
<mailto:didi@redhat.com>>
>>> Gerrit-Reviewer: Dafna Ron <dron(a)redhat.com =
<mailto:dron@redhat.com>>
>>> Gerrit-Reviewer: Eyal Edri <eedri(a)redhat.com =
<mailto:eedri@redhat.com>>
>>> Gerrit-Reviewer: Jenkins CI
>>> Gerrit-Reviewer: Sandro Bonazzola <sbonazzo(a)redhat.com =
<mailto:sbonazzo@redhat.com>>
>>> Gerrit-Reviewer: Yedidyah Bar David <didi(a)redhat.com
=
<mailto:didi@redhat.com>>
>>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000
>>> Gerrit-HasComments: No
> _______________________________________________
> Devel mailing list
> Devel(a)ovirt.org <mailto:Devel@ovirt.org>
>
http://lists.ovirt.org/mailman/listinfo/devel =
<
http://lists.ovirt.org/mailman/listinfo/devel>
>
>
=20
=20
--Apple-Mail=_45C10755-3D4E-480D-978D-C839E4C9C674
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
charset=utf-8
<html><head><meta http-equiv=3D"Content-Type"
content=3D"text/html =
charset=3Dutf-8"></head><body style=3D"word-wrap: break-word; =
-webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" =
class=3D""><br class=3D""><div><blockquote
type=3D"cite" class=3D""><div =
class=3D"">On 19 Dec 2017, at 10:14, Arik Hadas <<a =
href=3D"mailto:ahadas@redhat.com"
class=3D"">ahadas(a)redhat.com</a>&gt; =
wrote:</div><br class=3D"Apple-interchange-newline"><div
class=3D""><div =
dir=3D"ltr" class=3D""><br class=3D""><div
class=3D"gmail_extra"><br =
class=3D""><div class=3D"gmail_quote">On Tue, Dec 19, 2017 at
12:20 AM, =
Michal Skrivanek <span dir=3D"ltr" class=3D""><<a =
href=3D"mailto:michal.skrivanek@redhat.com" target=3D"_blank" =
class=3D"">michal.skrivanek(a)redhat.com</a>&gt;</span>
wrote:<br =
class=3D""><blockquote class=3D"gmail_quote"
style=3D"margin:0px 0px 0px =
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(=
204,204,204);padding-left:1ex"><div class=3D"gmail-HOEnZb"><div
=
class=3D"gmail-h5"><br class=3D"">
> On 18 Dec 2017, at 13:21, Milan Zamazal <<a =
href=3D"mailto:mzamazal@redhat.com"
class=3D"">mzamazal(a)redhat.com</a>&gt;=
wrote:<br class=3D"">
><br class=3D"">
> Yedidyah Bar David <<a href=3D"mailto:didi@redhat.com" =
class=3D"">didi(a)redhat.com</a>&gt; writes:<br
class=3D"">
><br class=3D"">
>> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <<a =
href=3D"mailto:gerrit@ovirt.org"
class=3D"">gerrit(a)ovirt.org</a>&gt; =
wrote:<br class=3D"">
>>> Jenkins CI posted comments on this change.<br
class=3D"">
>>><br class=3D"">
>><br class=3D"">
>>> View Change<br class=3D"">
>>><br class=3D"">
>>> Patch set 3:Continuous-Integration -1<br
class=3D"">
>>><br class=3D"">
>>> Build Failed<br class=3D"">
>>><br class=3D"">
>>> <a =
href=3D"http://jenkins.ovirt.org/job/ovirt-system-tests_master_check...
-el7-x86_64/2882/" rel=3D"noreferrer" target=3D"_blank" =
class=3D"">http://jenkins.ovirt.org/job/<wbr =
class=3D"">ovirt-system-tests_master_<wbr =
class=3D"">check-patch-el7-x86_64/2882/</a><br
class=3D"">
>>> : FAILURE<br class=3D"">
>><br class=3D"">
>> Console output of above job says:<br class=3D"">
>><br class=3D"">
>> 08:13:34 # migrate_vm:<br
class=3D"">
>> 08:16:37 * Collect artifacts:<br
class=3D"">
>> 08:16:40 * Collect artifacts: Success (in
=
0:00:03)<br class=3D"">
>> 08:16:40 # migrate_vm: Success (in 0:03:06)<br =
class=3D"">
>> 08:16:40 # Results located at<br
class=3D"">
>> /dev/shm/ost/deployment-basic-<wbr =
class=3D"">suite-master/default/006_<wbr =
class=3D"">migrations.py.junit.xml<br class=3D"">
>> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)<br =
class=3D"">
>> 08:16:40 Error occured, aborting<br class=3D"">
>><br class=3D"">
>> The file 006_migrations.py.junit.xml [1] says:<br
class=3D"">
>><br class=3D"">
>> <failure type=3D"exceptions.<wbr
class=3D"">AssertionError" =
message=3D"False !=3D True after<br class=3D"">
>> 180 seconds"><br class=3D"">
><br class=3D"">
> Reading the logs, I can see the VM migrates normally and seems to =
be<br class=3D"">
> reported to Engine correctly. When Engine receives =
end-of-migration<br class=3D"">
> event, it sends Destroy to the source (which is correct), calls =
dumpxmls<br class=3D"">
> on the destination in the meantime (looks fine to me) and then =
calls<br class=3D"">
<br class=3D"">
</div></div>looks like a race between getallvmstats reporting VM as Down =
(statusTime: 4296271980) being processed, while there is a =
Down/MigrationSucceeded event arriving (with notify_time 4296273170) at =
about the same time<br class=3D"">
Unfortunately the vdsm.log is not in DEBUG level so there=E2=80=99s very =
little information as to why and what exactly did it send out.<br =
class=3D"">
@infra - can you enable debug log level for vdsm by =
default? </blockquote><blockquote class=3D"gmail_quote" =
style=3D"margin:0px 0px 0px =
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(=
204,204,204);padding-left:1ex">
<br class=3D"">
It does look like a race to me - does it =
reproduce? </blockquote><blockquote class=3D"gmail_quote" =
style=3D"margin:0px 0px 0px =
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(=
204,204,204);padding-left:1ex">
<div class=3D""><div class=3D"gmail-h5"><br
class=3D"">
> Destroy on the destination, which is weird and I don't understand =
why<br class=3D"">
> the Destroy is invoked.<br class=3D"">
><br class=3D"">
> Arik, would you like to take a look? Maybe I overlooked =
something or<br class=3D"">
> maybe there's a bug. The logs are at<br
class=3D"">
> <a =
href=3D"http://jenkins.ovirt.org/job/ovirt-system-tests_master_check...
-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test=
_logs/basic-suite-master/post-006_migrations.py/" rel=3D"noreferrer" =
target=3D"_blank"
class=3D"">http://jenkins.ovirt.org/job/<wbr
=
class=3D"">ovirt-system-tests_master_<wbr =
class=3D"">check-patch-el7-x86_64/2882/<wbr =
class=3D"">artifact/exported-artifacts/<wbr =
class=3D"">basic-suite-master__logs/test_<wbr =
class=3D"">logs/basic-suite-master/post-<wbr =
class=3D"">006_migrations.py/</a><br class=3D"">
> and the interesting things happen around 2017-12-18 =
03:13:43,758-05.<br
class=3D""></div></div></blockquote><div =
class=3D""><br class=3D""></div><div
class=3D"">So it looks like =
that:</div><div class=3D"">1. the engine polls the VMs from the
source =
host</div><div class=3D"">2. right after #1 we get the down event
with =
proper exit reason (=3D migration succeeded) but the engine doesn't =
process it since the VM is being locked by the monitoring as part of =
processing that polling (to prevent two analysis of the same VM =
simultaneously).</div><div class=3D"">3. the result of the polling
is a =
VM in status Down and must probably exit_status=3DNormal</div><div =
class=3D"">4. the engine decides to abort the migration and thus the =
monitoring thread of the source host destroys the VM on the destination =
host.</div><div class=3D""><br
class=3D""></div><div =
class=3D"">Unfortunately we don't have the exit_reason that is returned
=
by the polling.</div><div class=3D"">However, the only option I can
=
think of is that it is different than MigrationSucceeded, because =
otherwise we would have hand-over the VM to the destination host rather =
than aborting the migration [1].</div><div class=3D"">That part of
the =
code recently changed as part of [2] - we used to hand-over the VM when =
we get from the source host:</div><div class=3D"">status =3D Down +
=
exit_status =3D Normal </div><div class=3D"">And in the
database: =
previous_status =3D MigrationFrom</div><div class=3D"">But after
that =
change we require:</div><div class=3D"">status =3D Down +
exit_status =3D =
Normal ** + exit_reason =3D MigrationSucceeded **<br
class=3D""></div><div=
class=3D"">And in the database: previous_status =3D =
MigrationFrom</div><div class=3D""><br
class=3D""></div><div =
class=3D"">Long story short, is it possible that VDSM had set the status =
of the VM to Down and exit_status to Normal but the exit_reason was not =
updated (yet?) to =
MigrationSucceeded?</div></div></div></div></div></blockquote><div><br
=
class=3D""></div>ok, so there might be a plausible =
explanation</div><div>the guest drive mapping introduced a significant =
delay into the VM.getStats call since it tries to update the mapping =
when it detects a change. That is likely to happen on lifecycle changes. =
In the OST case it took 1.2s to finish the whole call, and in the =
meantime the migration has finished. The getStats() call is not written =
with possible state change in mind, so if it so happens and the state =
moves from anything to Down in the middle of it it returns a Down state =
without exitCode and exitReason which confuses engine. We started to use =
the exitReason code to differentiate the various flavors of Down in =
engine in ~4.1 and in this case it results in misleading =E2=80=9CVM =
powered off by admin=E2=80=9D case</div><div><br
class=3D""></div><div>we =
need to fix the VM.getStats() to handle VM state changes in the =
middle</div><div>we need to fix the guest drive mapping updates to =
handle cleanly situations when the VM is either not ready yet or already =
gone</div><div><br =
class=3D""></div><div>Thanks,</div><div>michal</div><div><br
=
class=3D""></div><div><blockquote type=3D"cite"
class=3D""><div =
class=3D""><div dir=3D"ltr" class=3D""><div
class=3D"gmail_extra"><div =
class=3D"gmail_quote"><div class=3D""><br
class=3D""></div><div =
class=3D"">[1] <a =
href=3D"https://github.com/oVirt/ovirt-engine/blob/master/backend/ma...
modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring=
/VmAnalyzer.java#L291" =
class=3D"">https://github.com/oVirt/ovirt-engine/blob/master...
er/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitor=
ing/VmAnalyzer.java#L291</a></div><div
class=3D"">[2] <a =
href=3D"https://gerrit.ovirt.org/#/c/84387/" =
class=3D"">https://gerrit.ovirt.org/#/c/84387/</a><...
=
class=3D""> </div><blockquote
class=3D"gmail_quote" =
style=3D"margin:0px 0px 0px =
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(=
204,204,204);padding-left:1ex"><div class=3D""><div
class=3D"gmail-h5">
><br class=3D"">
>> Can someone please have a look? Thanks.<br class=3D"">
>><br class=3D"">
>> As a side note, if indeed this is the cause for the failure for =
the<br class=3D"">
>> job, it's misleading to say "migrate_vm: Success".<br
class=3D"">=
>><br class=3D"">
>> [1]<br class=3D"">
>> <a =
href=3D"http://jenkins.ovirt.org/job/ovirt-system-tests_master_check...
-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_=
migrations.py.junit.xml" rel=3D"noreferrer" target=3D"_blank" =
class=3D"">http://jenkins.ovirt.org/job/<wbr =
class=3D"">ovirt-system-tests_master_<wbr =
class=3D"">check-patch-el7-x86_64/2882/<wbr =
class=3D"">artifact/exported-artifacts/<wbr =
class=3D"">basic-suite-master__logs/006_<wbr =
class=3D"">migrations.py.junit.xml</a><br class=3D"">
>><br class=3D"">
>>><br class=3D"">
>>> To view, visit change 85177. To unsubscribe, visit =
settings.<br class=3D"">
>>><br class=3D"">
>>> Gerrit-Project: ovirt-system-tests<br
class=3D"">
>>> Gerrit-Branch: master<br class=3D"">
>>> Gerrit-MessageType: comment<br class=3D"">
>>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44<wbr =
class=3D"">be22dd590b5<br class=3D"">
>>> Gerrit-Change-Number: 85177<br class=3D"">
>>> Gerrit-PatchSet: 3<br class=3D"">
>>> Gerrit-Owner: Yedidyah Bar David <<a =
href=3D"mailto:didi@redhat.com"
class=3D"">didi(a)redhat.com</a>&gt;<br =
class=3D"">
>>> Gerrit-Reviewer: Dafna Ron <<a =
href=3D"mailto:dron@redhat.com"
class=3D"">dron(a)redhat.com</a>&gt;<br =
class=3D"">
>>> Gerrit-Reviewer: Eyal Edri <<a =
href=3D"mailto:eedri@redhat.com"
class=3D"">eedri(a)redhat.com</a>&gt;<br =
class=3D"">
>>> Gerrit-Reviewer: Jenkins CI<br class=3D"">
>>> Gerrit-Reviewer: Sandro Bonazzola <<a =
href=3D"mailto:sbonazzo@redhat.com" =
class=3D"">sbonazzo(a)redhat.com</a>&gt;<br
class=3D"">
>>> Gerrit-Reviewer: Yedidyah Bar David <<a =
href=3D"mailto:didi@redhat.com"
class=3D"">didi(a)redhat.com</a>&gt;<br =
class=3D"">
>>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000<br =
class=3D"">
>>> Gerrit-HasComments: No<br class=3D"">
</div></div>> ______________________________<wbr =
class=3D"">_________________<br class=3D"">
> Devel mailing list<br class=3D"">
> <a href=3D"mailto:Devel@ovirt.org"
class=3D"">Devel(a)ovirt.org</a><br =
class=3D"">
> <a
href=3D"http://lists.ovirt.org/mailman/listinfo/devel" =
rel=3D"noreferrer" target=3D"_blank" =
class=3D"">http://lists.ovirt.org/<wbr =
class=3D"">mailman/listinfo/devel</a><br class=3D"">
><br class=3D"">
><br class=3D"">
<br class=3D"">
</blockquote></div><br class=3D""></div></div>
</div></blockquote></div><br
class=3D""></body></html>=
--Apple-Mail=_45C10755-3D4E-480D-978D-C839E4C9C674--