
19 Dec
2017
19 Dec
'17
12:36 p.m.
--Apple-Mail=_45C10755-3D4E-480D-978D-C839E4C9C674 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On 19 Dec 2017, at 10:14, Arik Hadas <ahadas@redhat.com> wrote: >=20 >=20 >=20 > On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek = <michal.skrivanek@redhat.com <mailto:michal.skrivanek@redhat.com>> = wrote: >=20 > > On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal@redhat.com = <mailto:mzamazal@redhat.com>> wrote: > > > > Yedidyah Bar David <didi@redhat.com <mailto:didi@redhat.com>> = writes: > > > >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org = <mailto:gerrit@ovirt.org>> wrote: > >>> Jenkins CI posted comments on this change. > >>> > >> > >>> View Change > >>> > >>> Patch set 3:Continuous-Integration -1 > >>> > >>> Build Failed > >>> > >>> = http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86= _64/2882/ = <http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8= 6_64/2882/> > >>> : FAILURE > >> > >> Console output of above job says: > >> > >> 08:13:34 # migrate_vm: > >> 08:16:37 * Collect artifacts: > >> 08:16:40 * Collect artifacts: Success (in 0:00:03) > >> 08:16:40 # migrate_vm: Success (in 0:03:06) > >> 08:16:40 # Results located at > >> = /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit= .xml > >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) > >> 08:16:40 Error occured, aborting > >> > >> The file 006_migrations.py.junit.xml [1] says: > >> > >> <failure type=3D"exceptions.AssertionError" message=3D"False !=3D = True after > >> 180 seconds"> > > > > Reading the logs, I can see the VM migrates normally and seems to be > > reported to Engine correctly. When Engine receives end-of-migration > > event, it sends Destroy to the source (which is correct), calls = dumpxmls > > on the destination in the meantime (looks fine to me) and then calls >=20 > looks like a race between getallvmstats reporting VM as Down = (statusTime: 4296271980) being processed, while there is a = Down/MigrationSucceeded event arriving (with notify_time 4296273170) at = about the same time > Unfortunately the vdsm.log is not in DEBUG level so there=E2=80=99s = very little information as to why and what exactly did it send out. > @infra - can you enable debug log level for vdsm by default?=20 >=20 > It does look like a race to me - does it reproduce?=20 >=20 > > Destroy on the destination, which is weird and I don't understand = why > > the Destroy is invoked. > > > > Arik, would you like to take a look? Maybe I overlooked something = or > > maybe there's a bug. The logs are at > > = http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86= _64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/ba= sic-suite-master/post-006_migrations.py/ = <http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8= 6_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/b= asic-suite-master/post-006_migrations.py/> > > and the interesting things happen around 2017-12-18 03:13:43,758-05. >=20 > So it looks like that: > 1. the engine polls the VMs from the source host > 2. right after #1 we get the down event with proper exit reason (=3D = migration succeeded) but the engine doesn't process it since the VM is = being locked by the monitoring as part of processing that polling (to = prevent two analysis of the same VM simultaneously). > 3. the result of the polling is a VM in status Down and must probably = exit_status=3DNormal > 4. the engine decides to abort the migration and thus the monitoring = thread of the source host destroys the VM on the destination host. >=20 > Unfortunately we don't have the exit_reason that is returned by the = polling. > However, the only option I can think of is that it is different than = MigrationSucceeded, because otherwise we would have hand-over the VM to = the destination host rather than aborting the migration [1]. > That part of the code recently changed as part of [2] - we used to = hand-over the VM when we get from the source host: > status =3D Down + exit_status =3D Normal=20 > And in the database: previous_status =3D MigrationFrom > But after that change we require: > status =3D Down + exit_status =3D Normal ** + exit_reason =3D = MigrationSucceeded ** > And in the database: previous_status =3D MigrationFrom >=20 > Long story short, is it possible that VDSM had set the status of the = VM to Down and exit_status to Normal but the exit_reason was not updated = (yet?) to MigrationSucceeded? ok, so there might be a plausible explanation the guest drive mapping introduced a significant delay into the = VM.getStats call since it tries to update the mapping when it detects a = change. That is likely to happen on lifecycle changes. In the OST case = it took 1.2s to finish the whole call, and in the meantime the migration = has finished. The getStats() call is not written with possible state = change in mind, so if it so happens and the state moves from anything to = Down in the middle of it it returns a Down state without exitCode and = exitReason which confuses engine. We started to use the exitReason code = to differentiate the various flavors of Down in engine in ~4.1 and in = this case it results in misleading =E2=80=9CVM powered off by admin=E2=80=9D= case we need to fix the VM.getStats() to handle VM state changes in the = middle we need to fix the guest drive mapping updates to handle cleanly = situations when the VM is either not ready yet or already gone Thanks, michal >=20 > [1] = https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/= vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnaly= zer.java#L291 = <https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules= /vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnal= yzer.java#L291> > [2] https://gerrit.ovirt.org/#/c/84387/ = <https://gerrit.ovirt.org/#/c/84387/> > =20 > > > >> Can someone please have a look? Thanks. > >> > >> As a side note, if indeed this is the cause for the failure for the > >> job, it's misleading to say "migrate_vm: Success". > >> > >> [1] > >> = http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86= _64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migratio= ns.py.junit.xml = <http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x8= 6_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrati= ons.py.junit.xml> > >> > >>> > >>> To view, visit change 85177. To unsubscribe, visit settings. > >>> > >>> Gerrit-Project: ovirt-system-tests > >>> Gerrit-Branch: master > >>> Gerrit-MessageType: comment > >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5 > >>> Gerrit-Change-Number: 85177 > >>> Gerrit-PatchSet: 3 > >>> Gerrit-Owner: Yedidyah Bar David <didi@redhat.com = <mailto:didi@redhat.com>> > >>> Gerrit-Reviewer: Dafna Ron <dron@redhat.com = <mailto:dron@redhat.com>> > >>> Gerrit-Reviewer: Eyal Edri <eedri@redhat.com = <mailto:eedri@redhat.com>> > >>> Gerrit-Reviewer: Jenkins CI > >>> Gerrit-Reviewer: Sandro Bonazzola <sbonazzo@redhat.com = <mailto:sbonazzo@redhat.com>> > >>> Gerrit-Reviewer: Yedidyah Bar David <didi@redhat.com = <mailto:didi@redhat.com>> > >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000 > >>> Gerrit-HasComments: No > > _______________________________________________ > > Devel mailing list > > Devel@ovirt.org <mailto:Devel@ovirt.org> > > http://lists.ovirt.org/mailman/listinfo/devel = <http://lists.ovirt.org/mailman/listinfo/devel> > > > > >=20 >=20 --Apple-Mail=_45C10755-3D4E-480D-978D-C839E4C9C674 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 <html><head><meta http-equiv=3D"Content-Type" content=3D"text/html = charset=3Dutf-8"></head><body style=3D"word-wrap: break-word; = -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" = class=3D""><br class=3D""><div><blockquote type=3D"cite" class=3D""><div = class=3D"">On 19 Dec 2017, at 10:14, Arik Hadas <<a = href=3D"mailto:ahadas@redhat.com" class=3D"">ahadas@redhat.com</a>> = wrote:</div><br class=3D"Apple-interchange-newline"><div class=3D""><div = dir=3D"ltr" class=3D""><br class=3D""><div class=3D"gmail_extra"><br = class=3D""><div class=3D"gmail_quote">On Tue, Dec 19, 2017 at 12:20 AM, = Michal Skrivanek <span dir=3D"ltr" class=3D""><<a = href=3D"mailto:michal.skrivanek@redhat.com" target=3D"_blank" = class=3D"">michal.skrivanek@redhat.com</a>></span> wrote:<br = class=3D""><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px = 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(= 204,204,204);padding-left:1ex"><div class=3D"gmail-HOEnZb"><div = class=3D"gmail-h5"><br class=3D""> > On 18 Dec 2017, at 13:21, Milan Zamazal <<a = href=3D"mailto:mzamazal@redhat.com" class=3D"">mzamazal@redhat.com</a>>= wrote:<br class=3D""> ><br class=3D""> > Yedidyah Bar David <<a href=3D"mailto:didi@redhat.com" = class=3D"">didi@redhat.com</a>> writes:<br class=3D""> ><br class=3D""> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <<a = href=3D"mailto:gerrit@ovirt.org" class=3D"">gerrit@ovirt.org</a>> = wrote:<br class=3D""> >>> Jenkins CI posted comments on this change.<br class=3D""> >>><br class=3D""> >><br class=3D""> >>> View Change<br class=3D""> >>><br class=3D""> >>> Patch set 3:Continuous-Integration -1<br class=3D""> >>><br class=3D""> >>> Build Failed<br class=3D""> >>><br class=3D""> >>> <a = href=3D"http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch= -el7-x86_64/2882/" rel=3D"noreferrer" target=3D"_blank" = class=3D"">http://jenkins.ovirt.org/job/<wbr = class=3D"">ovirt-system-tests_master_<wbr = class=3D"">check-patch-el7-x86_64/2882/</a><br class=3D""> >>> : FAILURE<br class=3D""> >><br class=3D""> >> Console output of above job says:<br class=3D""> >><br class=3D""> >> 08:13:34 # migrate_vm:<br class=3D""> >> 08:16:37 * Collect artifacts:<br class=3D""> >> 08:16:40 * Collect artifacts: Success (in = 0:00:03)<br class=3D""> >> 08:16:40 # migrate_vm: Success (in 0:03:06)<br = class=3D""> >> 08:16:40 # Results located at<br class=3D""> >> /dev/shm/ost/deployment-basic-<wbr = class=3D"">suite-master/default/006_<wbr = class=3D"">migrations.py.junit.xml<br class=3D""> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)<br = class=3D""> >> 08:16:40 Error occured, aborting<br class=3D""> >><br class=3D""> >> The file 006_migrations.py.junit.xml [1] says:<br class=3D""> >><br class=3D""> >> <failure type=3D"exceptions.<wbr class=3D"">AssertionError" = message=3D"False !=3D True after<br class=3D""> >> 180 seconds"><br class=3D""> ><br class=3D""> > Reading the logs, I can see the VM migrates normally and seems to = be<br class=3D""> > reported to Engine correctly. When Engine receives = end-of-migration<br class=3D""> > event, it sends Destroy to the source (which is correct), calls = dumpxmls<br class=3D""> > on the destination in the meantime (looks fine to me) and then = calls<br class=3D""> <br class=3D""> </div></div>looks like a race between getallvmstats reporting VM as Down = (statusTime: 4296271980) being processed, while there is a = Down/MigrationSucceeded event arriving (with notify_time 4296273170) at = about the same time<br class=3D""> Unfortunately the vdsm.log is not in DEBUG level so there=E2=80=99s very = little information as to why and what exactly did it send out.<br = class=3D""> @infra - can you enable debug log level for vdsm by = default? </blockquote><blockquote class=3D"gmail_quote" = style=3D"margin:0px 0px 0px = 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(= 204,204,204);padding-left:1ex"> <br class=3D""> It does look like a race to me - does it = reproduce? </blockquote><blockquote class=3D"gmail_quote" = style=3D"margin:0px 0px 0px = 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(= 204,204,204);padding-left:1ex"> <div class=3D""><div class=3D"gmail-h5"><br class=3D""> > Destroy on the destination, which is weird and I don't understand = why<br class=3D""> > the Destroy is invoked.<br class=3D""> ><br class=3D""> > Arik, would you like to take a look? Maybe I overlooked = something or<br class=3D""> > maybe there's a bug. The logs are at<br class=3D""> > <a = href=3D"http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch= -el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test= _logs/basic-suite-master/post-006_migrations.py/" rel=3D"noreferrer" = target=3D"_blank" class=3D"">http://jenkins.ovirt.org/job/<wbr = class=3D"">ovirt-system-tests_master_<wbr = class=3D"">check-patch-el7-x86_64/2882/<wbr = class=3D"">artifact/exported-artifacts/<wbr = class=3D"">basic-suite-master__logs/test_<wbr = class=3D"">logs/basic-suite-master/post-<wbr = class=3D"">006_migrations.py/</a><br class=3D""> > and the interesting things happen around 2017-12-18 = 03:13:43,758-05.<br class=3D""></div></div></blockquote><div = class=3D""><br class=3D""></div><div class=3D"">So it looks like = that:</div><div class=3D"">1. the engine polls the VMs from the source = host</div><div class=3D"">2. right after #1 we get the down event with = proper exit reason (=3D migration succeeded) but the engine doesn't = process it since the VM is being locked by the monitoring as part of = processing that polling (to prevent two analysis of the same VM = simultaneously).</div><div class=3D"">3. the result of the polling is a = VM in status Down and must probably exit_status=3DNormal</div><div = class=3D"">4. the engine decides to abort the migration and thus the = monitoring thread of the source host destroys the VM on the destination = host.</div><div class=3D""><br class=3D""></div><div = class=3D"">Unfortunately we don't have the exit_reason that is returned = by the polling.</div><div class=3D"">However, the only option I can = think of is that it is different than MigrationSucceeded, because = otherwise we would have hand-over the VM to the destination host rather = than aborting the migration [1].</div><div class=3D"">That part of the = code recently changed as part of [2] - we used to hand-over the VM when = we get from the source host:</div><div class=3D"">status =3D Down + = exit_status =3D Normal </div><div class=3D"">And in the database: = previous_status =3D MigrationFrom</div><div class=3D"">But after that = change we require:</div><div class=3D"">status =3D Down + exit_status =3D = Normal ** + exit_reason =3D MigrationSucceeded **<br class=3D""></div><div= class=3D"">And in the database: previous_status =3D = MigrationFrom</div><div class=3D""><br class=3D""></div><div = class=3D"">Long story short, is it possible that VDSM had set the status = of the VM to Down and exit_status to Normal but the exit_reason was not = updated (yet?) to = MigrationSucceeded?</div></div></div></div></div></blockquote><div><br = class=3D""></div>ok, so there might be a plausible = explanation</div><div>the guest drive mapping introduced a significant = delay into the VM.getStats call since it tries to update the mapping = when it detects a change. That is likely to happen on lifecycle changes. = In the OST case it took 1.2s to finish the whole call, and in the = meantime the migration has finished. The getStats() call is not written = with possible state change in mind, so if it so happens and the state = moves from anything to Down in the middle of it it returns a Down state = without exitCode and exitReason which confuses engine. We started to use = the exitReason code to differentiate the various flavors of Down in = engine in ~4.1 and in this case it results in misleading =E2=80=9CVM = powered off by admin=E2=80=9D case</div><div><br class=3D""></div><div>we = need to fix the VM.getStats() to handle VM state changes in the = middle</div><div>we need to fix the guest drive mapping updates to = handle cleanly situations when the VM is either not ready yet or already = gone</div><div><br = class=3D""></div><div>Thanks,</div><div>michal</div><div><br = class=3D""></div><div><blockquote type=3D"cite" class=3D""><div = class=3D""><div dir=3D"ltr" class=3D""><div class=3D"gmail_extra"><div = class=3D"gmail_quote"><div class=3D""><br class=3D""></div><div = class=3D"">[1] <a = href=3D"https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/= modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring= /VmAnalyzer.java#L291" = class=3D"">https://github.com/oVirt/ovirt-engine/blob/master/backend/manag= er/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitor= ing/VmAnalyzer.java#L291</a></div><div class=3D"">[2] <a = href=3D"https://gerrit.ovirt.org/#/c/84387/" = class=3D"">https://gerrit.ovirt.org/#/c/84387/</a></div><div = class=3D""> </div><blockquote class=3D"gmail_quote" = style=3D"margin:0px 0px 0px = 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(= 204,204,204);padding-left:1ex"><div class=3D""><div class=3D"gmail-h5"> ><br class=3D""> >> Can someone please have a look? Thanks.<br class=3D""> >><br class=3D""> >> As a side note, if indeed this is the cause for the failure for = the<br class=3D""> >> job, it's misleading to say "migrate_vm: Success".<br class=3D"">= >><br class=3D""> >> [1]<br class=3D""> >> <a = href=3D"http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch= -el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_= migrations.py.junit.xml" rel=3D"noreferrer" target=3D"_blank" = class=3D"">http://jenkins.ovirt.org/job/<wbr = class=3D"">ovirt-system-tests_master_<wbr = class=3D"">check-patch-el7-x86_64/2882/<wbr = class=3D"">artifact/exported-artifacts/<wbr = class=3D"">basic-suite-master__logs/006_<wbr = class=3D"">migrations.py.junit.xml</a><br class=3D""> >><br class=3D""> >>><br class=3D""> >>> To view, visit change 85177. To unsubscribe, visit = settings.<br class=3D""> >>><br class=3D""> >>> Gerrit-Project: ovirt-system-tests<br class=3D""> >>> Gerrit-Branch: master<br class=3D""> >>> Gerrit-MessageType: comment<br class=3D""> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44<wbr = class=3D"">be22dd590b5<br class=3D""> >>> Gerrit-Change-Number: 85177<br class=3D""> >>> Gerrit-PatchSet: 3<br class=3D""> >>> Gerrit-Owner: Yedidyah Bar David <<a = href=3D"mailto:didi@redhat.com" class=3D"">didi@redhat.com</a>><br = class=3D""> >>> Gerrit-Reviewer: Dafna Ron <<a = href=3D"mailto:dron@redhat.com" class=3D"">dron@redhat.com</a>><br = class=3D""> >>> Gerrit-Reviewer: Eyal Edri <<a = href=3D"mailto:eedri@redhat.com" class=3D"">eedri@redhat.com</a>><br = class=3D""> >>> Gerrit-Reviewer: Jenkins CI<br class=3D""> >>> Gerrit-Reviewer: Sandro Bonazzola <<a = href=3D"mailto:sbonazzo@redhat.com" = class=3D"">sbonazzo@redhat.com</a>><br class=3D""> >>> Gerrit-Reviewer: Yedidyah Bar David <<a = href=3D"mailto:didi@redhat.com" class=3D"">didi@redhat.com</a>><br = class=3D""> >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000<br = class=3D""> >>> Gerrit-HasComments: No<br class=3D""> </div></div>> ______________________________<wbr = class=3D"">_________________<br class=3D""> > Devel mailing list<br class=3D""> > <a href=3D"mailto:Devel@ovirt.org" class=3D"">Devel@ovirt.org</a><br = class=3D""> > <a href=3D"http://lists.ovirt.org/mailman/listinfo/devel" = rel=3D"noreferrer" target=3D"_blank" = class=3D"">http://lists.ovirt.org/<wbr = class=3D"">mailman/listinfo/devel</a><br class=3D""> ><br class=3D""> ><br class=3D""> <br class=3D""> </blockquote></div><br class=3D""></div></div> </div></blockquote></div><br class=3D""></body></html>= --Apple-Mail=_45C10755-3D4E-480D-978D-C839E4C9C674--