Migration failed (was: Change in ovirt-system-tests[master]: Run engine-setup with OTOPI_DEBUG=1)

On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6... : FAILURE
Console output of above job says:
08:13:34 # migrate_vm: 08:16:37 * Collect artifacts: 08:16:40 * Collect artifacts: Success (in 0:00:03) 08:16:40 # migrate_vm: Success (in 0:03:06) 08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) 08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause of the job's failure, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6...
To view, visit change 85177. To unsubscribe, visit settings.
Gerrit-Project: ovirt-system-tests Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5 Gerrit-Change-Number: 85177 Gerrit-PatchSet: 3 Gerrit-Owner: Yedidyah Bar David <didi@redhat.com> Gerrit-Reviewer: Dafna Ron <dron@redhat.com> Gerrit-Reviewer: Eyal Edri <eedri@redhat.com> Gerrit-Reviewer: Jenkins CI Gerrit-Reviewer: Sandro Bonazzola <sbonazzo@redhat.com> Gerrit-Reviewer: Yedidyah Bar David <didi@redhat.com> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000 Gerrit-HasComments: No
-- Didi
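For context, the "False != True after 180 seconds" message above is the typical output of a timeout-based assertion helper: it keeps polling a condition and, when the timeout expires, asserts the last (False) result against True. A minimal sketch of such a helper, with hypothetical names, not necessarily OST's actual implementation:

    import time

    def assert_true_within(condition, timeout=180, interval=3):
        """Poll `condition` until it returns True or `timeout` seconds elapse."""
        deadline = time.time() + timeout
        result = condition()
        while not result and time.time() < deadline:
            time.sleep(interval)
            result = condition()
        # on timeout this produces e.g. "False != True after 180 seconds"
        assert result == True, '%s != %s after %s seconds' % (result, True, timeout)

    # hypothetical usage: wait for the migrated VM to be reported as up again
    # assert_true_within(lambda: api.vms.get('vm0').status.state == 'up')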

Yedidyah Bar David <didi@redhat.com> writes:
On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6... : FAILURE
Console output of above job says:
08:13:34 # migrate_vm: 08:16:37 * Collect artifacts: 08:16:40 * Collect artifacts: Success (in 0:00:03) 08:16:40 # migrate_vm: Success (in 0:03:06) 08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) 08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Reading the logs, I can see the VM migrates normally and seems to be reported to Engine correctly. When the Engine receives the end-of-migration event, it sends Destroy to the source (which is correct), calls dumpxmls on the destination in the meantime (looks fine to me) and then calls Destroy on the destination, which is weird and I don't understand why the Destroy is invoked.
Arik, would you like to take a look? Maybe I overlooked something or maybe there's a bug. The logs are at http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6... and the interesting things happen around 2017-12-18 03:13:43,758-05.
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause for the failure for the job, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6...

On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal@redhat.com> wrote:
Yedidyah Bar David <didi@redhat.com> writes:
On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6... : FAILURE
Console output of above job says:
08:13:34 # migrate_vm: 08:16:37 * Collect artifacts: 08:16:40 * Collect artifacts: Success (in 0:00:03) 08:16:40 # migrate_vm: Success (in 0:03:06) 08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) 08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Reading the logs, I can see the VM migrates normally and seems to be reported to Engine correctly. When Engine receives end-of-migration event, it sends Destroy to the source (which is correct), calls dumpxmls on the destination in the meantime (looks fine to me) and then calls
looks like a race between getallvmstats reporting the VM as Down (statusTime: 4296271980) being processed, while there is a Down/MigrationSucceeded event arriving (with notify_time 4296273170) at about the same time. Unfortunately the vdsm.log is not at DEBUG level, so there's very little information as to why and what exactly it sent out. @infra - can you enable debug log level for vdsm by default?
It does look like a race to me - does it reproduce?
Destroy on the destination, which is weird and I don't understand why the Destroy is invoked.
Arik, would you like to take a look? Maybe I overlooked something or maybe there's a bug. The logs are at http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6... and the interesting things happen around 2017-12-18 03:13:43,758-05.
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause for the failure for the job, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6...

On Dec 19, 2017 00:20, "Michal Skrivanek" <michal.skrivanek@redhat.com> wrote:
On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal@redhat.com> wrote:
Yedidyah Bar David <didi@redhat.com> writes:
On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/ : FAILURE
Console output of above job says:
08:13:34 # migrate_vm: 08:16:37 * Collect artifacts: 08:16:40 * Collect artifacts: Success (in 0:00:03) 08:16:40 # migrate_vm: Success (in 0:03:06) 08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) 08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Reading the logs, I can see the VM migrates normally and seems to be reported to Engine correctly. When Engine receives end-of-migration event, it sends Destroy to the source (which is correct), calls dumpxmls on the destination in the meantime (looks fine to me) and then calls
looks like a race between getallvmstats reporting the VM as Down (statusTime: 4296271980) being processed, while there is a Down/MigrationSucceeded event arriving (with notify_time 4296273170) at about the same time. Unfortunately the vdsm.log is not at DEBUG level, so there's very little information as to why and what exactly it sent out. @infra - can you enable debug log level for vdsm by default?
How do you enable debug mode for vdsm?
It does look like a race to me - does it reproduce?
Destroy on the destination, which is weird and I don't understand why the Destroy is invoked.
Arik, would you like to take a look? Maybe I overlooked something or maybe there's a bug. The logs are at http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/ and the interesting things happen around 2017-12-18 03:13:43,758-05.
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause for the failure for the job, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml

On 19 Dec 2017, at 07:53, Eyal Edri <eedri@redhat.com> wrote:
On Dec 19, 2017 00:20, "Michal Skrivanek" <michal.skrivanek@redhat.com> wrote:
looks like a race between getallvmstats reporting the VM as Down (statusTime: 4296271980) being processed, while there is a Down/MigrationSucceeded event arriving (with notify_time 4296273170) at about the same time. Unfortunately the vdsm.log is not at DEBUG level, so there's very little information as to why and what exactly it sent out. @infra - can you enable debug log level for vdsm by default?
How do you enable debug mode for vdsm?
Can be permanently set in /etc/vdsm/logger.conf. I suppose enabling it for all loggers would be useful; see https://access.redhat.com/articles/2919931
Thanks, michal
It does look like a race to me - does it reproduce?
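For reference, the /etc/vdsm/logger.conf mentioned above is a standard Python fileConfig-style ini file, so raising verbosity means changing the level= entries of the relevant logger sections and restarting vdsmd afterwards. The excerpt below is only illustrative; the exact section names and the full set of loggers differ between vdsm versions:

    # /etc/vdsm/logger.conf (illustrative excerpt, not a verbatim copy of the shipped file)
    [logger_root]
    level=DEBUG        ; was INFO
    handlers=syslog,logfile

    [logger_vds]
    level=DEBUG        ; was INFO
    handlers=logfile
    qualname=vds
    propagate=0

After editing the file, restart the service (systemctl restart vdsmd) so the new level takes effect.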

On Tue, Dec 19, 2017 at 9:22 AM, Michal Skrivanek < michal.skrivanek@redhat.com> wrote:
On 19 Dec 2017, at 07:53, Eyal Edri <eedri@redhat.com> wrote:
On Dec 19, 2017 00:20, "Michal Skrivanek" <michal.skrivanek@redhat.com> wrote:
On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal@redhat.com> wrote:
Yedidyah Bar David <didi@redhat.com> writes:
On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/ : FAILURE
Console output of above job says:
08:13:34 # migrate_vm: 08:16:37 * Collect artifacts: 08:16:40 * Collect artifacts: Success (in 0:00:03) 08:16:40 # migrate_vm: Success (in 0:03:06) 08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) 08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Reading the logs, I can see the VM migrates normally and seems to be reported to Engine correctly. When Engine receives end-of-migration event, it sends Destroy to the source (which is correct), calls dumpxmls on the destination in the meantime (looks fine to me) and then calls
looks like a race between getallvmstats reporting VM as Down (statusTime: 4296271980) being processed, while there is a Down/MigrationSucceeded event arriving (with notify_time 4296273170) at about the same time Unfortunately the vdsm.log is not in DEBUG level so there’s very little information as to why and what exactly did it send out. @infra - can you enable debug log level for vdsm by default?
How do you enable debug mode for vdsm ?
Can be permanently set in /etc/vdsm/logger.conf I suppose enabling for all loggers would be useful
Since this file probably is created only after the host is added, I assume it can be set only after the add-host test, and not in the deployment scripts. So it might need to be added as a test or step after the 'verify add host' step. I'll open a ticket on it and we'll see who can assist with it.
Thanks, michal
It does look like a race to me - does it reproduce?
Destroy on the destination, which is weird and I don't understand why the Destroy is invoked.
Arik, would you like to take a look? Maybe I overlooked something or maybe there's a bug. The logs are at http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/ and the interesting things happen around 2017-12-18 03:13:43,758-05.
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause for the failure for the job, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml
-- Eyal edri MANAGER RHV DevOps EMEA VIRTUALIZATION R&D Red Hat EMEA <https://www.redhat.com/> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted> phone: +972-9-7692018 irc: eedri (on #tlv #rhev-dev #rhev-integ)
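A very rough sketch of the kind of step Eyal mentions: right after the 'verify add host' step, the suite could switch the hosts' vdsm logging to DEBUG over ssh and restart vdsmd. The host names, the sed expression, and the assumption of direct ssh access are purely illustrative; this is not an existing OST helper:

    import subprocess

    # hypothetical host names from the basic suite
    HOSTS = ["lago-basic-suite-master-host-0", "lago-basic-suite-master-host-1"]

    def enable_vdsm_debug(host):
        """Raise every level= entry in /etc/vdsm/logger.conf to DEBUG and restart vdsmd."""
        for cmd in ("sed -i 's/^level=.*/level=DEBUG/' /etc/vdsm/logger.conf",
                    "systemctl restart vdsmd"):
            subprocess.check_call(["ssh", "root@" + host, cmd])

    def test_enable_vdsm_debug_logging():
        # would run once, right after the 'verify add host' step
        for host in HOSTS:
            enable_vdsm_debug(host)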

On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek < michal.skrivanek@redhat.com> wrote:
On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal@redhat.com> wrote:
Yedidyah Bar David <didi@redhat.com> writes:
On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/ : FAILURE
Console output of above job says:
08:13:34 # migrate_vm: 08:16:37 * Collect artifacts: 08:16:40 * Collect artifacts: Success (in 0:00:03) 08:16:40 # migrate_vm: Success (in 0:03:06) 08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) 08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Reading the logs, I can see the VM migrates normally and seems to be reported to Engine correctly. When Engine receives end-of-migration event, it sends Destroy to the source (which is correct), calls dumpxmls on the destination in the meantime (looks fine to me) and then calls
looks like a race between getallvmstats reporting VM as Down (statusTime: 4296271980) being processed, while there is a Down/MigrationSucceeded event arriving (with notify_time 4296273170) at about the same time Unfortunately the vdsm.log is not in DEBUG level so there’s very little information as to why and what exactly did it send out. @infra - can you enable debug log level for vdsm by default?
It does look like a race to me - does it reproduce?
Destroy on the destination, which is weird and I don't understand why the Destroy is invoked.
Arik, would you like to take a look? Maybe I overlooked something or maybe there's a bug. The logs are at http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/ and the interesting things happen around 2017-12-18 03:13:43,758-05.
So it looks like that:
1. the engine polls the VMs from the source host
2. right after #1 we get the Down event with the proper exit reason (= migration succeeded), but the engine doesn't process it since the VM is locked by the monitoring as part of processing that polling (to prevent two analyses of the same VM running simultaneously).
3. the result of the polling is a VM in status Down and most probably exit_status=Normal
4. the engine decides to abort the migration and thus the monitoring thread of the source host destroys the VM on the destination host.
Unfortunately we don't have the exit_reason that is returned by the polling. However, the only option I can think of is that it is different from MigrationSucceeded, because otherwise we would have handed over the VM to the destination host rather than aborting the migration [1].
That part of the code recently changed as part of [2] - we used to hand over the VM when we get from the source host:
status = Down + exit_status = Normal
And in the database: previous_status = MigrationFrom
But after that change we require:
status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded **
And in the database: previous_status = MigrationFrom
Long story short, is it possible that VDSM had set the status of the VM to Down and exit_status to Normal but the exit_reason was not updated (yet?) to MigrationSucceeded?
[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vd...
[2] https://gerrit.ovirt.org/#/c/84387/
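To make the before/after condition above concrete, here is an illustrative paraphrase of the hand-over check as described, in plain Python (not the actual VmAnalyzer code):

    def should_hand_over(status, exit_status, exit_reason, previous_status):
        # before the change ([2]): hand over on Down + Normal while migrating
        before = (previous_status == "MigrationFrom"
                  and status == "Down"
                  and exit_status == "Normal")
        # after the change: additionally require exit_reason == MigrationSucceeded
        after = before and exit_reason == "MigrationSucceeded"
        return before, after

    # suspected report: Down + Normal, but exit_reason not (yet) MigrationSucceeded
    print(should_hand_over("Down", "Normal", None, "MigrationFrom"))
    # -> (True, False): the old code would hand over, the new code aborts the migration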
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause for the failure for the job, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml

On 19 Dec 2017, at 10:14, Arik Hadas <ahadas@redhat.com> wrote:
On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
Long story short, is it possible that VDSM had set the status of the VM to Down and exit_status to Normal but the exit_reason was not updated (yet?) to MigrationSucceeded?
ok, so there might be a plausible explanation: the guest drive mapping introduced a significant delay into the VM.getStats call, since it tries to update the mapping when it detects a change. That is likely to happen on lifecycle changes. In the OST case it took 1.2s to finish the whole call, and in the meantime the migration has finished. The getStats() call is not written with a possible state change in mind, so if the state moves from anything to Down in the middle of it, it returns a Down state without exitCode and exitReason, which confuses the engine. We started to use the exitReason code to differentiate the various flavors of Down in the engine in ~4.1, and in this case it results in a misleading "VM powered off by admin" case.
we need to fix VM.getStats() to handle VM state changes in the middle
we need to fix the guest drive mapping updates to cleanly handle situations when the VM is either not ready yet or already gone
Thanks,
michal
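A minimal, purely illustrative sketch of the race described above (this is not vdsm code): if the stats call spends ~1.2 seconds on slow work such as refreshing the guest drive mapping and only then reads the VM state, the VM may already have moved to Down while its exit code/reason are not yet recorded, so the caller sees Down with no exitCode/exitReason:

    import threading
    import time

    class Vm:
        def __init__(self):
            self.status = "Up"
            self.exit_code = None
            self.exit_reason = None

        def get_stats(self):
            time.sleep(1.2)                      # slow work, e.g. refreshing guest drive mapping
            stats = {"status": self.status}      # status read only after the delay
            if self.exit_code is not None:       # exit info may not have been recorded yet
                stats["exitCode"] = self.exit_code
                stats["exitReason"] = self.exit_reason
            return stats

    vm = Vm()

    def finish_migration():
        time.sleep(0.5)
        vm.status = "Down"                                        # status flips first...
        time.sleep(1.0)
        vm.exit_code, vm.exit_reason = 0, "MigrationSucceeded"    # ...exit info only later

    threading.Thread(target=finish_migration).start()
    print(vm.get_stats())   # typically {'status': 'Down'} with no exitCode/exitReason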

On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek < michal.skrivanek@redhat.com> wrote:
On 19 Dec 2017, at 10:14, Arik Hadas <ahadas@redhat.com> wrote:
On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek < michal.skrivanek@redhat.com> wrote:
On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal@redhat.com> wrote:
Yedidyah Bar David <didi@redhat.com> writes:
On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/ : FAILURE
Console output of above job says:
08:13:34 # migrate_vm: 08:16:37 * Collect artifacts: 08:16:40 * Collect artifacts: Success (in 0:00:03) 08:16:40 # migrate_vm: Success (in 0:03:06) 08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) 08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Reading the logs, I can see the VM migrates normally and seems to be reported to Engine correctly. When Engine receives end-of-migration event, it sends Destroy to the source (which is correct), calls dumpxmls on the destination in the meantime (looks fine to me) and then calls
looks like a race between getallvmstats reporting VM as Down (statusTime: 4296271980) being processed, while there is a Down/MigrationSucceeded event arriving (with notify_time 4296273170) at about the same time Unfortunately the vdsm.log is not in DEBUG level so there’s very little information as to why and what exactly did it send out. @infra - can you enable debug log level for vdsm by default?
It does look like a race to me - does it reproduce?
Destroy on the destination, which is weird and I don't understand why the Destroy is invoked.
Arik, would you like to take a look? Maybe I overlooked something or maybe there's a bug. The logs are at http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/ and the interesting things happen around 2017-12-18 03:13:43,758-05.
So it looks like that: 1. the engine polls the VMs from the source host 2. right after #1 we get the down event with proper exit reason (= migration succeeded) but the engine doesn't process it since the VM is being locked by the monitoring as part of processing that polling (to prevent two analysis of the same VM simultaneously). 3. the result of the polling is a VM in status Down and must probably exit_status=Normal 4. the engine decides to abort the migration and thus the monitoring thread of the source host destroys the VM on the destination host.
Unfortunately we don't have the exit_reason that is returned by the polling. However, the only option I can think of is that it is different than MigrationSucceeded, because otherwise we would have hand-over the VM to the destination host rather than aborting the migration [1]. That part of the code recently changed as part of [2] - we used to hand-over the VM when we get from the source host: status = Down + exit_status = Normal And in the database: previous_status = MigrationFrom But after that change we require: status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded ** And in the database: previous_status = MigrationFrom
Long story short, is it possible that VDSM had set the status of the VM to Down and exit_status to Normal but the exit_reason was not updated (yet?) to MigrationSucceeded?
ok, so there might be a plausible explanation the guest drive mapping introduced a significant delay into the VM.getStats call since it tries to update the mapping when it detects a change. That is likely to happen on lifecycle changes. In the OST case it took 1.2s to finish the whole call, and in the meantime the migration has finished. The getStats() call is not written with possible state change in mind, so if it so happens and the state moves from anything to Down in the middle of it it returns a Down state without exitCode and exitReason which confuses engine. We started to use the exitReason code to differentiate the various flavors of Down in engine in ~4.1 and in this case it results in misleading “VM powered off by admin” case
we need to fix the VM.getStats() to handle VM state changes in the middle we need to fix the guest drive mapping updates to handle cleanly situations when the VM is either not ready yet or already gone
Has anyone filed BZs for both? Can we ensure we do? The bottom line result sounds severe to me. Y.
Thanks, michal
[1] https://github.com/oVirt/ovirt-engine/blob/master/ backend/manager/modules/vdsbroker/src/main/java/org/ ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L291 [2] https://gerrit.ovirt.org/#/c/84387/
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause for the failure for the job, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check
-patch-el7-x86_64/2882/artifact/exported-artifacts/basic- suite-master__logs/006_migrations.py.junit.xml
To view, visit change 85177. To unsubscribe, visit settings.
Gerrit-Project: ovirt-system-tests Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5 Gerrit-Change-Number: 85177 Gerrit-PatchSet: 3 Gerrit-Owner: Yedidyah Bar David <didi@redhat.com> Gerrit-Reviewer: Dafna Ron <dron@redhat.com> Gerrit-Reviewer: Eyal Edri <eedri@redhat.com> Gerrit-Reviewer: Jenkins CI Gerrit-Reviewer: Sandro Bonazzola <sbonazzo@redhat.com> Gerrit-Reviewer: Yedidyah Bar David <didi@redhat.com> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000 Gerrit-HasComments: No
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On 19 Dec 2017, at 15:13, Yaniv Kaul <ykaul@redhat.com> wrote:
On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 19 Dec 2017, at 10:14, Arik Hadas <ahadas@redhat.com> wrote:
On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 18 Dec 2017, at 13:21, Milan Zamazal <mzamazal@redhat.com> wrote:
Yedidyah Bar David <didi@redhat.com> writes:
On Mon, Dec 18, 2017 at 10:17 AM, Code Review <gerrit@ovirt.org> wrote:
Jenkins CI posted comments on this change.
View Change
Patch set 3:Continuous-Integration -1
Build Failed
http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6... : FAILURE
Console output of above job says:
08:13:34 # migrate_vm:
08:16:37 * Collect artifacts:
08:16:40 * Collect artifacts: Success (in 0:00:03)
08:16:40 # migrate_vm: Success (in 0:03:06)
08:16:40 # Results located at /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
08:16:40 Error occured, aborting
The file 006_migrations.py.junit.xml [1] says:
<failure type="exceptions.AssertionError" message="False != True after 180 seconds">
Reading the logs, I can see the VM migrates normally and seems to be reported to Engine correctly. When Engine receives the end-of-migration event, it sends Destroy to the source (which is correct), calls dumpxmls on the destination in the meantime (looks fine to me) and then calls
Looks like a race between the getallvmstats report of the VM as Down (statusTime: 4296271980) being processed, while a Down/MigrationSucceeded event is arriving (with notify_time 4296273170) at about the same time. Unfortunately the vdsm.log is not at DEBUG level, so there's very little information as to why and what exactly it sent out. @infra - can you enable debug log level for vdsm by default?
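(Purely to illustrate the ordering, not any real engine code: with the timestamps above, the event is strictly newer than the polled snapshot, so a guard that prefers the newer data might look like the sketch below; the names are made up.)

    # Toy illustration only: compare the monotonic timestamps seen in the log.
    def event_is_newer(poll_status_time, event_notify_time):
        return event_notify_time > poll_status_time

    print(event_is_newer(4296271980, 4296273170))  # True: the event data should win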
It does look like a race to me - does it reproduce?
Destroy on the destination, which is weird and I don't understand why the Destroy is invoked.
Arik, would you like to take a look? Maybe I overlooked something or maybe there's a bug. The logs are at http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6... and the interesting things happen around 2017-12-18 03:13:43,758-05.
So it looks like this:
1. The engine polls the VMs from the source host.
2. Right after #1 we get the Down event with the proper exit reason (= migration succeeded), but the engine doesn't process it since the VM is locked by the monitoring as part of processing that polling (to prevent two analyses of the same VM running simultaneously).
3. The result of the polling is a VM in status Down and most probably exit_status=Normal.
4. The engine decides to abort the migration, and thus the monitoring thread of the source host destroys the VM on the destination host.
Unfortunately we don't have the exit_reason that is returned by the polling. However, the only option I can think of is that it is different from MigrationSucceeded, because otherwise we would have handed the VM over to the destination host rather than aborting the migration [1]. That part of the code recently changed as part of [2]. We used to hand the VM over when we got from the source host:
  status = Down + exit_status = Normal
and in the database:
  previous_status = MigrationFrom
But after that change we require:
  status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded **
and in the database:
  previous_status = MigrationFrom
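To make the condition change concrete, here is a rough sketch of the old versus new hand-over check as described above. It is written in Python purely for illustration (the real logic lives in VmAnalyzer.java [1]), and the string values are just stand-ins:

    # Sketch only -- not the real VmAnalyzer code.
    def should_hand_over_old(status, exit_status, previous_status):
        # Before [2]: Down + Normal from the source, previously MigrationFrom.
        return (status == 'Down'
                and exit_status == 'Normal'
                and previous_status == 'MigrationFrom')

    def should_hand_over_new(status, exit_status, exit_reason, previous_status):
        # After [2]: exit_reason must explicitly be MigrationSucceeded,
        # otherwise the engine aborts the migration and destroys the VM
        # on the destination host.
        return (status == 'Down'
                and exit_status == 'Normal'
                and exit_reason == 'MigrationSucceeded'
                and previous_status == 'MigrationFrom')

    # Polled stats that carry no exit_reason: old code hands over, new code aborts.
    print(should_hand_over_old('Down', 'Normal', 'MigrationFrom'))        # True
    print(should_hand_over_new('Down', 'Normal', None, 'MigrationFrom'))  # False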
Long story short, is it possible that VDSM had set the status of the VM to Down and exit_status to Normal but the exit_reason was not updated (yet?) to MigrationSucceeded?
OK, so there might be a plausible explanation: the guest drive mapping introduced a significant delay into the VM.getStats call, since it tries to update the mapping when it detects a change, which is likely to happen on lifecycle changes. In the OST case it took 1.2s to finish the whole call, and in the meantime the migration had finished. The getStats() call is not written with a possible state change in mind, so if the state moves from anything to Down in the middle of it, it returns a Down state without exitCode and exitReason, which confuses the engine. We started to use the exitReason code to differentiate the various flavors of Down in the engine in ~4.1, and in this case it results in the misleading "VM powered off by admin" case.
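A toy simulation of that failure mode, assuming nothing about the real vdsm classes (FakeVM and the sleep are stand-ins for the VM object and the slow drive-mapping update):

    import threading
    import time

    class FakeVM:
        """Toy stand-in for a vdsm VM object, not the real class."""
        def __init__(self):
            self.status = 'Migration Source'
            self.exit_info = {}  # exitCode/exitReason only filled in on completion

        def finish_migration(self):
            self.exit_info = {'exitCode': 0, 'exitReason': 'MigrationSucceeded'}
            self.status = 'Down'

    def get_stats_sketch(vm):
        exit_info = dict(vm.exit_info)   # snapshot taken while migration still runs
        time.sleep(1.2)                  # stands in for the slow drive mapping update
        stats = {'status': vm.status}    # by now the VM has gone Down...
        stats.update(exit_info)          # ...but exitCode/exitReason were never picked up
        return stats

    vm = FakeVM()
    threading.Timer(0.5, vm.finish_migration).start()
    print(get_stats_sketch(vm))          # {'status': 'Down'} -- Down without an exit reason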
We need to fix two things:
- VM.getStats() needs to handle VM state changes happening in the middle of the call;
- the guest drive mapping updates need to cleanly handle situations where the VM is either not ready yet or already gone.
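For the first item, one possible shape of the fix, again only as a sketch building on the toy FakeVM above: snapshot the status together with the exit details before any slow work, so a mid-call transition can no longer yield Down without an exit reason:

    def get_stats_fixed_sketch(vm):
        # Take one consistent snapshot of the lifecycle fields up front...
        status = vm.status
        exit_info = dict(vm.exit_info)
        # ...then do the slow work; a state change from here on cannot produce
        # a half-updated result. The drive-mapping update itself would also need
        # to tolerate a VM that is not ready yet or already gone (second item).
        time.sleep(1.2)
        stats = {'status': status}
        stats.update(exit_info)
        return stats

    vm = FakeVM()
    threading.Timer(0.5, vm.finish_migration).start()
    print(get_stats_fixed_sketch(vm))    # {'status': 'Migration Source'} -- consistent view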
Has anyone filed BZs for both? Can we ensure we do? The bottom line result sounds severe to me.
Yes, forgot to update here: https://bugzilla.redhat.com/show_bug.cgi?id=1527416
It is indeed severe (your VM gets shut down once it migrates) and should be fixed in 4.2.1.
Y.
Thanks, michal
[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vd...
[2] https://gerrit.ovirt.org/#/c/84387/
Can someone please have a look? Thanks.
As a side note, if indeed this is the cause for the failure for the job, it's misleading to say "migrate_vm: Success".
[1] http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_6...
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
participants (6)
- Arik Hadas
- Eyal Edri
- Michal Skrivanek
- Milan Zamazal
- Yaniv Kaul
- Yedidyah Bar David