suspend_resume_vm fail on master experimental

Hi all, *test-repo_ovirt_experimental_master* (link to Jenkins <http://jenkins.ovirt.org/view/experimental%20jobs/job/test-repo_ovirt_experimental_master/4648/artifact/exported-artifacts/basic_suite_master.sh-el7/exported-artifacts/>) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is *VDSM*.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
I've attached the full VDSM logs here in a zip file.
Can anyone please assist?
Thanks, -- *Daniel Belenky* *RHV DevOps* *Red Hat Israel*

On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
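Since the peer here is local ('::1'), one way to see which local process actually owns connections to vdsm at a given moment is to check the sockets on the host. A minimal sketch, assuming the default vdsm JSON-RPC port 54321 and iproute2's ss:
# Show established connections on the vdsm port and the owning processes;
# a '::1' peer points at a local client such as MOM or a CLI tool.
ss -tnp state established '( sport = :54321 or dport = :54321 )' | grep '::1'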
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
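To confirm whether vdsm is listening at all at that point (rather than inferring it only from the log), something along these lines can be run on the host - a sketch, again assuming the default port 54321 and a systemd-based host:
# Is anything listening on the vdsm port, and what does systemd think of the service?
ss -ltnp | grep 54321
systemctl status vdsmd --no-pager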
I've attached the full VDSM logs here in a zip file.
Can anyone please assist?
Thanks, -- Daniel Belenky RHV DevOps Red Hat Israel
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

Adding Tomas from Virt. On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
I've attached the full VDSM logs here in a zip file.
Can anyone please assist?
Thanks, -- Daniel Belenky RHV DevOps Red Hat Israel
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel
-- Eyal Edri Associate Manager RHV DevOps EMEA ENG Virtualization R&D Red Hat Israel phone: +972-9-7692018 irc: eedri (on #tlv #rhev-dev #rhev-integ)

Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log. Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping()
2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
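A quick way to line up that startup race from the logs themselves (the paths are the usual oVirt locations on the host, so treat them as an assumption):
# When did MOM fail and then recover, and was vdsm logging anything at that time?
grep -n "Cannot connect to VDSM" /var/log/vdsm/mom.log
grep -n "mom.RPCServer - INFO" /var/log/vdsm/mom.log | head -5
grep -c "2017-01-09 16:43" /var/log/vdsm/vdsm.log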

On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log.
Please use my live system: ssh mini@ykaul-mini.tlv.redhat.com (qum5net) then run a console to the VM: lagocli --prefix-path /dev/shm/run/current shell engine (or 'host0' for the host)
Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
It's not.
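If the SIGTERM is unexpected, the host journal should show who stopped the service. A hedged sketch (assumes systemd/journald on the lago host, the vdsmd unit name, and uses the time window from the attached logs):
# Was vdsmd stopped by systemd (host reinstall, soft-fencing, etc.) in that window?
journalctl -u vdsmd --since "2017-01-09 16:40:00" --until "2017-01-09 16:50:00" --no-pager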
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping()
2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
But there are multiple other disconnections, without anything in mom log:
[root@lago-basic-suite-master-host0 vdsm]# grep SSL /var/log/vdsm/vdsm.log |grep "::1"
2017-01-11 02:29:46,310 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49046, 0, 0) at 0x39bf1b8>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,089 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49048, 0, 0) at 0x39d1ea8>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,392 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49050, 0, 0) at 0x39d1710>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,700 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49052, 0, 0) at 0x39d1128>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:52,008 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49054, 0, 0) at 0x39d1128>: unexpected eof (betterAsyncore:119)
2017-01-11 02:37:34,019 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49032, 0, 0) at 0x380cef0>: unexpected eof (betterAsyncore:119)
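To double-check that those disconnects really have no counterpart on the MOM side, the timestamps above can be cross-checked directly against mom.log (the path is the usual one under /var/log/vdsm, so an assumption):
# Anything logged by MOM within the same second as each SSL EOF?
for t in 02:29:46 02:29:51 02:29:52 02:37:34; do echo "== $t =="; grep "2017-01-11 $t" /var/log/vdsm/mom.log; done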
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On Wed, Jan 11, 2017 at 9:29 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log.
Please use my live system: ssh mini@ykaul-mini.tlv.redhat.com (qum5net) then run a console to the VM: lagocli --prefix-path /dev/shm/run/current shell engine
(or 'host0' for the host)
Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
It's not.
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping()
2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
But there are multiple other disconnections, without anything in mom log:
[root@lago-basic-suite-master-host0 vdsm]# grep SSL /var/log/vdsm/vdsm.log |grep "::1"
2017-01-11 02:29:46,310 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49046, 0, 0) at 0x39bf1b8>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,089 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49048, 0, 0) at 0x39d1ea8>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,392 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49050, 0, 0) at 0x39d1710>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,700 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49052, 0, 0) at 0x39d1128>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:52,008 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49054, 0, 0) at 0x39d1128>: unexpected eof (betterAsyncore:119)
2017-01-11 02:37:34,019 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49032, 0, 0) at 0x380cef0>: unexpected eof (betterAsyncore:119)
If there is no hosted engine, this could only be MOM. This is an indication of connection closure.
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On Wed, Jan 11, 2017 at 10:53 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Wed, Jan 11, 2017 at 9:29 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log.
Please use my live system: ssh mini@ykaul-mini.tlv.redhat.com (qum5net) then run a console to the VM: lagocli --prefix-path /dev/shm/run/current shell engine
(or 'host0' for the host)
Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
It's not.
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping()
2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
But there are multiple other disconnections, without anything in mom log:
[root@lago-basic-suite-master-host0 vdsm]# grep SSL /var/log/vdsm/vdsm.log |grep "::1"
2017-01-11 02:29:46,310 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49046, 0, 0) at 0x39bf1b8>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,089 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49048, 0, 0) at 0x39d1ea8>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,392 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49050, 0, 0) at 0x39d1710>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:51,700 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49052, 0, 0) at 0x39d1128>: unexpected eof (betterAsyncore:119)
2017-01-11 02:29:52,008 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49054, 0, 0) at 0x39d1128>: unexpected eof (betterAsyncore:119)
2017-01-11 02:37:34,019 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 49032, 0, 0) at 0x380cef0>: unexpected eof (betterAsyncore:119)
If there is no hosted engine, this could only be MOM. This is an indication of connection closure.
There is no hosted-engine, although the services are installed (since we've installed the Cockpit stuff) - but not working. Y.
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel

On 01/11/2017 09:29 AM, Yaniv Kaul wrote:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log.
Please use my live system: ssh mini@ykaul-mini.tlv.redhat.com (redacted) then run a console to the VM: lagocli --prefix-path /dev/shm/run/current shell engine
(or 'host0' for the host)
Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
It's not.
I fooled myself. We have two hosts here. host0 was holding the VM up until the suspend. Then Engine decided to resume the VM on the other one, host1. While the VM was resuming, host0 experienced network issues, which led to soft-fencing. That explains the mess on host0, even though it is still unclear why host0 had network issues and heartbeat exceeded in the first place.
On host1 the waters are even darker. The VM was resumed ~02:36
2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand, return: RestoringState, log id: 378da701
2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-01-11 02:36:04,704-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID: 72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb, Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by admin@internal-authz (Host: lago-basic-suite-master-host1).
While well within the timeout limit (600s), the vm was still restoring its state:
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [5582058d] START, GetAllVmStatsVDSCommand(HostName = lago-basic-suite-master-host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (DefaultQuartzScheduler8) [5582058d] SEND destination:jms.topic.vdsm_requests reply-to:jms.topic.vdsm_responses content-length:103
{"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{},"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler8) [5582058d] Message sent: SEND destination:jms.topic.vdsm_requests content-length:103 reply-to:jms.topic.vdsm_responses
<JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method: Host.getAllVmStats, params: {}>
2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (SSL Stomp Reactor) [5e453618] MESSAGE content-length:829 destination:jms.topic.vdsm_responses content-type:application/json subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
{"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}^@
2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}
Then a migration attempt, which fails
2017-01-11 02:46:07,826-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
2017-01-11 02:46:07,827-05 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action 'MigrateVmToServer' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
2017-01-11 02:46:07,827-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
which makes sense, host0 was being reinstalled (!?) meantime.
After that I see no more evidence of 'vm0' in the logs. I couldn't yet cross-correlate with the tests, perhaps the test failed meantime?
TL;DR: we need a reproducer to keep investigating this, so far I've seen quite some suspicious activity but nothing really concrete.
PS: last few lago failures I've seen have been triggered by completely unrelated patches, so far I think they are sporadic failures
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani

On Wed, Jan 11, 2017 at 12:42 PM, Francesco Romani <fromani@redhat.com> wrote:
On 01/11/2017 09:29 AM, Yaniv Kaul wrote:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log.
Please use my live system: ssh mini@ykaul-mini.tlv.redhat.com (redacted) then run a console to the VM: lagocli --prefix-path /dev/shm/run/current shell engine
(or 'host0' for the host)
Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
It's not.
I fooled myself. We have two hosts here. host0 was holding the VM up until the suspend. Then Engine decided to resume the VM on the other one, host1. While the VM was resuming, host0 experienced network issues, which led to soft-fencing. That explains the mess on host0, even though it is still unclear why host0 had network issues and heartbeat exceeded in the first place.
On host1 the waters are even darker. The VM was resumed ~02:36
2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] FINISH, CreateVmVDSCommand, return: RestoringState, log id: 378da701
2017-01-11 02:36:04,700-05 INFO [org.ovirt.engine.core.bll.RunVmCommand] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-01-11 02:36:04,704-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-17) [72c41f12-649b-4833-8485-44e8d20d2b49] Correlation ID: 72c41f12-649b-4833-8485-44e8d20d2b49, Job ID: a93b571e-aed1-40e7-8d71-831f646255fb, Call Stack: null, Custom Event ID: -1, Message: VM vm0 was started by admin@internal-authz (Host: lago-basic-suite-master-host1).
While well within the timeout limit (600s), the vm was still restoring its state:
This is an issue I (wrongfully, privately) complained about to Michal - resume takes a lot more time than it used to - the 'lzop' process decompressing the memory is taking ages to complete. This is reproducible, regardless of the other issues. I did not open a bug about it :( Y.
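The lzop step can be timed in isolation to separate decompression cost from the rest of the resume flow - a rough sketch, with a placeholder path since the saved memory volume's location depends on the storage domain layout:
# How long does decompressing the saved memory image alone take?
time lzop -dc /rhev/data-center/.../memory-volume > /dev/null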
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler8) [5582058d] START, GetAllVmStatsVDSCommand(HostName = lago-basic-suite-master-host1, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='0336661b-721f-4c55-9327-8fd2fd3a0542'}), log id: 1803c51b
2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (DefaultQuartzScheduler8) [5582058d] SEND destination:jms.topic.vdsm_requests reply-to:jms.topic.vdsm_responses content-length:103
{"jsonrpc":"2.0","method":"Host.getAllVmStats","params":{ },"id":"f2997c1d-49cc-4d2e-937c-e4910fbb75df"}^@ 2017-01-11 02:37:31,059-05 DEBUG [org.ovirt.vdsm.jsonrpc. client.reactors.stomp.StompCommonClient] (DefaultQuartzScheduler8) [5582058d] Message sent: SEND destination:jms.topic.vdsm_requests content-length:103 reply-to:jms.topic.vdsm_responses
<JsonRpcRequest id: "f2997c1d-49cc-4d2e-937c-e4910fbb75df", method: Host.getAllVmStats, params: {}>
2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (SSL Stomp Reactor) [5e453618] MESSAGE content-length:829 destination:jms.topic.vdsm_responses content-type:application/json subscription:fe930de2-aa67-4bc4-a34c-be22edd1623e
{"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}^@ 2017-01-11 02:37:31,062-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "f2997c1d-49cc-4d2e-937c-e4910fbb75df", "result": [{"username": "Unknown", "displayInfo": [{"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "vnc", "port": "-1"}, {"tlsPort": "-1", "ipAddress": "192.168.201.4", "type": "spice", "port": "-1"}], "hash": "-4942054084956770103", "acpiEnable": "true", "displayIp": "192.168.201.4", "guestFQDN": "", "vmId": "42860011-acc3-44d6-9ddf-dea642caf083", "pid": "0", "timeOffset": "0", "session": "Unknown", "displaySecurePort": "-1", "displayPort": "-1", "memUsage": "0", "guestIPs": "", "kvmEnable": "true", "displayType": "vnc", "cpuUser": "0.00", "monitorResponse": "0", "elapsedTime": "389", "vmType": "kvm", "cpuSys": "0.00", "status": "Restoring state", "guestCPUCount": -1, "appsList": [], "clientIp": "", "statusTime": "4295562220", "vmName": "vm0"}]}
Then a migration attempt, which fails
2017-01-11 02:46:07,826-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock Acquired to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
2017-01-11 02:46:07,827-05 WARN [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Validation of action 'MigrateVmToServer' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__HOST_STATUS__UP,ACTION_TYPE_FAILED_VDS_STATUS_ILLEGAL
2017-01-11 02:46:07,827-05 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-4) [27938262-f5fe-41ff-a662-5177a3a25ada] Lock freed to object 'EngineLock:{exclusiveLocks='[42860011-acc3-44d6-9ddf-dea642caf083=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName vm0>]', sharedLocks='null'}'
which makes sense, host0 was being reinstalled (!?) meantime.
After that I see no more evidence of 'vm0' in the logs. I couldn't yet cross-correlate with the tests, perhaps the test failed meantime?
TL;DR: we need a reproducer to keep investigating this, so far I've seen quite some suspicious activity but nothing really concrete.
PS: last few lago failures I've seen have been triggered by completely unrelated patches, so far I think they are sporadic failures
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani

Link to Jenkins <http://jenkins.ovirt.org/view/experimental%20jobs/job/test-repo_ovirt_experimental_master/4648/artifact/exported-artifacts/basic_suite_master.sh-el7/exported-artifacts/>
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log. Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping()
2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
-- *Daniel Belenky* *RHV DevOps* *Red Hat Israel*

Thanks, sorry, I was silly enough to have missed it before.
On 01/11/2017 09:32 AM, Daniel Belenky wrote:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, I see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me the jenkins link? I see something suspicious on the Vdsm log. Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
This issue means that the client closed connection while vdsm was replying. It can happen at any time when the client is not nice with the connection. As you can see the client connected locally '::1'.
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping()
2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
-- Daniel Belenky RHV DevOps Red Hat Israel
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani

I just ran ovirt-system-tests on two very different machines. It passed on one of them, while it failed on the other one, at a different place:
@ Run test: 005_network_by_label.py:
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
# assign_hosts_network_label: Error while running thread
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lago/utils.py", line 55, in _ret_via_queue
    queue.put({'return': func()})
  File "/var/local/lago/ovirt-system-tests/basic-suite-master/test-scenarios/005_network_by_label.py", line 56, in _assign_host_network_label
    host_nic=nic
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/brokers.py", line 16231, in add
    headers={"Correlation-Id":correlation_id, "Expect":expect}
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 79, in add
    return self.request('POST', url, body, headers, cls=cls)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 122, in request
    persistent_auth=self.__persistent_auth
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 79, in do_request
    persistent_auth)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 162, in __do_request
    raise errors.RequestError(response_code, response_reason, response_body)
RequestError: status: 409
reason: Conflict
detail: Cannot add Label. Operation can be performed only when Host status is Maintenance, Up, NonOperational.
I can also see occasional errors like the following in vdsm.log:
ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:192.168.201.3', 47434, 0, 0) at 0x271fd88>: (104, 'Connection reset by peer') (betterAsyncore:119)
So we are probably dealing with an error that occurs "randomly" and is not related to a particular test.
Daniel Belenky <dbelenky@redhat.com> writes:
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, i see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me of the Jenkins link? I see something suspicious in the Vdsm log. Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
This issue means that the client closed the connection while vdsm was replying. It can happen at any time when the client does not handle the connection gracefully. As you can see, the client connected locally ('::1').
Also, when looking at the MOM logs, I see the following:
2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping() 2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani

On Wed, Jan 11, 2017 at 12:49 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
I just ran ovirt-system-tests on two very different machines. It passed on one of them, while it failed on the other one, at a different place:
@ Run test: 005_network_by_label.py:
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
  # assign_hosts_network_label: Error while running thread
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lago/utils.py", line 55, in _ret_via_queue
    queue.put({'return': func()})
  File "/var/local/lago/ovirt-system-tests/basic-suite-master/test-scenarios/005_network_by_label.py", line 56, in _assign_host_network_label
    host_nic=nic
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/brokers.py", line 16231, in add
    headers={"Correlation-Id":correlation_id, "Expect":expect}
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 79, in add
    return self.request('POST', url, body, headers, cls=cls)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 122, in request
    persistent_auth=self.__persistent_auth
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 79, in do_request
    persistent_auth)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 162, in __do_request
    raise errors.RequestError(response_code, response_reason, response_body)
RequestError:
status: 409
reason: Conflict
detail: Cannot add Label. Operation can be performed only when Host status is Maintenance, Up, NonOperational.
This is an issue we've seen from time to time and have not figured it out yet. Do you have engine logs for it?
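For reference, the 409 above means the label was added while the host was not yet in one of the allowed states (Maintenance, Up, NonOperational), so one workaround on the test side is to poll the host status first. A minimal, hypothetical sketch - not the actual OST code, and the SDK attribute names in the trailing comment are assumptions:

import time

def wait_for(condition, timeout=300, interval=3):
    """Poll a no-argument callable until it returns True or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Hypothetical usage with the v3 SDK objects from the traceback above
# (attribute names are assumptions, not verified against that SDK release):
#   if wait_for(lambda: api.hosts.get(host_name).status.state == 'up'):
#       nic.labels.add(params.Label(id='my_label'))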
I can also see occasional errors like the following in vdsm.log:
ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:192.168.201.3', 47434, 0, 0) at 0x271fd88>: (104, 'Connection reset by peer') (betterAsyncore:119)
This is the core issue of today's failure - but it is probably unrelated to the issue you've just described, which we have seen happening from time to time in the past (I'd say I last saw it ~2 weeks ago, but it's not easily reproducible for me). Y.
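For what it's worth, "connection reset by peer" (errno 104) is what the receiving side sees when the peer aborts the connection instead of closing it cleanly. A tiny plain-TCP demonstration of that - illustrative only, no TLS and nothing oVirt-specific:

import socket
import struct
import threading
import time

srv = socket.socket()
srv.bind(("127.0.0.1", 0))          # ephemeral port
srv.listen(1)
port = srv.getsockname()[1]

def server():
    conn, _ = srv.accept()
    try:
        conn.recv(4096)             # raises ConnectionResetError (errno 104)
    except ConnectionResetError as exc:
        print("server saw:", exc)
    finally:
        conn.close()

t = threading.Thread(target=server)
t.start()

cli = socket.create_connection(("127.0.0.1", port))
time.sleep(0.3)                     # let the server accept first
# SO_LINGER with a zero timeout makes close() abortive: the kernel sends
# RST instead of the normal FIN handshake, which the peer then reports as
# "connection reset by peer".
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
cli.close()

t.join()
srv.close()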
So we are probably dealing with an error that occurs "randomly" and is not related to a particular test.
Daniel Belenky <dbelenky@redhat.com> writes:
Link to Jenkins <http://jenkins.ovirt.org/view/experimental%20jobs/job/test-repo_ovirt_experimental_master/4648/artifact/exported-artifacts/basic_suite_master.sh-el7/exported-artifacts/>
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, i see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me of the Jenkins link? I see something suspicious in the Vdsm log. Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
This issue means that the client closed the connection while vdsm was replying. It can happen at any time when the client does not handle the connection gracefully. As you can see, the client connected locally ('::1').
Also, when looking at the MOM logs, I see the following:

2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping() 2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani

Yaniv Kaul <ykaul@redhat.com> writes:
On Wed, Jan 11, 2017 at 12:49 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
I just ran ovirt-system-tests on two very different machines. It passed on one of them, while it failed on the other one, at a different place:
@ Run test: 005_network_by_label.py:
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
  # assign_hosts_network_label: Error while running thread
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/lago/utils.py", line 55, in _ret_via_queue
    queue.put({'return': func()})
  File "/var/local/lago/ovirt-system-tests/basic-suite-master/test-scenarios/005_network_by_label.py", line 56, in _assign_host_network_label
    host_nic=nic
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/brokers.py", line 16231, in add
    headers={"Correlation-Id":correlation_id, "Expect":expect}
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 79, in add
    return self.request('POST', url, body, headers, cls=cls)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/proxy.py", line 122, in request
    persistent_auth=self.__persistent_auth
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 79, in do_request
    persistent_auth)
  File "/usr/lib/python2.7/site-packages/ovirtsdk/infrastructure/connectionspool.py", line 162, in __do_request
    raise errors.RequestError(response_code, response_reason, response_body)
RequestError:
status: 409
reason: Conflict
detail: Cannot add Label. Operation can be performed only when Host status is Maintenance, Up, NonOperational.
This is an issue we've seen from time to time and have not figured it out yet. Do you have engine logs for it?
Yes, I still have that test run instance available. Here's an excerpt; I'll send you the complete logs off-list (they are large):
I can also see occasional errors like the following in vdsm.log:
ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:192.168.201.3', 47434, 0, 0) at 0x271fd88>: (104, 'Connection reset by peer') (betterAsyncore:119)
This is the core issue of today's failure - but it is probably unrelated to the issue you've just described, which we have seen happening from time to time in the past (I'd say I last saw it ~2 weeks ago, but it's not easily reproducible for me). Y.
So we are probably dealing with an error that occurs "randomly" and is not related to a particular test.
Daniel Belenky <dbelenky@redhat.com> writes:
Link to Jenkins <http://jenkins.ovirt.org/view/experimental%20jobs/job/test-repo_ovirt_experimental_master/4648/artifact/exported-artifacts/basic_suite_master.sh-el7/exported-artifacts/>
On Wed, Jan 11, 2017 at 10:26 AM, Francesco Romani <fromani@redhat.com> wrote:
Hi all
On 01/11/2017 08:52 AM, Eyal Edri wrote:
Adding Tomas from Virt.
On Tue, Jan 10, 2017 at 10:54 AM, Piotr Kliczewski < piotr.kliczewski@gmail.com> wrote:
On Tue, Jan 10, 2017 at 9:29 AM, Daniel Belenky <dbelenky@redhat.com> wrote:
Hi all,
test-repo_ovirt_experimental_master (link to Jenkins) job failed on basic_sanity scenario. The job was triggered by https://gerrit.ovirt.org/#/c/69845/
From looking at the logs, it seems that the reason is VDSM.
In the VDSM log, i see the following error:
2017-01-09 16:47:41,331 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 34942, 0, 0) at 0x36b95f0>: unexpected eof (betterAsyncore:119)
Daniel, could you please remind me of the Jenkins link? I see something suspicious in the Vdsm log. Most notably, Vdsm received SIGTERM. Is this expected and part of the test?
This issue means that the client closed the connection while vdsm was replying. It can happen at any time when the client does not handle the connection gracefully. As you can see, the client connected locally ('::1').
Also, when looking at the MOM logs, I see the following:

2017-01-09 16:43:39,508 - mom.vdsmInterface - ERROR - Cannot connect to VDSM! [Errno 111] Connection refused
Looking at the log at this time vdsm had no open socket.
Correct, but IIRC we have a race on startup - that's the reason why MOM retries to connect. After the new try, MOM seems to behave correctly:
2017-01-09 16:44:05,672 - mom.RPCServer - INFO - ping() 2017-01-09 16:44:05,673 - mom.RPCServer - INFO - getStatistics()
-- Francesco Romani Red Hat Engineering Virtualization R & D IRC: fromani
participants (6)

- Daniel Belenky
- Eyal Edri
- Francesco Romani
- Milan Zamazal
- Piotr Kliczewski
- Yaniv Kaul