Check-merged is broken since we added network test to it

Hi guys and Leon,
https://gerrit.ovirt.org/#/c/68078/ broke the check-merged job with a really tough exception:
15:23:34 sh: [17766: 1 (255)] tcsetattr: Inappropriate ioctl for device
15:23:34 Took 2586 seconds
15:23:34 Slave went offline during the build <http://jenkins.ovirt.org/computer/vm0136.workers-phx.ovirt.org/log>
15:23:34 ERROR: Connection was broken: java.io.IOException: Unexpected termination of the channel
15:23:34     at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
15:23:34 Caused by: java.io.EOFException
15:23:34     at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2353)
15:23:34     at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
15:23:34     at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
15:23:34     at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
15:23:34     at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
15:23:34     at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
15:23:34     at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
15:23:34
15:23:34 Build step 'Execute shell' marked build as failure
15:23:34 Performing Post build task...
I have no clue what causes it; we need to investigate in the tests code.
You can see it in http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/772/console...
And this is a job run from just before the patch got in, which worked well: http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/688/console
I suggest reverting this patch (and backporting the revert to the ovirt-4.1 branch as well) until we figure out what causes it.
--
Yaniv Bronhaim.

Doesn't seem related; the patch does nothing but move pieces around.
Judging by the title I guess you're referring to https://gerrit.ovirt.org/#/c/67787/ ?

On Thu, Dec 22, 2016 at 5:39 PM, Leon Goldberg <lgoldber@redhat.com> wrote:
Doesn't seem related; the patch does nothing but move pieces around.
Judging by the title I guess you're referring to https://gerrit.ovirt.org/#/c/67787/ ?
No. You can see that after that patch was merged the job still worked (check the jobs that ran after the merge), so something in https://gerrit.ovirt.org/#/c/68078/ broke it.

So it's not about the added network tests (these were added in the patch I mentioned, the one that had successful runs after being merged), and I don't see how the patch you mentioned breaks anything (as it merely moves pieces around).
I could of course be entirely wrong, and there may be something I'm missing about the patch you claim is the guilty one, but it won't be about the network tests either way.
If it's not about the network tests, then my immediate suspicion is that the problem lies somewhere else and isn't in either of the patches. I am far from able to prove this either way, though, so if reverting the patch you mentioned somehow fixes check-merged, then that is of course fine by me.

On Thu, Dec 22, 2016 at 5:14 PM, Yaniv Bronheim <ybronhei@redhat.com> wrote:
Hi guys and Leon,
https://gerrit.ovirt.org/#/c/68078/ broke the check-merged job with really tough exception:
15:23:34 sh: [17766: 1 (255)] tcsetattr: Inappropriate ioctl for device
15:23:34 Took 2586 seconds
15:23:34 Slave went offline during the build
15:23:34 ERROR: Connection was broken: java.io.IOException: Unexpected
Maybe the problem is a few lines above this. Could Jenkins simply time out?
Ran 44 tests in 1752.270s
OK
+ return 0
sh: [9520: 1 (255)] tcsetattr: Inappropriate ioctl for device
Anyway, a revert is available at https://gerrit.ovirt.org/#/c/69078/ and I'm trying to see if it helps.

https://gerrit.ovirt.org/#/c/68078/ broke the check-merged job with really tough exception:
15:23:34 sh: [17766: 1 (255)] tcsetattr: Inappropriate ioctl for device
15:23:34 Took 2586 seconds
15:23:34 Slave went offline during the build
This is what I was talking about: something is disconnecting the slave. (But the disconnection is very brief; a few seconds later the slave reconnects. See the logs at [1].)
15:23:34 ERROR: Connection was broken: java.io.IOException: Unexpected
Maybe the problem is a few lines above this. Could Jenkins simply time out?
Ran 44 tests in 1752.270s
OK
+ return 0
sh: [9520: 1 (255)] tcsetattr: Inappropriate ioctl for device
This is not what a Jenkins timeout looks like. The whole job took 53m and 20s. The timeout is set to 360m [2]...
[1]: https://ovirt-jira.atlassian.net/browse/OVIRT-938
[2]: https://gerrit.ovirt.org/gitweb?p=jenkins.git;a=blob;f=jobs/confs/projects/d...
--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/
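The timeout argument above can be checked mechanically. The following is a minimal Python sketch, not part of the CI code: it reads the "Took N seconds" line from a console log and compares it with the limit. The function names are hypothetical, and the 360-minute default is simply the figure quoted in this thread, not a value read from the real job configuration.

```python
import re

# Assumption: the limit quoted in this thread (360 minutes),
# hard-coded here rather than read from the real job config.
TIMEOUT_MINUTES = 360


def took_seconds(console_text):
    """Return N from the first 'Took N seconds' line, or None if absent."""
    match = re.search(r"Took (\d+) seconds", console_text)
    return int(match.group(1)) if match else None


def hit_timeout(elapsed_seconds, timeout_minutes=TIMEOUT_MINUTES):
    """True if the elapsed time reached the configured timeout."""
    return elapsed_seconds >= timeout_minutes * 60


# Log excerpt copied from the first message in this thread.
log = (
    "15:23:34 sh: [17766: 1 (255)] tcsetattr: Inappropriate ioctl for device\n"
    "15:23:34 Took 2586 seconds\n"
    "15:23:34 Slave went offline during the build\n"
)

elapsed = took_seconds(log)
print(elapsed, hit_timeout(elapsed))
```

Both the 2586 seconds reported in the log and the 53m 20s quoted for the whole job are far below 360 minutes, so under these assumptions the check returns False in either case.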

On Thu, Dec 22, 2016 at 5:45 PM, Yaniv Bronheim <ybronhei@redhat.com> wrote:
On Thu, Dec 22, 2016 at 5:39 PM, Leon Goldberg <lgoldber@redhat.com> wrote:
Doesn't seem related; the patch does nothing but move pieces around.
Judging by the title I guess you're referring to https://gerrit.ovirt.org/#/c/67787/ ?
no.. you can see that after this patch it used to work (check the jobs after the merge) so something in https://gerrit.ovirt.org/#/c/68078/ broke it
"post hoc ergo propter hoc" is a fallacy. We had a horrible week CI-wise; check-merged kept failing due to a "Groovy thread" exception, which is an internal Jenkins thingy.
Now that's over, but a recent test failed on
http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/778/console
11:55:54 RuntimeError: Failed to run reposync 3 times for repoid: ovirt-master-snapshot-static-el7, aborting.
It is wrong to pick up on the first patch that happened to see the "Groovy thread" exception.
Regards,
Dan.

Yaniv,
Maybe you can look at http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/791/console...
Unlike the others, this one was actually run by CI, and it failed to start vdsm on the lago host:
"A dependency job for vdsmd.service failed. See 'journalctl -xe' for details."
The cause might be the beautifully-numbered
Bug 1400003 - imageio fails during system startup
but that's just a guess. Yaniv, can you extract the journal from the VM to take a look?

On Dec 23, 2016 12:46 PM, "Dan Kenigsberg" <danken@redhat.com> wrote:
On Thu, Dec 22, 2016 at 5:45 PM, Yaniv Bronheim <ybronhei@redhat.com> wrote:
On Thu, Dec 22, 2016 at 5:39 PM, Leon Goldberg <lgoldber@redhat.com> wrote:
Doesn't seem related; the patch does nothing but move pieces around.
Judging by the title I guess you're referring to https://gerrit.ovirt.org/#/c/67787/ ?
no.. you can see that after this patch it used to work (check the jobs after the merge) so something in https://gerrit.ovirt.org/#/c/68078/ broke it
"post hoc ergo propter hoc" is a fallacy. We had a horrible week CI-wise; check-merged kept failing due to "Groovy thread" exception which is an internal Jenkins thingy.
Please give the CI team the credit of being able to eliminate the Jenkins- and environment-related issues. We have runs that pre-date this week's issues and still show that behaviour, for example:
http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/692/console
The fact that the job crashes during the Groovy code is not an indicator that the issue is in that code. That exact same code runs for every single job in the system without failing, which means that something in this job probably creates conditions that prevent it from running.
Actually, we know more than that. Looking carefully at the job logs, you can see that Groovy fails to run because the slave disconnects just before Jenkins tries to run it (by now you should be familiar with OVIRT-938).
Again, all the CI code that runs before, after and during this job is also used in other jobs that do not fail in the same manner. With this I went to Yaniv to ask him to help figure out whether the check_merged.sh code may be doing anything that could cause this. Our conversation led to this email.
Trying to wave this off as a "Groovy failure" at this point is not helping. The CI team is not going to come up with a magic solution to this one without your help.
Now that's over, but a recent test failed on
http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/778/console
11:55:54 RuntimeError: Failed to run reposync 3 times for repoid: ovirt-master-snapshot-static-el7, aborting.
It is wrong to pick up on the first patch that happened to see the "Groovy thread" exception.
It is not wrong when we know all the reasons for the other failures and can eliminate them, like here:
http://jenkins.ovirt.org/job/vdsm_master_check-merged-el7-x86_64/777/
And while we do see one of the VDSM tests fail there, I hardly think it is the reason behind the later slave disconnection.
The reposync error you are quoting is the result of a package file being updated without its version or revision number being updated. This effectively poisoned all our local YUM caches and prevented all jobs running Lago from doing anything interesting on Thursday, until I managed to clean all the failed caches. Then again, jobs that fail there do not cause the slave disconnection, so I saw no point in citing them here.
We (the CI team) do our best to eliminate false negatives in the system; please do not take the easy path of pointing to those false negatives every time we reach out to you. Give us the benefit of the doubt and believe that we were diligent enough to eliminate such issues before approaching you.
--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/
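One way to keep the unrelated failures apart when comparing runs is to tag each console log by the error strings quoted in this thread. This is only a rough Python sketch under that assumption: the labels and the classify function are hypothetical, and the substrings are copied from the log excerpts above rather than taken from any CI tooling.

```python
# Hypothetical labels mapped to substrings quoted from the console logs
# in this thread; extend the mapping as new failure modes show up.
SIGNATURES = {
    "slave-disconnect": "Slave went offline during the build",
    "reposync": "Failed to run reposync",
    "vdsmd-start": "A dependency job for vdsmd.service failed",
}


def classify(console_text):
    """Return the sorted labels whose signature appears in the log text."""
    return sorted(
        label for label, needle in SIGNATURES.items() if needle in console_text
    )


# Example: the reposync failure quoted from job 778.
sample = (
    "11:55:54 RuntimeError: Failed to run reposync 3 times for repoid: "
    "ovirt-master-snapshot-static-el7, aborting."
)
print(classify(sample))
```

A run tagged only "reposync" or "vdsmd-start" can then be set aside when looking for the cause of the slave disconnection, which is the separation argued for above.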
participants (4)
- Barak Korren
- Dan Kenigsberg
- Leon Goldberg
- Yaniv Bronheim