

From: "Francesco Romani" <fromani@redhat.com> To: infra@ovirt.org Sent: Wednesday, April 23, 2014 5:48:46 PM Subject: Intermittent Jenkins crashes
Hi infra
Recently tests started to fail quite randomly due to the python interpreter crashing.
For example (but many others are like this) http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8370/console
LibvirtModuleConfigureTests testLibvirtConfigureToSSLFalse ../tests/run_tests_local.sh: line 10: 31835 Segmentation fault PYTHONDONTWRITEBYTECODE=1 LC_ALL=C PYTHONPATH="../lib:../vdsm:../client:../vdsm_api:$PYTHONPATH" "$PYTHON_EXE" ../tests/testrunner.py --local-modules $@
Quite often, re-running the same tests using the jenkins manual trigger or uploading a new version of the affected patch seems to somehow fix the crash.
I have ssh access to the affected box, so I did some more investigation.
The main, if not only, lead those crashes leave behind is a laconic:
[8855948.327687] python[10418]: segfault at 1 ip 00000036f2c88637 sp 00007fffda3c3a60 error 4 in libpython2.7.so.1.0[36f2c00000+178000]
the error code sometimes varies, the addresses do not. So, I followed http://enki-tech.blogspot.it/2012/08/debugging-c-part-3-dmesg.html
and found the following:
[root@jenkins-slave-vm02 ~]# ./getcrash.sh '[9141800.034517] python[11612]: segfault at 1 ip 00000036f2c88637 sp 00007fffe1127c50 error 4 in libpython2.7.so.1.0[36f2c00000+178000]' Segmentation fault in libpython2.7.so.1.0 at: 0x88637. [root@jenkins-slave-vm02 ~]# gdb /usr/lib64/libpython2.7.so.1.0 GNU gdb (GDB) Fedora 7.6.50.20130731-19.fc20 [...] Reading symbols from /usr/lib64/libpython2.7.so.1.0...Reading symbols from /usr/lib/debug/usr/lib64/libpython2.7.so.1.0.debug...done. done. (gdb) disass 0x88637 No function contains specified address. (gdb)
(getcrash.sh is a copy of the script presented in the page linked above)
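For reference, the decoding such a script performs can be sketched as follows; this is a reconstruction from the dmesg line format, not the actual getcrash.sh, and the function name is made up. The faulting offset inside the library is simply the instruction pointer minus the mapping base printed between the brackets:

```shell
#!/bin/sh
# Hypothetical re-implementation of the getcrash.sh idea (not the real script).
# A dmesg segfault line ends with "lib[BASE+SIZE]"; the instruction pointer
# (ip) is absolute, so the offset inside the library is ip - BASE.
decode_segfault() {
    line=$1
    ip=$(printf '%s\n' "$line" | sed -n 's/.* ip \([0-9a-f]*\) .*/\1/p')
    base=$(printf '%s\n' "$line" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*\]$/\1/p')
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

decode_segfault 'python[10418]: segfault at 1 ip 00000036f2c88637 sp 00007fffda3c3a60 error 4 in libpython2.7.so.1.0[36f2c00000+178000]'
# prints: 0x88637, the offset gdb was then asked to disassemble
```

sed plus shell arithmetic keeps it dependency-free; both bash and dash accept hex constants inside $(( )).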
The only sense I can make of all of the above is a faulty RAM bank, but this is little more than a wild guess.
Any suggestion on how to go further?
Thanks,
-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani _______________________________________________ Infra mailing list Infra@ovirt.org http://lists.ovirt.org/mailman/listinfo/infra
-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani

On Wed 23 Apr 2014 05:51:23 PM CEST, Francesco Romani wrote:
Sorry, forgot to add.
By "main, if not only lead" I mean:
* /var/crash is empty
* abrt-cli list yields nothing relevant
Bests,
----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: infra@ovirt.org Sent: Wednesday, April 23, 2014 5:48:46 PM Subject: Intermittent Jenkins crashes
Hi infra
Recently tests started to fail quite randomly due to the python interpreter crashing.
For example (but many others are like this) http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8370/console
LibvirtModuleConfigureTests testLibvirtConfigureToSSLFalse ../tests/run_tests_local.sh: line 10: 31835 Segmentation fault PYTHONDONTWRITEBYTECODE=1 LC_ALL=C PYTHONPATH="../lib:../vdsm:../client:../vdsm_api:$PYTHONPATH" "$PYTHON_EXE" ../tests/testrunner.py --local-modules $@
Quite often, re-running the same tests using the jenkins manual trigger or uploading a new version of the affected patch seems to somehow fix the crash.
I have ssh access to the affected box, so I did some more investigation.
The main, if not only, lead those crashes leave behind is a laconic:
[8855948.327687] python[10418]: segfault at 1 ip 00000036f2c88637 sp 00007fffda3c3a60 error 4 in libpython2.7.so.1.0[36f2c00000+178000]
the error code sometimes varies, the addresses do not. So, I followed http://enki-tech.blogspot.it/2012/08/debugging-c-part-3-dmesg.html
and found the following:
[root@jenkins-slave-vm02 ~]# ./getcrash.sh '[9141800.034517] python[11612]: segfault at 1 ip 00000036f2c88637 sp 00007fffe1127c50 error 4 in libpython2.7.so.1.0[36f2c00000+178000]' Segmentation fault in libpython2.7.so.1.0 at: 0x88637. [root@jenkins-slave-vm02 ~]# gdb /usr/lib64/libpython2.7.so.1.0 GNU gdb (GDB) Fedora 7.6.50.20130731-19.fc20 [...] Reading symbols from /usr/lib64/libpython2.7.so.1.0...Reading symbols from /usr/lib/debug/usr/lib64/libpython2.7.so.1.0.debug...done. done. (gdb) disass 0x88637 No function contains specified address. (gdb)
(getcrash.sh is a copy of the script presented in the page linked above)
The only sense I can make of all of the above is a faulty RAM bank, but this is little more than a wild guess.
Any suggestion on how to go further?
Thanks,
-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani _______________________________________________ Infra mailing list Infra@ovirt.org http://lists.ovirt.org/mailman/listinfo/infra
Let's try to see if it's a problem that only affects one slave, one python version, one distribution or fails anywhere. If it only affects one slave, we might just reprovision it (the one you pointed out is a vm). If it's related to a package version we can try to upgrade it, or downgrade it, or fix it (in the best case).

If it's anything else it will be more complicated to fix, and we will have to look deeper (try to reproduce manually, add traces, maybe as you say it's an issue on the RAM, but being a vm, we might expect it failing also on the host).

I started running it only on f19 slaves, to see if it happens, I'll check f20 slaves after.

-- David Caro Red Hat S.L. Continuous Integration Engineer - EMEA ENG Virtualization R&D Email: dcaro@redhat.com Web: www.redhat.com RHT Global #: 82-62605

----- Original Message -----
From: "David Caro" <dcaroest@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: infra@ovirt.org Sent: Wednesday, April 23, 2014 7:50:58 PM Subject: Re: Intermittent Jenkins crashes
Let's try to see if it's a problem that only affects one slave, one python version, one distribution or fails anywhere. If it only affects one slave, we might just reprovision it (the one you pointed out is a vm). If it's related to a package version we can try to upgrade it, or downgrade it, or fix it (in the best case).
If it's anything else it will be more complicated to fix, and we will have to look deeper (try to reproduce manually, add traces, maybe as you say it's an issue on the RAM, but being a vm, we might expect it failing also on the host).
I started running it only on f19 slaves, to see if it happens, I'll check f20 slaves after.
Looks like I was wrong, it happens on other VMs as well, as http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8609/console shows.

So, I think we need to get at least one of those coredumps. I began to enable them temporarily, hoping to catch one of those, but it looks like we need to cast a wider net (everything reverted as I wrote).

Let's talk about this again next week (starting 2014/02/05); I'm available basically anytime (UTC+1).

Bests and thanks,

-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani
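For the record, "enabling them temporarily" on a slave would look roughly like this; a hedged sketch only, where the directory, pattern, and limit value are assumptions rather than the exact commands run on the box:

```shell
# Sketch only: not the exact commands used on the slaves.
ulimit -c unlimited                 # lift the per-shell core size limit (0 by default)
mkdir -p /var/log/core
# Have the kernel write cores as /var/log/core/core.<pid>.<timestamp>
sysctl -w kernel.core_pattern='/var/log/core/core.%p.%t'
# Note: on Fedora, abrt usually owns core_pattern (it pipes cores to abrtd),
# so plain core files only show up once that hook is overridden as above.
```

Reverting afterwards means restoring the abrt hook in core_pattern and the previous ulimit, which is presumably the "everything reverted" step above.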

On 05/02/2014 11:39 AM, Francesco Romani wrote:
----- Original Message -----
From: "David Caro" <dcaroest@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: infra@ovirt.org Sent: Wednesday, April 23, 2014 7:50:58 PM Subject: Re: Intermittent Jenkins crashes
Let's try to see if it's a problem that only affects one slave, one python version, one distribution or fails anywhere. If it only affects one slave, we might just reprovision it (the one you pointed out is a vm). If it's related to a package version we can try to upgrade it, or downgrade it, or fix it (in the best case).
If it's anything else it will be more complicated to fix, and we will have to look deeper (try to reproduce manually, add traces, maybe as you say it's an issue on the RAM, but being a vm, we might expect it failing also on the host).
I started running it only on f19 slaves, to see if it happens, I'll check f20 slaves after.
Looks like I was wrong, it happens on other VMs as well, as http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8609/console shows.
So, I think we need to get at least one of those coredumps. I began to enable them temporarily, hoping to catch one of those, but it looks like we need to cast a wider net (everything reverted as I wrote).
Let's talk about this again next week (starting 2014/02/05); I'm available basically anytime (UTC+1).
Bests and thanks,
We must have a coredump here. It was introduced in patch http://gerrit.ovirt.org/25263 which might load the libvirt lib somehow. I couldn't put my finger on the exact reason; one coredump of this crash can give us the reason for it. IMO it isn't worth too much investigation, due to our refactoring in that area which hopefully will be merged soon and replace this code, but if it's not hard to produce this coredump I'll be glad to fix it. -- Yaniv Bronhaim.

On Sun, May 04, 2014 at 09:40:23AM +0300, ybronhei wrote:
On 05/02/2014 11:39 AM, Francesco Romani wrote:
----- Original Message -----
From: "David Caro" <dcaroest@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: infra@ovirt.org Sent: Wednesday, April 23, 2014 7:50:58 PM Subject: Re: Intermittent Jenkins crashes
Let's try to see if it's a problem that only affects one slave, one python version, one distribution or fails anywhere. If it only affects one slave, we might just reprovision it (the one you pointed out is a vm). If it's related to a package version we can try to upgrade it, or downgrade it, or fix it (in the best case).
If it's anything else it will be more complicated to fix, and we will have to look deeper (try to reproduce manually, add traces, maybe as you say it's an issue on the RAM, but being a vm, we might expect it failing also on the host).
I started running it only on f19 slaves, to see if it happens, I'll check f20 slaves after.
Looks like I was wrong, it happens on other VMs as well, as http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8609/console shows.
So, I think we need to get at least one of those coredumps. I begun to enable temporarily hoping to catch one of those, but looks like we need to cast a wider net (everything reverted as I wrote).
Please let's talk this again next week (starting 2014/02/05), I'm available basically anytime (UTC+1).
Bests and thanks,
We must have a coredump here. It was introduced in patch http://gerrit.ovirt.org/25263 which might load the libvirt lib somehow. I couldn't put my finger on the exact reason; one coredump of this crash can give us the reason for it.
IMO it isn't worth too much investigation, due to our refactoring in that area which hopefully will be merged soon and replace this code, but if it's not hard to produce this coredump I'll be glad to fix it.
A reproducible segfault in Python is worth a lot of investigation. We do not know if it is limited to Jenkins slaves; it may indicate a serious bug that may bite us badly in the future. Does Mooli's refactoring make the segfault go away? (I did not check the tests myself)

On 05/06/2014 12:48 PM, Dan Kenigsberg wrote:
On Sun, May 04, 2014 at 09:40:23AM +0300, ybronhei wrote:
On 05/02/2014 11:39 AM, Francesco Romani wrote:
----- Original Message -----
From: "David Caro" <dcaroest@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: infra@ovirt.org Sent: Wednesday, April 23, 2014 7:50:58 PM Subject: Re: Intermittent Jenkins crashes
Let's try to see if it's a problem that only affects one slave, one python version, one distribution or fails anywhere. If it only affects one slave, we might just reprovision it (the one you pointed out is a vm). If it's related to a package version we can try to upgrade it, or downgrade it, or fix it (in the best case).
If it's anything else it will be more complicated to fix, and we will have to look deeper (try to reproduce manually, add traces, maybe as you say it's an issue on the RAM, but being a vm, we might expect it failing also on the host).
I started running it only on f19 slaves, to see if it happens, I'll check f20 slaves after.
Looks like I was wrong, it happens on other VMs as well, as http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8609/console shows.
So, I think we need to get at least one of those coredumps. I begun to enable temporarily hoping to catch one of those, but looks like we need to cast a wider net (everything reverted as I wrote).
Please let's talk this again next week (starting 2014/02/05), I'm available basically anytime (UTC+1).
Bests and thanks,
We must have a coredump here. It was introduced in patch http://gerrit.ovirt.org/25263 which might load the libvirt lib somehow. I couldn't put my finger on the exact reason; one coredump of this crash can give us the reason for it.
IMO it isn't worth too much investigation, due to our refactoring in that area which hopefully will be merged soon and replace this code, but if it's not hard to produce this coredump I'll be glad to fix it.
A reproducible segfault in Python is worth a lot of investigation. We do not know if it is limited to Jenkins slaves; it may indicate a serious bug that may bite us badly in the future.
Does Mooli's refactoring make the segfault go away? (I did not check the tests myself)
Still in progress; I need to check more flows with the new implementation (reviews will be very appreciated if you can - http://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:con...) I agree that investigating this segfault can be fruitful. A coredump is the only way that might assist us here to know more. Francesco, any news with getting it on the jenkins vm? -- Yaniv Bronhaim.

----- Original Message -----
From: "ybronhei" <ybronhei@redhat.com> To: "Dan Kenigsberg" <danken@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, "David Caro" <dcaroest@redhat.com>, infra@ovirt.org Sent: Wednesday, May 7, 2014 8:22:03 AM Subject: Re: Intermittent Jenkins crashes [...]
Does Mooli's refactoring make the segfault go away? (I did not check the tests myself)
Still in progress; I need to check more flows with the new implementation (reviews will be very appreciated if you can - http://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:con...)
I agree that investigating this segfault can be fruitful. A coredump is the only way that might assist us here to know more. Francesco, any news with getting it on the jenkins vm?
Not yet unfortunately. I'll do another hunting session ASAP; I'll sync with David and ask for advice, to avoid wreaking havoc on Jenkins while chasing this crash. -- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani

----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: "ybronhei" <ybronhei@redhat.com> Cc: infra@ovirt.org Sent: Wednesday, May 7, 2014 9:00:26 AM Subject: Re: Intermittent Jenkins crashes
----- Original Message -----
From: "ybronhei" <ybronhei@redhat.com> To: "Dan Kenigsberg" <danken@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, "David Caro" <dcaroest@redhat.com>, infra@ovirt.org Sent: Wednesday, May 7, 2014 8:22:03 AM Subject: Re: Intermittent Jenkins crashes [...]
Does Mooli's refactoring make the segfault go away? (I did not check the tests myself)
Still in progress; I need to check more flows with the new implementation (reviews will be very appreciated if you can - http://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:con...)
I agree that investigating this segfault can be fruitful. A coredump is the only way that might assist us here to know more. Francesco, any news with getting it on the jenkins vm?
Not yet unfortunately. I'll do another hunting session ASAP; I'll sync with David and ask for advice, to avoid wreaking havoc on Jenkins while chasing this crash.
With the help of David, I eventually managed to get a coredump.

First and foremost, if anyone has ssh access, on this box there is everything:

ssh jenkins-slave-vm03.ovirt.org

the box is a FC19, up-to-date as of 20140508

cores are in /var/log/core

the most recent coredump to use is core.6729.*

I copied the xz-compressed coredumps, the output of `t a a bt full` and the package listing on the box here

https://drive.google.com/#folders/0B9ZpeH8QzH5rY1NEZUpwZUw2bzQ

I'd like to use space on fedorapeople.org, but in order to activate my space I got stuck here http://fedoraproject.org/wiki/Infrastructure/fedorapeople.org#Accessing_Your... -> You must be sponsored in a group (other than the CLA groups) - so I used gdrive to speed up the process and to avoid wasting more time.

I'm investigating the dump and the backtrace as I write.

Let me know if anyone has trouble downloading anything or if there is something more which needs to be uploaded.

-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani
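For whoever opens the core, a first pass usually looks like the following; this is a generic gdb recipe, not necessarily what was run on the box. The interpreter path is an assumption, and `t a a bt full` is gdb shorthand for `thread apply all bt full`:

```shell
# Generic recipe, assuming the debuginfo packages are installed so Python
# frames resolve (e.g. `debuginfo-install python` on Fedora).
gdb /usr/bin/python2.7 /var/log/core/core.6729.* <<'EOF'
set pagination off
thread apply all bt full    # full backtrace of every thread (t a a bt full)
info sharedlibrary          # which libraries were mapped, and at what addresses
quit
EOF
```

With `info sharedlibrary` output in hand, the 0x88637 offset from the dmesg lines can be turned into an absolute address inside the loaded libpython and inspected directly.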

----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: "ybronhei" <ybronhei@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Cc: infra@ovirt.org Sent: Thursday, May 8, 2014 12:55:34 PM Subject: Re: Intermittent Jenkins crashes
----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: "ybronhei" <ybronhei@redhat.com> Cc: infra@ovirt.org Sent: Wednesday, May 7, 2014 9:00:26 AM Subject: Re: Intermittent Jenkins crashes
----- Original Message -----
From: "ybronhei" <ybronhei@redhat.com> To: "Dan Kenigsberg" <danken@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, "David Caro" <dcaroest@redhat.com>, infra@ovirt.org Sent: Wednesday, May 7, 2014 8:22:03 AM Subject: Re: Intermittent Jenkins crashes [...]
Does Mooli's refactoring make the segfault go away? (I did not check the tests myself)
still in progress and need to check more flows with the new implementation (reviews will be very appreciated if you can - http://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:con...)
I agree that investigate this segfault can be fruitful. coredump is the only way that might assist us here to know more. Francesco, any news with getting it on the jenkins vm?
Not yet unfortunately. I'll do another hunting session ASAP; I'll sync David and ask advice to avoid to wreak havoc to jenkins while chasing this crash.
With the help of David, I eventually managed to get a coredump.
First and foremost, if anyone has ssh access, on this box there is everything:
ssh jenkins-slave-vm03.ovirt.org
the box is a FC19, up-to-date as of 20140508
cores are on /var/log/core
the most up-to-date and recent coredump to use is core.6729.*
I copied the xz-compressed coredumps, the output of `t a a bt full` and the package listing on the box here
https://drive.google.com/#folders/0B9ZpeH8QzH5rY1NEZUpwZUw2bzQ
wrong link, please use https://drive.google.com/folderview?id=0B9ZpeH8QzH5rY1NEZUpwZUw2bzQ&usp=drive_web
I'd like to use space on fedorapeople.org but in order to activate my space I got stuck here http://fedoraproject.org/wiki/Infrastructure/fedorapeople.org#Accessing_Your...
-> You must be sponsored in a group (other than the CLA groups)
and I used gdrive to speed up the process and to avoid wasting more time
I'm investigating the dump and the backtrace as I write.
Let me know if anyone has trouble downloading anything or if there is something more which needs to be uploaded.
-- Francesco Romani RedHat Engineering Virtualization R & D Phone: 8261328 IRC: fromani
participants (4)
- Dan Kenigsberg
- David Caro
- Francesco Romani
- ybronhei