Intermittent Jenkins crashes

David Caro dcaroest at redhat.com
Wed Apr 23 17:50:58 UTC 2014


On Wed 23 Apr 2014 05:51:23 PM CEST, Francesco Romani wrote:
> Sorry, forgot to add.
>
> By "main, if not only lead" I mean:
> * /var/crash is empty
> * abrt-cli list yields nothing relevant
>
> Bests,
>
> ----- Original Message -----
>> From: "Francesco Romani" <fromani at redhat.com>
>> To: infra at ovirt.org
>> Sent: Wednesday, April 23, 2014 5:48:46 PM
>> Subject: Intermittent Jenkins crashes
>>
>> Hi infra
>>
>> Recently tests started to fail quite randomly due to the python interpreter
>> crashing.
>>
>> For example (many others look like this):
>> http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8370/console
>>
>> LibvirtModuleConfigureTests
>>     testLibvirtConfigureToSSLFalse
>>     ../tests/run_tests_local.sh: line 10: 31835
>>     Segmentation fault      PYTHONDONTWRITEBYTECODE=1 LC_ALL=C
>>     PYTHONPATH="../lib:../vdsm:../client:../vdsm_api:$PYTHONPATH"
>>     "$PYTHON_EXE" ../tests/testrunner.py --local-modules $@
>>
>> Quite often, re-running the same tests via the Jenkins manual trigger,
>> or uploading a new version of the affected patch, seems to make the
>> crash go away.
>>
>> I have ssh access to the affected box, so I investigated further.
>>
>> The main, if not only, lead these crashes leave behind is a terse kernel log line:
>>
>> [8855948.327687] python[10418]: segfault at 1 ip 00000036f2c88637 sp
>> 00007fffda3c3a60 error 4 in libpython2.7.so.1.0[36f2c00000+178000]
>>
>> The error code sometimes varies; the fault address and instruction pointer do not.
>> So, I followed
>> http://enki-tech.blogspot.it/2012/08/debugging-c-part-3-dmesg.html
>>
>> and found the following:
>>
>> [root at jenkins-slave-vm02 ~]# ./getcrash.sh '[9141800.034517] python[11612]:
>> segfault at 1 ip 00000036f2c88637 sp 00007fffe1127c50 error 4 in
>> libpython2.7.so.1.0[36f2c00000+178000]'
>> Segmentation fault in libpython2.7.so.1.0 at: 0x88637.
>> [root at jenkins-slave-vm02 ~]# gdb /usr/lib64/libpython2.7.so.1.0
>> GNU gdb (GDB) Fedora 7.6.50.20130731-19.fc20
>> [...]
>> Reading symbols from /usr/lib64/libpython2.7.so.1.0...Reading symbols from
>> /usr/lib/debug/usr/lib64/libpython2.7.so.1.0.debug...done.
>> done.
>> (gdb) disass 0x88637
>> No function contains specified address.
>> (gdb)
>>
>> (getcrash.sh is a copy of the script presented in the page linked above)
>>
>> The only explanation I can come up with from all of the above is a faulty RAM
>> bank, but this is little more than a wild guess.
>>
>> Any suggestions on how to dig further?
>>
>> Thanks,
>>
>> --
>> Francesco Romani
>> RedHat Engineering Virtualization R & D
>> Phone: 8261328
>> IRC: fromani
>>
>

Let's first find out whether the problem affects only one slave, one
Python version, or one distribution, or whether it fails everywhere. If
it only affects one slave, we can simply reprovision it (the one you
pointed out is a VM). If it is tied to a package version, we can try
upgrading or downgrading the package or, in the best case, fixing it.

If it's anything else, it will be harder to fix and we will have to dig
deeper (try to reproduce it manually, add traces). It might be a RAM
issue, as you say, but since the slave is a VM we would expect failures
on the host as well.


I have started running the tests on the f19 slaves only, to see whether
the crash happens there. I'll check the f20 slaves afterwards.
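
For reference, the kernel line quoted above can be decoded mechanically.
The sketch below is only an illustration, not the getcrash.sh script from
the linked page. It parses the segfault line, subtracts the mapping base
from the instruction pointer to get the offset inside the library (the
0x88637 reported above), and decodes the low bits of the x86 page-fault
error code. The function name parse_segfault and the dictionary keys are
made up for this example, and the regular expression assumes the line
format shown in this thread.

```python
import re

# Matches the x86 kernel segfault report in the form quoted in this thread.
# All numeric fields are hexadecimal except the pid.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Parse one dmesg segfault line (hypothetical helper for this example)."""
    # Mail clients wrap long lines, so collapse all whitespace first.
    match = SEGFAULT_RE.search(" ".join(line.split()))
    if match is None:
        raise ValueError("not a segfault line: %r" % line)
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    err = int(match.group("err"), 16)
    return {
        "process": match.group("comm"),
        "pid": int(match.group("pid")),
        "library": match.group("lib"),
        "fault_addr": int(match.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped library.
        "offset": ip - base,
        # Low bits of the x86 page-fault error code.
        "page_present": bool(err & 1),  # False: the page was not mapped
        "write": bool(err & 2),         # False: the access was a read
        "user_mode": bool(err & 4),     # True: the fault came from user space
    }


if __name__ == "__main__":
    line = ("[8855948.327687] python[10418]: segfault at 1 ip 00000036f2c88637 sp "
            "00007fffda3c3a60 error 4 in libpython2.7.so.1.0[36f2c00000+178000]")
    info = parse_segfault(line)
    print(info["library"], hex(info["offset"]))
```

For the line quoted above this gives offset 0x88637, fault address 1 and
error 4, that is, a user-mode read of an unmapped page at a near-NULL
address.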

--
David Caro

Red Hat S.L.
Continuous Integration Engineer - EMEA ENG Virtualization R&D

Email: dcaro at redhat.com
Web: www.redhat.com
RHT Global #: 82-62605
