Hi infra,
Recently, tests started to fail at random because the Python interpreter
is crashing.
For example (many other runs fail the same way):
http://jenkins.ovirt.org/job/vdsm_master_unit_tests_gerrit/8370/console
LibvirtModuleConfigureTests
testLibvirtConfigureToSSLFalse
../tests/run_tests_local.sh: line 10: 31835 Segmentation fault
PYTHONDONTWRITEBYTECODE=1 LC_ALL=C
PYTHONPATH="../lib:../vdsm:../client:../vdsm_api:$PYTHONPATH"
"$PYTHON_EXE" ../tests/testrunner.py --local-modules $@
Quite often, re-running the same tests using the Jenkins manual trigger
or uploading a new version of the affected patch seems to somehow fix the crash.
I have ssh access to the affected box, so I investigated further.
The main, if not only, lead those crashes leave behind is a laconic line in dmesg:
[8855948.327687] python[10418]: segfault at 1 ip 00000036f2c88637 sp 00007fffda3c3a60
error 4 in libpython2.7.so.1.0[36f2c00000+178000]
The error code sometimes varies; the fault address (1) and the instruction pointer do not.
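
For reference, the error code in that line can be read bit by bit. Below is a minimal sketch, assuming the standard x86 page-fault error code layout. The name decode_pf_error is a hypothetical helper of mine, not part of any tool mentioned here.

```python
# Sketch: decode the "error N" field of a kernel "segfault at" line.
# Assumes the x86 page-fault error code layout; decode_pf_error is a
# hypothetical helper name used only for illustration.
PF_BITS = [
    # (mask, meaning when set, meaning when clear or None)
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(code):
    """Return human-readable descriptions for a page-fault error code."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear is not None:
            out.append(if_clear)
    return out


print(decode_pf_error(4))
# → ['page not present', 'read access', 'user mode']
```

Under that assumption, "error 4" would be a user-mode read of an address with no page mapped, which matches the fault address of 1 reported in the line.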
So, I followed
http://enki-tech.blogspot.it/2012/08/debugging-c-part-3-dmesg.html
and found the following:
[root@jenkins-slave-vm02 ~]# ./getcrash.sh '[9141800.034517] python[11612]: segfault
at 1 ip 00000036f2c88637 sp 00007fffe1127c50 error 4 in
libpython2.7.so.1.0[36f2c00000+178000]'
Segmentation fault in libpython2.7.so.1.0 at: 0x88637.
[root@jenkins-slave-vm02 ~]# gdb /usr/lib64/libpython2.7.so.1.0
GNU gdb (GDB) Fedora 7.6.50.20130731-19.fc20
[...]
Reading symbols from /usr/lib64/libpython2.7.so.1.0...Reading symbols from
/usr/lib/debug/usr/lib64/libpython2.7.so.1.0.debug...done.
done.
(gdb) disass 0x88637
No function contains specified address.
(gdb)
(getcrash.sh is a copy of the script presented in the page linked above)
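
The 0x88637 offset printed above is the instruction pointer minus the mapping base that the kernel reports in square brackets. The script itself is not reproduced here, so this is only a sketch of that arithmetic in Python. The names SEGV_RE and parse_segfault are hypothetical, and the regular expression assumes the line format shown in this mail.

```python
import re

# Sketch: extract the in-library offset from a kernel segfault line.
# SEGV_RE and parse_segfault are hypothetical names; the pattern assumes
# the "segfault at ... ip ... sp ... error ... in obj[base+size]" format
# seen in the dmesg output quoted in this mail.
SEGV_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Parse one segfault line; all numeric fields are hexadecimal."""
    # Collapse whitespace so a line wrapped by the mail client still matches.
    line = " ".join(line.split())
    m = SEGV_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "error": int(m.group("err"), 16),
        "object": m.group("obj"),
        "offset": ip - base,  # position of the faulting instruction in the mapping
    }


sample = ("[9141800.034517] python[11612]: segfault at 1 ip 00000036f2c88637 "
          "sp 00007fffe1127c50 error 4 in libpython2.7.so.1.0[36f2c00000+178000]")
info = parse_segfault(sample)
print(info["object"], hex(info["offset"]))
# → libpython2.7.so.1.0 0x88637
```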
The only explanation I can come up with from all of the above is a faulty RAM bank, but this
is little more than a wild guess.
Any suggestions on how to go further?
Thanks,
--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani