Re: [ovirt-users] VM get stuck randomly

30 Mar 2016

Hi Kevin,

Another host went down, so I have prepared the info for this one.

I could not SSH to it anymore.
Console would show login screen, but no keystrokes were registered.

I could “suspend” the VM and “run” it, but still can’t SSH to it.
Before suspension, all QEMU threads were at around 0% CPU; after resuming, 3 of them hover at 100%.

Attached you can find the gdb output, core dump, and other logs.

Logs: https://dl.dropboxusercontent.com/u/63261/ubuntu2-logs.tar.gz

Core Dump: https://dl.dropboxusercontent.com/u/63261/core-ubuntu2.tar.gz

Is there anything else we could provide?

Since this is a test machine, I will leave it “hanging” for now.

Best,

Dr Christophe Trefois, Dipl.-Ing.  
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine  
6, avenue du Swing 
L-4367 Belvaux  
T: +352 46 66 44 6124 
F: +352 46 66 44 6949  
http://www.uni.lu/lcsb

----
This message is confidential and may contain privileged information. 
It is intended for the named recipient only. 
If you receive it in error please notify me and permanently delete the original message and any copies. 
----
...
On 29 Mar 2016, at 15:40, Kevin Wolf <kwolf@redhat.com> wrote:
On 27.03.2016 at 22:38, Christophe TREFOIS wrote:
...
Hi,
MS did not like my previous email, so here it is again with a link to Dropbox
instead of an attachment.
——
Hi Nir,
Inside the core dump tarball is also the output of the two gdb commands you
mentioned.
Understandably, you might not want to download the big files for that, so I
attached them here separately.
The gdb dump looks pretty much like an idle qemu that just sits there
and waits for events. The vcpu threads seem to be running guest code,
the I/O thread and SPICE thread are in poll() waiting for events to
respond to, and finally the RCU thread is idle as well.
Does the qemu process still respond to monitor commands, so for example
can you still pause and resume the guest?
Kevin
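
One way to check monitor responsiveness from the host, as a minimal sketch: it assumes libvirt's `virsh` is usable on the host (on an oVirt host it may ask for credentials) and wraps the call in `timeout` so that a hung monitor does not also hang the shell. The function name `monitor_responds` and the 10-second default are illustrative, not something from this thread.

```shell
# Sketch: ask the guest's QEMU monitor for its status via libvirt.
# "monitor_responds" is a hypothetical helper name; the domain name is
# whatever `virsh list` shows for the stuck VM.
monitor_responds() {
    local domain="$1"
    local secs="${2:-10}"   # give up after this many seconds (assumed default)

    # `info status` is a harmless HMP query; if it never returns, the
    # monitor itself is stuck, not just the guest.
    if timeout "$secs" virsh qemu-monitor-command "$domain" --hmp 'info status'; then
        echo "monitor responded"
    else
        echo "monitor did not respond within ${secs}s (or virsh failed)"
        return 1
    fi
}
```

Calling `monitor_responds ubuntu2 10` would print the status line from QEMU followed by "monitor responded" if the monitor is alive; `virsh suspend` and `virsh resume` can then be tried the same way.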
...
For the other logs, here you go.
For gluster I didn’t know which, so I sent all.
I got the icinga notification at 17:06 CEST on March 27th (today). So for vdsm,
I provided logs from 16h-18h.
The check said that the VM was down for 11 minutes at that time.
https://dl.dropboxusercontent.com/u/63261/bioservice-1.tar.gz
Please do let me know if there is anything else I can provide.
Best regards,
...
On 27 Mar 2016, at 21:24, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Mar 27, 2016 at 8:39 PM, Christophe TREFOIS
<christophe.trefois@uni.lu> wrote:
...
Hi Nir,
Here is another one, this time with strace of children and gdb dump.
Interestingly, this time qemu seems stuck at 0%, vs. 100% in the other cases.
The files for strace are attached.
Hopefully Kevin can take a look.
...
The gdb output and core dump can be found here (too big to attach):
https://dl.dropboxusercontent.com/u/63261/gdb-core.tar.gz
I think it will be more useful to extract a traceback of all threads
and send the tiny traceback.
gdb --pid <qemu pid> --batch --eval-command='thread apply all bt'
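
A minimal sketch of wrapping that command so the traceback lands in a small text file that can be attached directly. The function name `dump_all_threads` and the default output file name are assumptions for illustration; the gdb invocation itself is the one given above.

```shell
# Sketch: write a backtrace of all threads of a running process to a file.
# "dump_all_threads" and the default file name are hypothetical.
dump_all_threads() {
    local pid="$1"
    local outfile="${2:-qemu-${pid}-threads.txt}"

    # Refuse anything that is not a plain numeric PID.
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: dump_all_threads <qemu pid> [outfile]" >&2
            return 2
            ;;
    esac

    # Same gdb command as above; stdout and stderr both go to the file.
    gdb --pid "$pid" --batch --eval-command='thread apply all bt' > "$outfile" 2>&1
}
```

Usage would be `dump_all_threads <qemu pid> threads.txt`, run as a user allowed to attach to the qemu process.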
...
If it helps, most machines get stuck on the host hosting the self-hosted
engine, which runs a local 1-node glusterfs.
Getting /var/log/messages and the sanlock, vdsm, glusterfs and
libvirt logs for this timeframe
would also be helpful.
Nir
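
A minimal sketch of bundling those logs into one tarball. The helper name `collect_logs` is hypothetical, and the paths in the commented example are the usual defaults on an oVirt host, which is an assumption; adjust them to the actual setup. It packs whole files and does not trim them to the timeframe.

```shell
# Sketch: pack whichever of the given log paths exist into one tarball.
# "collect_logs" is a hypothetical helper name.
collect_logs() {
    local outfile="$1"
    shift

    # Rebuild the argument list with only the paths that exist, so tar
    # does not fail on a log that is missing on this host.
    local p
    for p in "$@"; do
        shift
        if [ -e "$p" ]; then
            set -- "$@" "$p"
        fi
    done

    if [ "$#" -eq 0 ]; then
        echo "no log paths found" >&2
        return 1
    fi

    tar -czf "$outfile" "$@"
}

# Example call (paths are assumed defaults, not confirmed in this thread):
# collect_logs "logs-$(hostname).tar.gz" /var/log/messages /var/log/sanlock.log \
#     /var/log/vdsm /var/log/glusterfs /var/log/libvirt
```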
...
Thank you for your help,
—
Christophe
...
On 25 Mar 2016, at 11:53, Nir Soffer <nsoffer@redhat.com> wrote:
gdb --pid <qemu pid> --batch --eval-command='thread apply all bt'