Re: [ovirt-users] VM get stuck randomly

25 Mar 2016

      On Thu, Mar 24, 2016 at 10:43 PM, Christophe TREFOIS
<christophe.trefois@uni.lu> wrote:
...
Hi Nir,
I restarted the VM now so I can't provide more info until the next time.
I could try strace -p <pid> -f &> strace.log next time it hangs.
Could you just point me on how to obtain a dump with gdb?
I think you should install the debug info package for qemu, something like:

    debuginfo-install qemu-kvm-ev

Then you can extract a backtrace of all threads like this:

    gdb --pid <qemu pid> --batch --eval-command='thread apply all bt'

Sometimes "bt full" returns more useful info:

    gdb --pid <qemu pid> --batch --eval-command='thread apply all bt full'

To generate a core dump you can do:

    gcore -o filename <qemu pid>

This is a generic way that works with anything; there may be a better
qemu-specific way.
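
The gdb and gcore steps above could be wrapped in a small helper so that everything is captured in one go the next time the VM hangs. This is only a sketch: the function name collect_qemu_debug and the output file names are made up for illustration, and it assumes gdb and gcore are installed and that it runs as root on the host.

```shell
# Hypothetical helper (not part of qemu, vdsm or oVirt): collect a full
# backtrace of all threads and a core dump from a running process.
# Usage: collect_qemu_debug <pid> [output-directory]
collect_qemu_debug() {
    pid="$1"
    outdir="${2:-.}"

    # Stop early if the process is already gone.
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi

    # Backtrace of every thread, including local variables.
    gdb --pid "$pid" --batch --eval-command='thread apply all bt full' \
        > "$outdir/qemu-$pid-backtrace.txt" 2>&1

    # Core dump; gcore appends ".<pid>" to the name given with -o.
    gcore -o "$outdir/qemu-core" "$pid"
}
```

For example, collect_qemu_debug 15241 /var/tmp would write both files under /var/tmp. The core file can be about as large as the memory used by the guest, so the target directory needs enough free space.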

Nir
...
Do I have to do anything special in order to catch the required contents?
For the idle vs stuck in a loop, I guess the VM has 4 children qemu threads, and one of them was at 100%.
Thank you for your help,
--
Christophe
...
-----Original Message-----
From: Nir Soffer [mailto:nsoffer@redhat.com]
Sent: Thursday, 24 March 2016 20:17
To: Christophe TREFOIS <christophe.trefois@uni.lu>; Kevin Wolf
<kwolf@redhat.com>; Francesco Romani <fromani@redhat.com>
Cc: users <users@ovirt.org>; lcsb-sysadmins <lcsb-sysadmins@uni.lu>
Subject: Re: [ovirt-users] VM get stuck randomly
On Thu, Mar 24, 2016 at 7:51 PM, Christophe TREFOIS
<christophe.trefois@uni.lu> wrote:
...
Hi Nir,
And the second one is down now too. See some comments below.
...
On 13 Mar 2016, at 12:51, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
<christophe.trefois@uni.lu> wrote:
...
Dear all,
I have had a problem for a couple of weeks, where randomly one VM (not
always the same) becomes completely unresponsive.
We find this out because our Icinga server complains that the host is down.
Upon inspection, we find we can’t open a console to the VM, nor can we
login.
In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
usage shows 0% and CPU usage shows 100% or 75% depending on number of
cores.
The only way to recover is to force a shutdown of the VM by issuing
shutdown twice from the engine.
Could you please help me to start debugging this?
I can provide any logs, but I’m not sure which ones, because I couldn’t
see anything with ERROR in the vdsm logs on the host.
I would inspect this vm on the host when it happens.
What is the vdsm CPU usage? What is the CPU usage of the qemu process
(for this VM)?
vdsm cpu usage is going up and down to 15%.
qemu process usage for the VM was 0, except for 1 of the threads “stuck”
at 100%, rest was idle.
0% may be a deadlock and 100% a thread stuck in an endless loop, but this is just a
wild guess.
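
To tell which qemu thread is the busy one, the per-thread CPU counters under /proc can be read directly. The sketch below is an illustration only: list_thread_cpu_ticks is a made-up name, and it assumes a Linux host with /proc mounted. It prints the thread ID, the accumulated CPU time in clock ticks (utime + stime) and the thread name, busiest first.

```shell
# Hypothetical helper: list the threads of a process with their accumulated
# CPU time in clock ticks, sorted with the largest value first.
# Usage: list_thread_cpu_ticks <pid>
list_thread_cpu_ticks() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        # Skip the unexpanded glob (no such process) and vanished threads.
        [ -r "$task/stat" ] || continue
        tid="${task##*/}"
        name="$(cat "$task/comm" 2>/dev/null)"
        # Drop the leading "pid (comm) " so that utime and stime become
        # fields 12 and 13 of what remains.
        ticks="$(sed 's/.*) //' "$task/stat" | awk '{ print $12 + $13 }')"
        printf '%s %s %s\n' "$tid" "$ticks" "$name"
    done | sort -k2,2 -nr
}
```

Running it twice a few seconds apart shows which thread is still accumulating ticks; a thread whose counter does not move at all is idle or blocked. The thread ID is the same number that gdb prints as LWP in its thread list, so the busy thread can be matched to its backtrace. For an interactive view, top -H -p <qemu pid> shows the same per-thread breakdown.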
...
...
strace output of this qemu process (all threads) or a core dump can
help qemu developers to understand this issue.
I attached an strace on the process for:
qemu     15241 10.6  0.4 4742904 1934988 ?     Sl   Mar23 131:41 /usr/libexec/qemu-kvm
    -name test-ubuntu-uni-lu
    -S
    -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off
    -cpu SandyBridge
    -m size=4194304k,slots=16,maxmem=4294967296k
    -realtime mlock=off
    -smp 4,maxcpus=64,sockets=16,cores=4,threads=1
    -numa node,nodeid=0,cpus=0-3,mem=4096
    -uuid 754871ec-0339-4a65-b490-6a766aaea537
    -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=7-2.1511.el7.centos.2.10,serial=4C4C4544-0048-4610-8052-B4C04F575831,uuid=754871ec-0339-4a65-b490-6a766aaea537
    -no-user-config
    -nodefaults
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-test-ubuntu-uni-lu/monitor.sock,server,nowait
    -mon chardev=charmonitor,id=monitor,mode=control
    -rtc base=2016-03-23T22:06:01,driftfix=slew
    -global kvm-pit.lost_tick_policy=discard
    -no-hpet
    -no-shutdown
    -boot strict=on
    -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
    -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
    -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5
    -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial=
    -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
    -drive file=/rhev/data-center/00000002-0002-0002-0002-0000000003d5/8253a89b-651e-4ff4-865b-57adef05d383/images/9d60ae41-bf17-48b4-b0e6-29625b248718/47a6916c-c902-4ea3-8dfb-a3240d7d9515,if=none,id=drive-virtio-disk0,format=qcow2,serial=9d60ae41-bf17-48b4-b0e6-29625b248718,cache=none,werror=stop,rerror=stop,aio=threads
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
    -netdev tap,fd=108,id=hostnet0,vhost=on,vhostfd=109
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e5:12:0f,bus=pci.0,addr=0x3,bootindex=2
    -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.com.redhat.rhevm.vdsm,server,nowait
    -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm
    -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.org.qemu.guest_agent.0,server,nowait
    -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0
    -device usb-tablet,id=input0
    -vnc 10.79.2.2:76,password
    -device cirrus-vga,id=video0,bus=pci.0,addr=0x2
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
    -msg timestamp=on
...
http://paste.fedoraproject.org/344756/84131214
You connected only to one thread. I would try to use -f to see all threads, or
connect with gdb and get a backtrace of all threads.
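
A time-limited trace of all threads could look roughly like the sketch below. The function name trace_all_threads, the 30-second default and the log file name are assumptions for illustration; it relies on strace and the coreutils timeout command being available on the host.

```shell
# Hypothetical helper: trace the system calls of every thread of a process
# for a limited time and write the result to a log file.
# Usage: trace_all_threads <pid> [seconds] [logfile]
trace_all_threads() {
    pid="$1"
    seconds="${2:-30}"
    out="${3:-strace-$pid.log}"

    # Stop early if the process is already gone.
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi

    # -f  attach to all threads of the process, not only the main one
    # -tt prefix each line with a timestamp
    # -o  write the trace to a file instead of stderr
    # timeout exits with status 124 when it had to stop strace, which is
    # the expected outcome here.
    timeout "$seconds" strace -f -tt -p "$pid" -o "$out"
}
```

With -f, each line in the log starts with the ID of the thread that made the call, so a thread that produces no output at all while sitting at 100% CPU is most likely spinning in user space rather than making system calls.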
Adding Kevin to suggest how to continue.
I think we need a qemu bug for this.
Nir
...
This is CentOS 7.2, latest patches and latest 3.6.4 oVirt.
Thank you for any help / pointers.
Could it be memory ballooning?
Best,
...
...
The host is running
OS Version:             RHEL - 7 - 1.1503.el7.centos.2.8
Kernel Version: 3.10.0 - 229.14.1.el7.x86_64
KVM Version:            2.1.2 - 23.el7_1.8.1
LIBVIRT Version:        libvirt-1.2.8-16.el7_1.4
VDSM Version:   vdsm-4.16.26-0.el7.centos
SPICE Version:  0.12.4 - 9.el7_1.3
GlusterFS Version:      glusterfs-3.7.5-1.el7
You are running old versions that are missing a lot of fixes. Nothing specific
to your problem, but this lowers the chance of getting a working system.
It would be nice if you could upgrade to ovirt-3.6 and report whether it
made any difference.
Or at least to the latest ovirt-3.5.
...
We use a locally exported gluster as storage domain (eg, storage is on
the same machine exposed via gluster). No replica.
...
...
...
We run around 50 VMs on that host.
Why use gluster for this? Do you plan to add more gluster servers in the
future?
Nir