[ovirt-users] VM get stuck randomly
Nir Soffer
nsoffer at redhat.com
Fri Mar 25 10:53:09 UTC 2016
On Thu, Mar 24, 2016 at 10:43 PM, Christophe TREFOIS
<christophe.trefois at uni.lu> wrote:
> Hi Nir,
>
> I restarted the VM now so I can't provide more info until the next time.
>
> I could try strace -p <pid> -f &> strace.log next time it hangs.
>
> Could you just point me on how to obtain a dump with gdb?
I think you should install the debuginfo package for qemu first, something like:
debuginfo-install qemu-kvm-ev
Then you can extract a backtrace of all threads like this:
gdb --pid <qemu pid> --batch --eval-command='thread apply all bt'
Sometimes "bt full" returns more useful info:
gdb --pid <qemu pid> --batch --eval-command='thread apply all bt full'
To generate a core dump you can do:
gcore -o filename <qemu pid>
This is a generic way that works with any process; there may be a better
qemu-specific way.
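Putting the steps above together, the whole collection can be scripted. This is
just a sketch: the pgrep lookup assumes qemu was started with "-name <vm-name>"
(as oVirt does), and the output file names are arbitrary choices of mine.

```shell
# Sketch: collect gdb backtraces and a core dump for a stuck VM's qemu
# process. Assumes the qemu debuginfo package is installed so symbols
# resolve, and that we run as root on the host.
collect_qemu_debug() {
    vm_name="$1"

    # Find the qemu process for this VM by its "-name <vm-name>" argument.
    pid=$(pgrep -f -- "-name ${vm_name}" | head -n 1)
    if [ -z "${pid}" ]; then
        echo "no qemu process found for VM ${vm_name}" >&2
        return 1
    fi

    # Backtraces of all threads; "bt full" also dumps local variables.
    gdb --pid "${pid}" --batch --eval-command='thread apply all bt' \
        > "bt-${vm_name}.log"
    gdb --pid "${pid}" --batch --eval-command='thread apply all bt full' \
        > "bt-full-${vm_name}.log"

    # Core dump for offline analysis (gcore ships with gdb).
    gcore -o "core-${vm_name}" "${pid}"
}
```

Run it on the host while the VM is stuck, e.g. collect_qemu_debug
test-ubuntu-uni-lu, and attach the resulting logs.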
Nir
> Do I have to do anything special in order to catch the required contents?
>
> Regarding idle vs. stuck in a loop: the qemu process for the VM has 4 child
> threads, and one of them was at 100%.
>
> Thank you for your help,
>
> --
> Christophe
>
>> -----Original Message-----
>> From: Nir Soffer [mailto:nsoffer at redhat.com]
>> Sent: jeudi 24 mars 2016 20:17
>> To: Christophe TREFOIS <christophe.trefois at uni.lu>; Kevin Wolf
>> <kwolf at redhat.com>; Francesco Romani <fromani at redhat.com>
>> Cc: users <users at ovirt.org>; lcsb-sysadmins <lcsb-sysadmins at uni.lu>
>> Subject: Re: [ovirt-users] VM get stuck randomly
>>
>> On Thu, Mar 24, 2016 at 7:51 PM, Christophe TREFOIS
>> <christophe.trefois at uni.lu> wrote:
>> > Hi Nir,
>> >
>> > And the second one is down now too. See some comments below.
>> >
>> >> On 13 Mar 2016, at 12:51, Nir Soffer <nsoffer at redhat.com> wrote:
>> >>
>> >> On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
>> >> <christophe.trefois at uni.lu> wrote:
>> >>> Dear all,
>> >>>
>> >>> For a couple of weeks now I have had a problem where, at random, one VM
>> (not always the same one) becomes completely unresponsive.
>> >>> We find this out because our Icinga server complains that the host is down.
>> >>>
>> >>> Upon inspection, we find we can’t open a console to the VM, nor can we
>> log in.
>> >>>
>> >>> In the oVirt engine, the VM appears to be “up”. The only odd thing is that
>> RAM usage shows 0% and CPU usage shows 100% or 75%, depending on the
>> number of cores.
>> >>> The only way to recover is to force a shutdown of the VM by issuing
>> shutdown twice from the engine.
>> >>>
>> >>> Could you please help me to start debugging this?
>> >>> I can provide any logs, but I’m not sure which ones, because I couldn’t
>> see anything with ERROR in the vdsm logs on the host.
>> >>
>> >> I would inspect this vm on the host when it happens.
>> >>
>> >> What is vdsm cpu usage? what is the qemu process (for this vm) cpu
>> usage?
>> >
>> > vdsm CPU usage fluctuates, peaking at around 15%.
>> >
>> > qemu process CPU usage for the VM was 0%, except for one thread “stuck”
>> at 100%; the rest were idle.
>>
>> 0% may indicate a deadlock, and 100% a thread stuck in an endless loop, but
>> this is just a wild guess.
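To tell the two cases apart, per-thread CPU usage can be listed with ps -L. A
runnable sketch, using the current shell's pid only as a stand-in for the real
qemu pid:

```shell
# Show per-thread CPU usage of a process. In practice substitute the
# qemu pid for $$; $$ (this shell) is used here only so the example runs
# standalone. The TID column is the LWP id that gdb shows per thread,
# so a thread spinning at 100% here can be matched to its backtrace.
ps -L -p $$ -o tid,pcpu,state,comm
```

A thread pinned at 100% in %CPU points at a busy loop; all threads at 0% on an
unresponsive VM points more toward a deadlock.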
>>
>> >
>> >>
>> >> strace output of this qemu process (all threads) or a core dump can
>> >> help qemu developers to understand this issue.
>> >
>> > I attached an strace on the process for:
>> >
>> > qemu 15241 10.6 0.4 4742904 1934988 ? Sl Mar23 131:41 \
>> >   /usr/libexec/qemu-kvm -name test-ubuntu-uni-lu -S \
>> >   -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu SandyBridge \
>> >   -m size=4194304k,slots=16,maxmem=4294967296k -realtime mlock=off \
>> >   -smp 4,maxcpus=64,sockets=16,cores=4,threads=1 \
>> >   -numa node,nodeid=0,cpus=0-3,mem=4096 \
>> >   -uuid 754871ec-0339-4a65-b490-6a766aaea537 \
>> >   -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=7-2.1511.el7.centos.2.10,serial=4C4C4544-0048-4610-8052-B4C04F575831,uuid=754871ec-0339-4a65-b490-6a766aaea537 \
>> >   -no-user-config -nodefaults \
>> >   -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-test-ubuntu-uni-lu/monitor.sock,server,nowait \
>> >   -mon chardev=charmonitor,id=monitor,mode=control \
>> >   -rtc base=2016-03-23T22:06:01,driftfix=slew \
>> >   -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown \
>> >   -boot strict=on \
>> >   -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
>> >   -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
>> >   -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 \
>> >   -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= \
>> >   -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
>> >   -drive file=/rhev/data-center/00000002-0002-0002-0002-0000000003d5/8253a89b-651e-4ff4-865b-57adef05d383/images/9d60ae41-bf17-48b4-b0e6-29625b248718/47a6916c-c902-4ea3-8dfb-a3240d7d9515,if=none,id=drive-virtio-disk0,format=qcow2,serial=9d60ae41-bf17-48b4-b0e6-29625b248718,cache=none,werror=stop,rerror=stop,aio=threads \
>> >   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>> >   -netdev tap,fd=108,id=hostnet0,vhost=on,vhostfd=109 \
>> >   -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e5:12:0f,bus=pci.0,addr=0x3,bootindex=2 \
>> >   -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.com.redhat.rhevm.vdsm,server,nowait \
>> >   -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm \
>> >   -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.org.qemu.guest_agent.0,server,nowait \
>> >   -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 \
>> >   -device usb-tablet,id=input0 -vnc 10.79.2.2:76,password \
>> >   -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
>> >   -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 \
>> >   -msg timestamp=on
>> >
>> > http://paste.fedoraproject.org/344756/84131214
>>
>> You connected only to one thread. I would try to use -f to see all threads, or
>> connect with gdb and get a backtrace of all threads.
>>
>> Adding Kevin to suggest how to continue.
>>
>> I think we need a qemu bug for this.
>>
>> Nir
>>
>> >
>> > This is CentOS 7.2, latest patches and latest 3.6.4 oVirt.
>> >
>> > Thank you for any help / pointers.
>> >
>> > Could it be memory ballooning?
>> >
>> > Best,
>> >
>> >>
>> >>>
>> >>> The host is running
>> >>>
>> >>> OS Version: RHEL - 7 - 1.1503.el7.centos.2.8
>> >>> Kernel Version: 3.10.0 - 229.14.1.el7.x86_64
>> >>> KVM Version: 2.1.2 - 23.el7_1.8.1
>> >>> LIBVIRT Version: libvirt-1.2.8-16.el7_1.4
>> >>> VDSM Version: vdsm-4.16.26-0.el7.centos
>> >>> SPICE Version: 0.12.4 - 9.el7_1.3
>> >>> GlusterFS Version: glusterfs-3.7.5-1.el7
>> >>
>> >> You are running old versions and missing a lot of fixes. Nothing specific
>> >> to your problem, but this lowers the chance of getting a working system.
>> >>
>> >> It would be nice if you could upgrade to ovirt-3.6 and report whether it
>> >> makes any difference, or at least to the latest ovirt-3.5.
>> >>
>> >>> We use a locally exported gluster volume as the storage domain (i.e., the
>> storage is on the same machine, exposed via gluster). No replica.
>> >>> We run around 50 VMs on that host.
>> >>
>> >> Why use gluster for this? Do you plan to add more gluster servers in the
>> future?
>> >>
>> >> Nir
>> >