Hi Nir,
And the second one is down now too. See some comments below.
On 13 Mar 2016, at 12:51, Nir Soffer <nsoffer(a)redhat.com>
wrote:
On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
<christophe.trefois(a)uni.lu> wrote:
> Dear all,
>
> I have had a problem for a couple of weeks now where, at random, one VM (not always the same) becomes completely unresponsive.
> We find this out because our Icinga server complains that the host is down.
>
> Upon inspection, we find we can’t open a console to the VM, nor can we login.
>
> In the oVirt engine, the VM still appears to be “up”. The only weird thing is that RAM usage shows 0% and CPU usage shows 100% or 75%, depending on the number of cores.
> The only way to recover is to force a shutdown of the VM by issuing shutdown twice from the engine.
>
> Could you please help me to start debugging this?
> I can provide any logs, but I’m not sure which ones, because I couldn’t see anything with ERROR in the vdsm logs on the host.
I would inspect this VM on the host when it happens.
What is the vdsm CPU usage? What is the qemu process (for this VM) CPU usage?
vdsm CPU usage fluctuates, going up to around 15%.
The qemu process CPU usage for the VM was 0, except for one of its threads, which was “stuck” at 100%; the rest were idle.
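In case it is useful, per-thread CPU usage of a qemu process can be inspected with something like the following (illustrative only; 15241 is the qemu PID from the ps output further down, adjust as needed):

  # live per-thread view of the qemu process (press 'q' to quit)
  top -H -p 15241

  # or a one-shot listing of the threads and their CPU share
  ps -L -p 15241 -o tid,pcpu,stat,comm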
An strace of this qemu process (all threads) or a core dump can help the qemu developers understand this issue.
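For example, something along these lines (only a sketch; replace 15241 with the actual qemu PID and stop the trace with Ctrl-C after a minute or so):

  # trace all threads of the qemu process, with timestamps, into a file
  strace -f -tt -p 15241 -o /tmp/qemu-15241.strace

  # or take a core dump of the running process without killing it (gcore ships with gdb)
  gcore -o /tmp/qemu-15241 15241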
I attached an strace of the process for:
qemu 15241 10.6 0.4 4742904 1934988 ? Sl Mar23 131:41 /usr/libexec/qemu-kvm
-name test-ubuntu-uni-lu -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu
SandyBridge -m size=4194304k,slots=16,maxmem=4294967296k -realtime mlock=off -smp
4,maxcpus=64,sockets=16,cores=4,threads=1 -numa node,nodeid=0,cpus=0-3,mem=4096 -uuid
754871ec-0339-4a65-b490-6a766aaea537 -smbios type=1,manufacturer=oVirt,product=oVirt
Node,version=7-2.1511.el7.centos.2.10,serial=4C4C4544-0048-4610-8052-B4C04F575831,uuid=754871ec-0339-4a65-b490-6a766aaea537
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-test-ubuntu-uni-lu/monitor.sock,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=2016-03-23T22:06:01,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet
-no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device
virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive
if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive
file=/rhev/data-center/00000002-0002-0002-0002-0000000003d5/8253a89b-651e-4ff4-865b-57adef05d383/images/9d60ae41-bf17-48b4-b0e6-29625b248718/47a6916c-c902-4ea3-8dfb-a3240d7d9515,if=none,id=drive-virtio-disk0,format=qcow2,serial=9d60ae41-bf17-48b4-b0e6-29625b248718,cache=none,werror=stop,rerror=stop,aio=threads
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=108,id=hostnet0,vhost=on,vhostfd=109 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e5:12:0f,bus=pci.0,addr=0x3,bootindex=2
-chardev
socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.com.redhat.rhevm.vdsm,server,nowait
-device
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm
-chardev
socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.org.qemu.guest_agent.0,server,nowait
-device
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0
-device usb-tablet,id=input0 -vnc 10.79.2.2:76,password -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
http://paste.fedoraproject.org/344756/84131214
This is CentOS 7.2 with the latest patches and the latest oVirt 3.6.4.
Thank you for any help / pointers.
Could it be memory ballooning?
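(For what it is worth, the balloon state of that VM can also be checked on the host with something like the command below; virsh in read-only mode should be enough, and the domain name is taken from the qemu command line above:)

  # read-only query of the balloon / memory stats for this domain
  # "actual" is the current balloon size in KiB, "rss" the resident memory of the qemu process
  virsh -r dommemstat test-ubuntu-uni-lu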
Best,
>
> The host is running
>
> OS Version: RHEL - 7 - 1.1503.el7.centos.2.8
> Kernel Version: 3.10.0 - 229.14.1.el7.x86_64
> KVM Version: 2.1.2 - 23.el7_1.8.1
> LIBVIRT Version: libvirt-1.2.8-16.el7_1.4
> VDSM Version: vdsm-4.16.26-0.el7.centos
> SPICE Version: 0.12.4 - 9.el7_1.3
> GlusterFS Version: glusterfs-3.7.5-1.el7
You are running old versions and are missing a lot of fixes. Nothing is specific to your problem, but this lowers the chance of getting a working system.
It would be nice if you could upgrade to ovirt-3.6 and report whether it makes any difference, or at least to the latest ovirt-3.5.
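Roughly, the upgrade would look something like this (just a sketch; it assumes the usual oVirt 3.6 release RPM location and that the host is put into maintenance from the engine first):

  # on the host, after putting it into maintenance from the engine
  yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release36.rpm
  yum update

  # then restart vdsmd (or reboot) and activate the host again from the engine
  systemctl restart vdsmd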
> We use a locally exported gluster volume as the storage domain (i.e., the storage is on the same machine, exposed via gluster). No replica.
> We run around 50 VMs on that host.
Why use gluster for this? Do you plan to add more gluster servers in the future?
Nir