[ovirt-users] VM crashes during snapshot/clone/export and show only "?"

Soeren Malchow soeren.malchow at mcon.net
Wed May 27 12:00:06 EDT 2015


Dear all,

We are experiencing this problem over and over again with different VMs; the situation is as follows:

  *   We are backing up all VMs by iterating through them (see the attached Python file), which basically follows the recommendations
  *   This process runs well for a while, but at some point we hit a problem with a random VM (it is always a different machine): the backup process tries to remove all snapshots, and this is what appears in the log files
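Schematically, the backup loop boils down to "wait until the current snapshot operation has finished, then move on to the next VM". A simplified sketch of that pattern (not the attached script itself; `snapshot_done` is a placeholder for whatever status check the oVirt SDK provides):

```python
import time

def wait_until(predicate, timeout=300.0, interval=2.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True on success, False on timeout. The point is to be sure a
    snapshot-removal task has really finished before the next VM is touched.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

def backup_vms(vms, snapshot_done):
    """Process VMs strictly one at a time, never overlapping two operations.

    A VM whose snapshot removal does not complete in time is skipped and
    reported, instead of piling more jobs onto an already busy domain.
    """
    failed = []
    for vm in vms:
        if not wait_until(lambda: snapshot_done(vm)):
            failed.append(vm)
    return failed
```

The important property is that a timeout leads to skipping the VM rather than queuing another request against a domain that is already busy.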

Vdsm.log
<— snip —>
Thread-8246::DEBUG::2015-05-27 16:56:00,003::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 68 edom: 10 level: 2 message: Timed out during operation: cannot acquire state change lock
Thread-8246::ERROR::2015-05-27 16:56:00,016::vm::5761::vm.Vm::(queryBlockJobs) vmId=`84da8d5e-4a9d-4272-861a-a706ebce3160`::Error getting block job info
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/share/vdsm/virt/vm.py", line 697, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
VM Channels Listener::DEBUG::2015-05-27 16:56:00,561::vmchannels::96::vds::(_handle_timeouts) Timeout on fileno 55.
<— snip —>

Syslog / journalctl
<— snip —>
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net libvirtd[1751]: Cannot start job (modify, none) for domain fab-cms-app-01-live-fab-mcon-net; current job is (modify, none) owned by (1780, 0)
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net libvirtd[1751]: Timed out during operation: cannot acquire state change lock
May 27 16:55:15 mc-dc3ham-compute-04-live.mc.mcon.net vdsm[10478]: vdsm vm.Vm ERROR vmId=`84da8d5e-4a9d-4272-861a-a706ebce3160`::Error getting block job info
                                                                   Traceback (most recent call last):
                                                                     File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
                                                                       liveInfo = self._dom.blockJobInfo(drive.name, 0)
                                                                     File "/usr/share/vdsm/virt/vm.py", line 697, in f
                                                                       raise toe
                                                                   TimeoutError: Timed out during operation: cannot acquire state change lock
<— snip —>

The result is that the VM is non-operational: the qemu process is still running, and oVirt shows the VM with a “?”. This in itself would not be so bad if we could recover from it, but the only way we have found to resolve it is to put the hypervisor host into maintenance and then, with the last VM still showing the “?”, reboot the host; we found no other way to get this VM running again.
Even after actually killing the qemu process, there is no way to do anything with this VM.

I think I understand that the problem arises when two threads make requests against the same VM; however, in the last case the VM was not doing anything else as far as we can see.
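If two concurrent requests against the same domain really are the trigger, one thing we could try is serializing all libvirt-touching operations per VM inside the backup script. A minimal sketch (the lock registry is our own construct, not part of VDSM or the oVirt SDK):

```python
import threading
from collections import defaultdict

# One lock per VM id; any code path that issues a job for a VM must hold
# that VM's lock, so two of our threads can never race for the same domain.
_vm_locks = defaultdict(threading.Lock)
_registry_lock = threading.Lock()

def vm_lock(vm_id):
    # Guard the defaultdict itself: concurrent first access to the same
    # vm_id must not create two different Lock objects.
    with _registry_lock:
        return _vm_locks[vm_id]

def run_exclusive(vm_id, operation):
    """Run `operation` while holding the per-VM lock, return its result."""
    with vm_lock(vm_id):
        return operation()
```

This only serializes our own script, of course; it cannot prevent VDSM's internal sampling threads from contending for the same state change lock.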

I found a bug report on Launchpad that seems very similar (a little older, though), but it is libvirt-related:

https://bugs.launchpad.net/nova/+bug/1254872

These are the libvirt versions on the server:

libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64

VDSM version
vdsm-python-4.16.14-0.el7.noarch
vdsm-jsonrpc-4.16.14-0.el7.noarch
vdsm-yajsonrpc-4.16.14-0.el7.noarch
vdsm-4.16.14-0.el7.x86_64
vdsm-python-zombiereaper-4.16.14-0.el7.noarch
vdsm-cli-4.16.14-0.el7.noarch
vdsm-xmlrpc-4.16.14-0.el7.noarch

Kernel
3.10.0-229.4.2.el7.x86_64


Any idea where to go?





Regards
Soeren


From: Soeren Malchow <soeren.malchow at mcon.net>
Date: Monday 25 May 2015 22:27
To: "users at ovirt.org" <users at ovirt.org>
Subject: [ovirt-users] VM crashes during snapshot/clone/export and show only "?"

Dear all,

In version 3.5.2 on CentOS 7.1 we now have the problem that the backup script seems to trigger a crash of VMs. This is the second time; the first time I could only solve the problem by rebooting the hypervisor host and acknowledging “host has been rebooted”.

This problem happens while removing snapshots after snapshot -> clone -> export procedures.

The actual qemu process is still running after the following log output, but the VM is not responsive anymore; I can kill the process, but the problem remains.

Two questions for this:

How can I avoid this problem?
Is there a way to tell oVirt that the qemu process is gone and that the VM can be started again?
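On the first question, the only mitigation I can think of is to make sure no block job is still active on any drive before asking the engine to delete the next snapshot. Sketched generically (`list_block_jobs` is a placeholder for whatever check the SDK or libvirt exposes, e.g. `blockJobInfo` per drive):

```python
import time

def wait_for_idle_drives(list_block_jobs, drives, timeout=600.0, interval=5.0):
    """Block until `list_block_jobs(drive)` is empty for every drive,
    i.e. no live merge/copy is still running, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(not list_block_jobs(d) for d in drives):
            return
        time.sleep(interval)
    raise TimeoutError("block jobs still active after %.0fs" % timeout)
```

A raised TimeoutError would then abort the backup of that VM instead of stacking a snapshot delete on top of an unfinished block job.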

<— snip —>
May 25 22:03:47 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:03:47 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: internal error: End of file from monitor
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: IDMZ_MC_PUBLIC: port 3(vnet3) entered disabled state
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: device vnet3 left promiscuous mode
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kernel: IDMZ_MC_PUBLIC: port 3(vnet3) entered disabled state
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net kvm[22973]: 15 guests now active
May 25 22:03:48 mc-dc3ham-compute-02-live.mc.mcon.net systemd-machined[14412]: Machine qemu-mc-glpi-app-01-live.mc.mcon.net terminated.
May 25 22:04:11 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Cannot start job (modify, none) for domain mc-glpi-app-01-live.mc.mcon.net; current job is (modify, none) owned by (1534, 0)
May 25 22:04:11 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Timed out during operation: cannot acquire state change lock
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Cannot start job (modify, none) for domain mc-glpi-app-01-live.mc.mcon.net; current job is (modify, none) owned by (1534, 0)
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: Timed out during operation: cannot acquire state change lock
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Error getting block job info
                                                                  Traceback (most recent call last):
                                                                    File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
                                                                      liveInfo = self._dom.blockJobInfo(drive.name, 0)
                                                                    File "/usr/share/vdsm/virt/vm.py", line 697, in f
                                                                      raise toe
                                                                  TimeoutError: Timed out during operation: cannot acquire state change lock
<— snip —>


<— snip —>
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
                                                                  Traceback (most recent call last):
                                                                    File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
                                                                      statsFunction()
                                                                    File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
                                                                      retValue = self._function(*args, **kwargs)
                                                                    File "/usr/share/vdsm/virt/vm.py", line 346, in _sampleVmJobs
                                                                      return self._vm.queryBlockJobs()
                                                                    File "/usr/share/vdsm/virt/vm.py", line 5759, in queryBlockJobs
                                                                      liveInfo = self._dom.blockJobInfo(drive.name, 0)
                                                                  AttributeError: 'NoneType' object has no attribute 'blockJobInfo'
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net libvirtd[1386]: metadata not found: Requested metadata element is not present
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
                                                                  Traceback (most recent call last):
                                                                    File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
                                                                      statsFunction()
                                                                    File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
                                                                      retValue = self._function(*args, **kwargs)
                                                                    File "/usr/share/vdsm/virt/vm.py", line 338, in _sampleVcpuPinning
                                                                      vCpuInfos = self._vm._dom.vcpus()
                                                                  AttributeError: 'NoneType' object has no attribute 'vcpus'
May 25 22:04:18 mc-dc3ham-compute-02-live.mc.mcon.net vdsm[3230]: vdsm vm.Vm ERROR vmId=`598bdf61-2f2c-4569-9513-93043890f676`::Stats function failed: <AdvancedStatsFunction _samp
                                                                  Traceback (most recent call last):
                                                                    File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
                                                                      statsFunction()
                                                                    File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
                                                                      retValue = self._function(*args, **kwargs)
                                                                    File "/usr/share/vdsm/virt/vm.py", line 349, in _sampleCpuTune
                                                                      infos = self._vm._dom.schedulerParameters()
                                                                  AttributeError: 'NoneType' object has no attribute 'schedulerParameters'

<— snip —>
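The `'NoneType' object has no attribute ...` tracebacks above suggest that VDSM keeps calling into a domain handle that is already gone (`self._dom` is `None` after the monitor EOF). A defensive wrapper of roughly this shape would at least avoid the secondary errors (a sketch of the idea, not actual VDSM code):

```python
def safe_block_job_info(dom, drive_name):
    """Defensive wrapper around libvirt's blockJobInfo.

    When the underlying domain handle is already gone (dom is None) or the
    call fails (e.g. "cannot acquire state change lock"), return None so the
    caller can treat the VM as unresponsive instead of crashing the stats
    thread with an AttributeError.
    """
    if dom is None:
        return None
    try:
        return dom.blockJobInfo(drive_name, 0)
    except Exception:
        return None
```

The sampling functions for `vcpus()` and `schedulerParameters()` fail the same way, so the same guard would apply there.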
-------------- next part --------------
A non-text attachment was scrubbed...
Name: backup-queue.py
Type: text/x-python-script
Size: 5313 bytes
Desc: backup-queue.py
URL: <http://lists.ovirt.org/pipermail/users/attachments/20150527/f31a9028/attachment-0001.bin>

