Interestingly enough, this happens only to a Windows VM with two disks
(60GB and 500GB). It is reproducible every time a snapshot is taken.
If I snapshot only one disk, regardless of whether it is the small or the
big one, the snapshot completes fairly quickly and the VM is not rendered
unresponsive.
It seems to be a race condition in libvirtd.
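
For the record, this is roughly how I plan to capture the data Yaniv
suggested the next time I reproduce it (just a sketch: log_filters and
log_outputs are the standard libvirtd.conf options, while the output paths
and the domain name are from my own setup):

  # /etc/libvirt/libvirtd.conf - enable verbose qemu/libvirt logging
  log_filters="1:qemu 1:libvirt 3:json 3:event"
  log_outputs="1:file:/var/log/libvirt/libvirtd-debug.log"

  # restart libvirtd, start the snapshot, then, while the VM is stuck,
  # check the running libvirt job and dump the libvirtd stacks
  # (gstack ships with the gdb package)
  systemctl restart libvirtd
  virsh -r domjobinfo Data-Server
  gstack $(pidof libvirtd) > /tmp/libvirtd-gstack-$(date +%s).txt
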
Alex
On Wed, Apr 11, 2018 at 6:57 PM, Alex K <rightkicktech(a)gmail.com> wrote:
Thanx Yaniv,
I will try to follow your advice.
Alex
On Tue, Apr 3, 2018 at 8:54 PM, Yaniv Kaul <ykaul(a)redhat.com> wrote:
>
>
> On Sun, Apr 1, 2018 at 9:04 PM, Alex K <rightkicktech(a)gmail.com> wrote:
>
>> Checking further the logs I see this error given from libvirt of the
>> host that has the guest VM running:
>>
>> Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.298+0000: 1862: warning
>> : qemuDomainObjBeginJobInternal:3847 : Cannot start job (query, none) for
>> domain Data-Server; current job is (async nested, snapshot) owned by (1863
>> remoteDispatchDomainSnapshotCreateXML, 1863
>> remoteDispatchDomainSnapshotCreateXML) for (39s, 41s)
>> Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.299+0000: 1862: error :
>> qemuDomainObjBeginJobInternal:3859 : Timed out during operation: cannot
>> acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML)
>> Apr 1 17:53:57 v0 journal: vdsm Executor WARN Worker blocked: <Worker
>> name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'frozen': False,
>> u'vmID': u'6bdb3d02-cc33-4019-97cd-7447aecc1e02', u'snapDrives':
>> [{u'baseVolumeID': u'adfabed5-451b-4f46-b22a-45f720b06110', u'domainID':
>> u'2c4b8d45-3d05-4619-9a36-1ecd199d3056', u'volumeID':
>> u'cc0d0772-924c-46db-8ad6-a2b0897c313f', u'imageID':
>> u'7eeadedc-f247-4a31-840d-4de622bf3541'}, {u'baseVolumeID':
>> u'0d960c12-3bcf-4918-896d-bd8e68b5278b', u'domainID':
>> u'2c4b8d45-3d05-4619-9a36-1ecd199d3056', u'volumeID':
>> u'590a6bdd-a9e2-444e-87bc-721c5f8586eb', u'imageID':
>> u'da0e4111-6bbe-43cb-bf59-db5fbf5c3e38'}]}, 'jsonrpc': '2.0', 'method':
>> u'VM.snapshot', 'id': u'be7912e6-ba3d-4357-8ba1-abe40825acf1'} at
>> 0x3bf84d0> timeout=60, duration=60 at 0x3bf8050> task#=416 at 0x20d7f90>
>>
>>
>> Immediately after the above, the engine reports the VM as unresponsive.
>>
>
> If it's reproducible, it'd be great if you could:
> 1. Run libvirt with debug logs
> 2. Get a dump of libvirt when this happens (with gstack for example).
> 3. File a bug on libvirt.
>
> Thanks,
> Y.
>
>
>> The SPM host does not log any issues.
>>
>> At the same time, the 3 hosts are fairly idle with only one running
>> guest VM. The gluster traffic is dedicated to a separate Gbit NIC of the
>> servers (dedicated VLAN) while the management network is on a separate
>> network. The gluster traffic does not exceed 40 Mbps during the snapshot
>> operation. I can't understand why libvirt is logging a timeout.
>>
>> Alex
>>
>> On Thu, Mar 29, 2018 at 9:42 PM, Alex K <rightkicktech(a)gmail.com> wrote:
>>
>>> Any idea with this issue?
>>> I am still trying to understand what may be causing this issue.
>>>
>>> Many thanx for any assistance.
>>>
>>> Alex
>>>
>>> On Wed, Mar 28, 2018 at 10:06 AM, Yedidyah Bar David <didi(a)redhat.com>
>>> wrote:
>>>
>>>> On Tue, Mar 27, 2018 at 3:38 PM, Sandro Bonazzola <sbonazzo(a)redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> 2018-03-27 14:34 GMT+02:00 Alex K <rightkicktech(a)gmail.com>:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Any idea on the below?
>>>>>>
>>>>>> I am using oVirt Guest Tools 4.2-1.el7.centos for the VM.
>>>>>> The Windows 2016 server VM (which is the one with the relatively big
>>>>>> 500 GB disk) is consistently rendered unresponsive when trying to
>>>>>> take a snapshot.
>>>>>> I may provide any other additional logs if needed.
>>>>>>
>>>>>
>>>>> Adding some people to the thread
>>>>>
>>>>
>>>> Adding more people for this part.
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Sun, Mar 25, 2018 at 7:30 PM, Alex K <rightkicktech(a)gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I am facing frequently the following issue:
>>>>>>>
>>>>>>> On some large VMs (Windows 2016 with two disk drives, 60GB and
>>>>>>> 500GB), when attempting to create a snapshot of the VM, the VM
>>>>>>> becomes unresponsive.
>>>>>>>
>>>>>>> The errors that I managed to collect were:
>>>>>>>
>>>>>>> vdsm error at host hosting the VM:
>>>>>>> 2018-03-25 14:40:13,442+0000 WARN (vdsm.Scheduler) [Executor]
>>>>>>> Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask
>>>>>>> {'params': {u'frozen': False, u'vmID':
>>>>>>> u'a5c761a2-41cd-40c2-b65f-f3819293e8a4', u'snapDrives':
>>>>>>> [{u'baseVolumeID': u'2a33e585-ece8-4f4d-b45d-5ecc9239200e',
>>>>>>> u'domainID': u'888e3aae-f49f-42f7-a7fa-76700befabea', u'volumeID':
>>>>>>> u'e9a01ebd-83dd-40c3-8c83-5302b0d15e04', u'imageID':
>>>>>>> u'c75b8e93-3067-4472-bf24-dafada224e4d'}, {u'baseVolumeID':
>>>>>>> u'3fb2278c-1b0d-4677-a529-99084e4b08af', u'domainID':
>>>>>>> u'888e3aae-f49f-42f7-a7fa-76700befabea', u'volumeID':
>>>>>>> u'78e6b6b1-2406-4393-8d92-831a6d4f1337', u'imageID':
>>>>>>> u'd4223744-bf5d-427b-bec2-f14b9bc2ef81'}]}, 'jsonrpc': '2.0',
>>>>>>> 'method': u'VM.snapshot', 'id':
>>>>>>> u'89555c87-9701-4260-9952-789965261e65'} at 0x7fca4004cc90>
>>>>>>> timeout=60, duration=60 at 0x39d8210> task#=155842 at 0x2240e10>
>>>>>>> (executor:351)
>>>>>>> 2018-03-25 14:40:15,261+0000 INFO (jsonrpc/3)
>>>>>>> [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 1) in
>>>>>>> 0.01 seconds (__init__:539)
>>>>>>> 2018-03-25 14:40:17,471+0000 WARN (jsonrpc/5) [virt.vm]
>>>>>>> (vmId='a5c761a2-41cd-40c2-b65f-f3819293e8a4') monitor became
>>>>>>> unresponsive (command timeout, age=67.9100000001) (vm:5132)
>>>>>>>
>>>>>>> engine.log:
>>>>>>> 2018-03-25 14:40:19,875Z WARN
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (DefaultQuartzScheduler2) [1d737df7] EVENT_ID: VM_NOT_RESPONDING(126),
>>>>>>> Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event
>>>>>>> ID: -1, Message: VM Data-Server is not responding.
>>>>>>>
>>>>>>> 2018-03-25 14:42:13,708Z ERROR
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa]
>>>>>>> EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null,
>>>>>>> Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VDSM
>>>>>>> v1.cluster command SnapshotVDS failed: Message timeout which can be
>>>>>>> caused by communication issues
>>>>>>> 2018-03-25 14:42:13,708Z ERROR
>>>>>>> [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand]
>>>>>>> (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa]
>>>>>>> Command 'SnapshotVDSCommand(HostName = v1.cluster,
>>>>>>> SnapshotVDSCommandParameters:{runAsync='true',
>>>>>>> hostId='a713d988-ee03-4ff0-a0cd-dc4cde1507f4',
>>>>>>> vmId='a5c761a2-41cd-40c2-b65f-f3819293e8a4'})' execution failed:
>>>>>>> VDSGenericException: VDSNetworkException: Message timeout which can
>>>>>>> be caused by communication issues
>>>>>>> 2018-03-25 14:42:13,708Z WARN
>>>>>>> [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand]
>>>>>>> (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa]
>>>>>>> Could not perform live snapshot due to error, VM will still be
>>>>>>> configured to the new created snapshot: EngineException:
>>>>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>>>>> VDSGenericException: VDSNetworkException: Message timeout which can
>>>>>>> be caused by communication issues (Failed with error
>>>>>>> VDS_NETWORK_ERROR and code 5022)
>>>>>>> 2018-03-25 14:42:13,708Z WARN
>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsManager]
>>>>>>> (org.ovirt.thread.pool-6-thread-15)
>>>>>>> [17789048-009a-454b-b8ad-2c72c7cd37aa] Host 'v1.cluster' is not
>>>>>>> responding. It will stay in Connecting state for a grace period of
>>>>>>> 61 seconds and after that an attempt to fence the host will be
>>>>>>> issued.
>>>>>>> 2018-03-25 14:42:13,725Z WARN
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (org.ovirt.thread.pool-6-thread-15)
>>>>>>> [17789048-009a-454b-b8ad-2c72c7cd37aa] EVENT_ID:
>>>>>>> VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null,
>>>>>>> Call Stack: null, Custom ID: null, Custom Event ID: -1, Message:
>>>>>>> Host v1.cluster is not responding. It will stay in Connecting state
>>>>>>> for a grace period of 61 seconds and after that an attempt to fence
>>>>>>> the host will be issued.
>>>>>>> 2018-03-25 14:42:13,751Z WARN
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa]
>>>>>>> EVENT_ID: USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE(170),
>>>>>>> Correlation ID: 17789048-009a-454b-b8ad-2c72c7cd37aa, Job ID:
>>>>>>> 16e48c28-a8c7-4841-bd81-1f2d370f345d, Call Stack:
>>>>>>> org.ovirt.engine.core.common.errors.EngineException: EngineException:
>>>>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>>>>> VDSGenericException: VDSNetworkException: Message timeout which can
>>>>>>> be caused by communication issues (Failed with error
>>>>>>> VDS_NETWORK_ERROR and code 5022)
>>>>>>> 2018-03-25 14:42:14,372Z ERROR
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (DefaultQuartzScheduler5) [] EVENT_ID:
>>>>>>> USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Correlation ID:
>>>>>>> 17789048-009a-454b-b8ad-2c72c7cd37aa, Job ID:
>>>>>>> 16e48c28-a8c7-4841-bd81-1f2d370f345d, Call Stack:
>>>>>>> org.ovirt.engine.core.common.errors.EngineException: EngineException:
>>>>>>> org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
>>>>>>> VDSGenericException: VDSNetworkException: Message timeout which can
>>>>>>> be caused by communication issues (Failed with error
>>>>>>> VDS_NETWORK_ERROR and code 5022)
>>>>>>> 2018-03-25 14:42:14,372Z WARN
>>>>>>> [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
>>>>>>> (DefaultQuartzScheduler5) [] Command 'CreateAllSnapshotsFromVm' id:
>>>>>>> 'bad4f5be-5306-413f-a86a-513b3cfd3c66' end method execution failed,
>>>>>>> as the command isn't marked for endAction() retries silently
>>>>>>> ignoring
>>>>>>> 2018-03-25 14:42:15,951Z WARN
>>>>>>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>>> (DefaultQuartzScheduler5) [5017c163] EVENT_ID:
>>>>>>> VDS_NO_SELINUX_ENFORCEMENT(25), Correlation ID: null, Call Stack:
>>>>>>> null, Custom ID: null, Custom Event ID: -1, Message: Host v1.cluster
>>>>>>> does not enforce SELinux. Current status: DISABLED
>>>>>>> 2018-03-25 14:42:15,951Z WARN
>>>>>>> [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler5)
>>>>>>> [5017c163] Host 'v1.cluster' is running with SELinux in 'DISABLED'
>>>>>>> mode
>>>>>>>
>>>>>>> As soon as the VM becomes unresponsive, the VM console that was
>>>>>>> already open freezes. I can resume the VM only by powering it off
>>>>>>> and on.
>>>>>>>
>>>>>>> I am using oVirt 4.1.9 with 3 nodes and a self-hosted engine. I am
>>>>>>> running mostly Windows 10 and Windows 2016 server VMs. I have
>>>>>>> installed the latest guest agents from:
>>>>>>>
>>>>>>> http://resources.ovirt.org/pub/ovirt-4.2/iso/oVirt-toolsSetup/4.2-1.el7.c...
>>>>>>>
>>>>>>> On the screen where one takes a snapshot I get a warning saying
>>>>>>> "Could not detect guest agent on the VM. Note that without guest
>>>>>>> agent the data on the created snapshot may be inconsistent". See
>>>>>>> attached. I have verified that the oVirt guest tools are installed
>>>>>>> and shown under installed apps in the engine GUI. Also, oVirt Guest
>>>>>>> Agent (32 bit) and qemu-ga are listed as running in the Windows Task
>>>>>>> Manager. Shouldn't the oVirt guest agent be 64 bit on 64-bit Windows?
>>>>>>>
>>>>>>
>>>> No idea, but I do not think it's related to your problem of freezing
>>>> while taking a snapshot.
>>>>
>>>> This error was already discussed in the past, see e.g.:
>>>>
>>>> http://lists.ovirt.org/pipermail/users/2017-June/082577.html
>>>>
>>>> Best regards,
>>>> --
>>>> Didi
>>>>
>>>
>>>
>>
>> _______________________________________________
>> Users mailing list
>> Users(a)ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users