On Sun, Apr 1, 2018 at 9:04 PM, Alex K <rightkicktech@gmail.com> wrote:Checking further the logs I see this error given from libvirt of the host that has the guest VM running:Immediately after above the engine reports the VM as unresponsive.
Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.298+0000: 1862: warning : qemuDomainObjBeginJobInternal:3847 : Cannot start job (query, none) for domain Data-Server; current job is (async nested, snapshot) owned by (1863 remoteDispatchDomainSnapshotCr eateXML, 1863 remoteDispatchDomainSnapshotCr eateXML) for (39s, 41s)
Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.299+0000: 1862: error : qemuDomainObjBeginJobInternal:3859 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCr eateXML)
Apr 1 17:53:57 v0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'6bdb3d02-cc33-4019-97cd-7447aecc1e02', u'snapDrives': [{u'baseVolumeID': u'adfabed5-451b-4f46-b22a-45f7 20b06110', u'domainID': u'2c4b8d45-3d05-4619-9a36-1ecd 199d3056', u'volumeID': u'cc0d0772-924c-46db-8ad6-a2b0 897c313f', u'imageID': u'7eeadedc-f247-4a31-840d-4de6 22bf3541'}, {u'baseVolumeID': u'0d960c12-3bcf-4918-896d-bd8e 68b5278b', u'domainID': u'2c4b8d45-3d05-4619-9a36-1ecd 199d3056', u'volumeID': u'590a6bdd-a9e2-444e-87bc-721c 5f8586eb', u'imageID': u'da0e4111-6bbe-43cb-bf59-db5f bf5c3e38'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'be7912e6-ba3d-4357-8ba1-abe4 0825acf1'} at 0x3bf84d0> timeout=60, duration=60 at 0x3bf8050> task#=416 at 0x20d7f90> If it's reproducible, it'd be great if you could:1. Run libvirt with debug logs2. Get a dump of libvirt when this happens (with gstack for example).3. File a bug on libvirt.Thanks,Y.______________________________The SPM host does not log any issues.In the same time, the 3 hosts are fairly idle with only one running guest VM. The gluster traffic is dedicated to a separate Gbit NIC of the servers (dedicated VLAN) while the management network is on a separate network. The gluster traffic does not exceed 40 Mbps during the snapshot operation. Can't understand why libvirt is logging timeout.AlexOn Thu, Mar 29, 2018 at 9:42 PM, Alex K <rightkicktech@gmail.com> wrote:AlexMany thanx for any assistance.Any idea with this issue?I am still trying to understand what may be causing this issue.On Wed, Mar 28, 2018 at 10:06 AM, Yedidyah Bar David <didi@redhat.com> wrote:On Tue, Mar 27, 2018 at 3:38 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:--2018-03-27 14:34 GMT+02:00 Alex K <rightkicktech@gmail.com>:The Window 2016 server VM (which it the one with the relatively big disks: 500 GB) it is consistently rendered unresponsive when trying to get a snapshot.I am using oVirt Guest Tools 4.2-1.el7.centos for the VM.Hi All,Any idea on the below?I amy provide any other additional logs if needed.Adding some people to the threadAdding more people for this part.AlexOn Sun, Mar 25, 2018 at 7:30 PM, Alex K <rightkicktech@gmail.com> wrote:The errors that I managed to collect were:On some large VMs (Windows 2016 with two disk drives, 60GB and 500GB) when attempting to create a snapshot of the VM, the VM becomes unresponsive.Hi folks,I am facing frequently the following issue:
vdsm error at host hosting the VM:
2018-03-25 14:40:13,442+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'a5c761a2-41cd-40c2-b65f-f3819293e8a4', u'snapDrives': [{u'baseVolumeID': u'2a33e585-ece8-4f4d-b45d-5ecc 9239200e', u'domainID': u'888e3aae-f49f-42f7-a7fa-7670 0befabea', u'volumeID': u'e9a01ebd-83dd-40c3-8c83-5302 b0d15e04', u'imageID': u'c75b8e93-3067-4472-bf24-dafa da224e4d'}, {u'baseVolumeID': u'3fb2278c-1b0d-4677-a529-9908 4e4b08af', u'domainID': u'888e3aae-f49f-42f7-a7fa-7670 0befabea', u'volumeID': u'78e6b6b1-2406-4393-8d92-831a 6d4f1337', u'imageID': u'd4223744-bf5d-427b-bec2-f14b 9bc2ef81'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'89555c87-9701-4260-9952-7899 65261e65'} at 0x7fca4004cc90> timeout=60, duration=60 at 0x39d8210> task#=155842 at 0x2240e10> (executor:351)
2018-03-25 14:40:15,261+0000 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 1) in 0.01 seconds (__init__:539)
2018-03-25 14:40:17,471+0000 WARN (jsonrpc/5) [virt.vm] (vmId='a5c761a2-41cd-40c2-b65f-f3819293e8a4') monitor became unresponsive (command timeout, age=67.9100000001) (vm:5132)
engine.log:
2018-03-25 14:40:19,875Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLo gDirector] (DefaultQuartzScheduler2) [1d737df7] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM Data-Server is not responding.
2018-03-25 14:42:13,708Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLo gDirector] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c 7cd37aa] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10, 802), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VDSM v1.cluster command SnapshotVDS failed: Message timeout which can be caused by communication issues
2018-03-25 14:42:13,708Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSComman d] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c 7cd37aa] Command 'SnapshotVDSCommand(HostName = v1.cluster, SnapshotVDSCommandParameters:{ runAsync='true', hostId='a713d988-ee03-4ff0-a0c d-dc4cde1507f4', vmId='a5c761a2-41cd-40c2-b65f- f3819293e8a4'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-25 14:42:13,708Z WARN [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromV mCommand] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c 7cd37aa] Could not perform live snapshot due to error, VM will still be configured to the new created snapshot: EngineException: org.ovirt.engine.core.vdsbroke r.vdsbroker.VDSNetworkExceptio n: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2018-03-25 14:42:13,708Z WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-threa d-15) [17789048-009a-454b-b8ad-2c72c 7cd37aa] Host 'v1.cluster' is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.
2018-03-25 14:42:13,725Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLo gDirector] (org.ovirt.thread.pool-6-threa d-15) [17789048-009a-454b-b8ad-2c72c 7cd37aa] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNEC TING(9,008), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host v1.cluster is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.
2018-03-25 14:42:13,751Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLo gDirector] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c 7cd37aa] EVENT_ID: USER_CREATE_LIVE_SNAPSHOT_FINI SHED_FAILURE(170), Correlation ID: 17789048-009a-454b-b8ad-2c72c7 cd37aa, Job ID: 16e48c28-a8c7-4841-bd81-1f2d37 0f345d, Call Stack: org.ovirt.engine.core.common.e rrors.EngineException: EngineException: org.ovirt.engine.core.vdsbroke r.vdsbroker.VDSNetworkExceptio n: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2018-03-25 14:42:14,372Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLo gDirector] (DefaultQuartzScheduler5) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_ FAILURE(69), Correlation ID: 17789048-009a-454b-b8ad-2c72c7 cd37aa, Job ID: 16e48c28-a8c7-4841-bd81-1f2d37 0f345d, Call Stack: org.ovirt.engine.core.common.e rrors.EngineException: EngineException: org.ovirt.engine.core.vdsbroke r.vdsbroker.VDSNetworkExceptio n: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2018-03-25 14:42:14,372Z WARN [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionC allback] (DefaultQuartzScheduler5) [] Command 'CreateAllSnapshotsFromVm' id: 'bad4f5be-5306-413f-a86a-513b3 cfd3c66' end method execution failed, as the command isn't marked for endAction() retries silently ignoring
2018-03-25 14:42:15,951Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLo gDirector] (DefaultQuartzScheduler5) [5017c163] EVENT_ID: VDS_NO_SELINUX_ENFORCEMENT(25) , Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host v1.cluster does not enforce SELinux. Current status: DISABLED
2018-03-25 14:42:15,951Z WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler5) [5017c163] Host 'v1.cluster' is running with SELinux in 'DISABLED' mode
As soon as the VM is unresponsive, the VM console that was already open freezes. I can resume the VM only by powering off and on.
I am using ovirt 4.1.9 with 3 nodes and self-hosted engine. I am running mostly Windows 10 and Windows 2016 server VMs. I have installed latest guest agents from:
http://resources.ovirt.org/pub/ovirt-4.2/iso/oVirt-toolsSetu p/4.2-1.el7.centos/ At the screen where one takes a snapshot I get a warning saying "Could not detect guest agent on the VM. Note that without guest agent the data on the created snapshot may be inconsistent". See attached. I have verified that ovirt guest tools are installed and shown at installed apps at engine GUI. Also Ovirt Guest Agent (32 bit) and qemu-ga are listed as running at the windows tasks manager. Shouldn't ovirt guest agent be 64 bit on Windows 64 bit?No idea, but I do not think it's related to your problem of freezing while taking a snapshot.This error was already discussed in the past, see e.g.:
http://lists.ovirt.org/pipermail/users/2017-June/082577.html Best regards,Didi_________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users