
Hi folks,

I am facing the following issue frequently: on some large VMs (Windows 2016 with two disk drives, 60GB and 500GB), attempting to create a snapshot renders the VM unresponsive.

The errors that I managed to collect were:

vdsm error at the host hosting the VM:

2018-03-25 14:40:13,442+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'a5c761a2-41cd-40c2-b65f-f3819293e8a4', u'snapDrives': [{u'baseVolumeID': u'2a33e585-ece8-4f4d-b45d-5ecc9239200e', u'domainID': u'888e3aae-f49f-42f7-a7fa-76700befabea', u'volumeID': u'e9a01ebd-83dd-40c3-8c83-5302b0d15e04', u'imageID': u'c75b8e93-3067-4472-bf24-dafada224e4d'}, {u'baseVolumeID': u'3fb2278c-1b0d-4677-a529-99084e4b08af', u'domainID': u'888e3aae-f49f-42f7-a7fa-76700befabea', u'volumeID': u'78e6b6b1-2406-4393-8d92-831a6d4f1337', u'imageID': u'd4223744-bf5d-427b-bec2-f14b9bc2ef81'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'89555c87-9701-4260-9952-789965261e65'} at 0x7fca4004cc90> timeout=60, duration=60 at 0x39d8210> task#=155842 at 0x2240e10> (executor:351)
2018-03-25 14:40:15,261+0000 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 1) in 0.01 seconds (__init__:539)
2018-03-25 14:40:17,471+0000 WARN (jsonrpc/5) [virt.vm] (vmId='a5c761a2-41cd-40c2-b65f-f3819293e8a4') monitor became unresponsive (command timeout, age=67.9100000001) (vm:5132)

engine.log:

2018-03-25 14:40:19,875Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [1d737df7] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VM Data-Server is not responding.
2018-03-25 14:42:13,708Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: VDSM v1.cluster command SnapshotVDS failed: Message timeout which can be caused by communication issues
2018-03-25 14:42:13,708Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa] Command 'SnapshotVDSCommand(HostName = v1.cluster, SnapshotVDSCommandParameters:{runAsync='true', hostId='a713d988-ee03-4ff0-a0cd-dc4cde1507f4', vmId='a5c761a2-41cd-40c2-b65f-f3819293e8a4'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-25 14:42:13,708Z WARN [org.ovirt.engine.core.bll.snapshots.CreateAllSnapshotsFromVmCommand] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa] Could not perform live snapshot due to error, VM will still be configured to the new created snapshot: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2018-03-25 14:42:13,708Z WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-15) [17789048-009a-454b-b8ad-2c72c7cd37aa] Host 'v1.cluster' is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.
2018-03-25 14:42:13,725Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-15) [17789048-009a-454b-b8ad-2c72c7cd37aa] EVENT_ID: VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host v1.cluster is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.
2018-03-25 14:42:13,751Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [17789048-009a-454b-b8ad-2c72c7cd37aa] EVENT_ID: USER_CREATE_LIVE_SNAPSHOT_FINISHED_FAILURE(170), Correlation ID: 17789048-009a-454b-b8ad-2c72c7cd37aa, Job ID: 16e48c28-a8c7-4841-bd81-1f2d370f345d, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2018-03-25 14:42:14,372Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Correlation ID: 17789048-009a-454b-b8ad-2c72c7cd37aa, Job ID: 16e48c28-a8c7-4841-bd81-1f2d370f345d, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)
2018-03-25 14:42:14,372Z WARN [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (DefaultQuartzScheduler5) [] Command 'CreateAllSnapshotsFromVm' id: 'bad4f5be-5306-413f-a86a-513b3cfd3c66' end method execution failed, as the command isn't marked for endAction() retries silently ignoring
2018-03-25 14:42:15,951Z WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [5017c163] EVENT_ID: VDS_NO_SELINUX_ENFORCEMENT(25), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host v1.cluster does not enforce SELinux. Current status: DISABLED
2018-03-25 14:42:15,951Z WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler5) [5017c163] Host 'v1.cluster' is running with SELinux in 'DISABLED' mode

As soon as the VM becomes unresponsive, the VM console that was already open freezes. I can recover the VM only by powering it off and on.

I am using oVirt 4.1.9 with 3 nodes and a self-hosted engine, running mostly Windows 10 and Windows 2016 server VMs. I have installed the latest guest agents from:

http://resources.ovirt.org/pub/ovirt-4.2/iso/oVirt-toolsSetup/4.2-1.el7.centos/

At the screen where one takes a snapshot I get a warning saying "Could not detect guest agent on the VM. Note that without guest agent the data on the created snapshot may be inconsistent". See attached. I have verified that the oVirt guest tools are installed and shown under installed apps in the engine GUI. Also, Ovirt Guest Agent (32 bit) and qemu-ga are listed as running in the Windows task manager. Shouldn't the oVirt guest agent be 64-bit on 64-bit Windows?

Any advice will be much appreciated.

Alex
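For reference, a quick way to pull these entries together on the affected host — a minimal sketch assuming the default vdsm log path /var/log/vdsm/vdsm.log:

    # collect the snapshot-related warnings around the incident
    grep -E 'Worker blocked|VM.snapshot|monitor became unresponsive' /var/log/vdsm/vdsm.log | tail -n 50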

Hi All,

Any idea on the below? I am using oVirt Guest Tools 4.2-1.el7.centos for the VM. The Windows 2016 server VM (which is the one with the relatively big disks: 500 GB) is consistently rendered unresponsive when trying to take a snapshot. I may provide any additional logs if needed.

Alex

On Sun, Mar 25, 2018 at 7:30 PM, Alex K <rightkicktech@gmail.com> wrote:
[...]

2018-03-27 14:34 GMT+02:00 Alex K <rightkicktech@gmail.com>:
Hi All,
Any idea on the below?
I am using oVirt Guest Tools 4.2-1.el7.centos for the VM. The Windows 2016 server VM (which is the one with the relatively big disks: 500 GB) is consistently rendered unresponsive when trying to take a snapshot. I may provide any additional logs if needed.
Adding some people to the thread
Alex
On Sun, Mar 25, 2018 at 7:30 PM, Alex K <rightkicktech@gmail.com> wrote:
[...]
--
Sandro Bonazzola
Associate Manager, Software Engineering, EMEA ENG Virtualization R&D
Red Hat EMEA <https://www.redhat.com/>
sbonazzo@redhat.com

On Tue, Mar 27, 2018 at 3:38 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
[...]
Adding more people for this part.
[...]
At the screen where one takes a snapshot I get a warning saying "Could not detect guest agent on the VM. Note that without guest agent the data on the created snapshot may be inconsistent". See attached. I have verified that the oVirt guest tools are installed and shown under installed apps in the engine GUI. Also, Ovirt Guest Agent (32 bit) and qemu-ga are listed as running in the Windows task manager. Shouldn't the oVirt guest agent be 64-bit on 64-bit Windows?
No idea, but I do not think it's related to your problem of freezing while taking a snapshot.

This error was already discussed in the past, see e.g.:
http://lists.ovirt.org/pipermail/users/2017-June/082577.html

Best regards,
--
Didi
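For reference, a quick host-side check of whether qemu-ga is actually reachable for the VM — a sketch assuming the domain name Data-Server taken from the logs above (on an oVirt host, virsh may ask for SASL credentials):

    # ping the QEMU guest agent inside the VM; a responding agent returns {"return":{}}
    virsh qemu-agent-command Data-Server '{"execute":"guest-ping"}'
    # confirm the guest-agent channels are defined for the domain
    virsh dumpxml Data-Server | grep -A2 '<channel'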

Any idea with this issue? I am still trying to understand what may be causing it.

Many thanks for any assistance.

Alex

On Wed, Mar 28, 2018 at 10:06 AM, Yedidyah Bar David <didi@redhat.com> wrote:
[...]

Checking the logs further, I see this error from libvirtd on the host that has the guest VM running:

Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.298+0000: 1862: warning : qemuDomainObjBeginJobInternal:3847 : Cannot start job (query, none) for domain Data-Server; current job is (async nested, snapshot) owned by (1863 remoteDispatchDomainSnapshotCreateXML, 1863 remoteDispatchDomainSnapshotCreateXML) for (39s, 41s)
Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.299+0000: 1862: error : qemuDomainObjBeginJobInternal:3859 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML)
Apr 1 17:53:57 v0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'6bdb3d02-cc33-4019-97cd-7447aecc1e02', u'snapDrives': [{u'baseVolumeID': u'adfabed5-451b-4f46-b22a-45f720b06110', u'domainID': u'2c4b8d45-3d05-4619-9a36-1ecd199d3056', u'volumeID': u'cc0d0772-924c-46db-8ad6-a2b0897c313f', u'imageID': u'7eeadedc-f247-4a31-840d-4de622bf3541'}, {u'baseVolumeID': u'0d960c12-3bcf-4918-896d-bd8e68b5278b', u'domainID': u'2c4b8d45-3d05-4619-9a36-1ecd199d3056', u'volumeID': u'590a6bdd-a9e2-444e-87bc-721c5f8586eb', u'imageID': u'da0e4111-6bbe-43cb-bf59-db5fbf5c3e38'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'be7912e6-ba3d-4357-8ba1-abe40825acf1'} at 0x3bf84d0> timeout=60, duration=60 at 0x3bf8050> task#=416 at 0x20d7f90>

Immediately after the above, the engine reports the VM as unresponsive. The SPM host does not log any issues.

At the same time, the 3 hosts are fairly idle, with only one running guest VM. The gluster traffic is dedicated to a separate Gbit NIC on the servers (dedicated VLAN) while the management network is on a separate network. The gluster traffic does not exceed 40 Mbps during the snapshot operation. I can't understand why libvirt is logging a timeout.

Alex

On Thu, Mar 29, 2018 at 9:42 PM, Alex K <rightkicktech@gmail.com> wrote:
[...]
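For reference, when libvirtd reports the state change lock held by remoteDispatchDomainSnapshotCreateXML, the job holding it can usually be inspected from the host — a sketch assuming the domain name from the log above:

    # list the domain's disk targets
    virsh domblklist Data-Server
    # show any active domain job (e.g. the snapshot job holding the lock)
    virsh domjobinfo Data-Server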

Hi All,

I would appreciate any ideas on what to check further for this issue.

Alex

On Sun, Apr 1, 2018 at 9:04 PM, Alex K <rightkicktech@gmail.com> wrote:
[...]

On Sun, Apr 1, 2018 at 9:04 PM, Alex K <rightkicktech@gmail.com> wrote:
Checking further the logs I see this error given from libvirt of the host that has the guest VM running:
Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.298+0000: 1862: warning : qemuDomainObjBeginJobInternal:3847 : Cannot start job (query, none) for domain Data-Server; current job is (async nested, snapshot) owned by (1863 remoteDispatchDomainSnapshotCreateXML, 1863 remoteDispatchDomainSnapshotCreateXML) for (39s, 41s)
Apr 1 17:53:41 v0 libvirtd: 2018-04-01 17:53:41.299+0000: 1862: error : qemuDomainObjBeginJobInternal:3859 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainSnapshotCreateXML)
Apr 1 17:53:57 v0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'frozen': False, u'vmID': u'6bdb3d02-cc33-4019-97cd-7447aecc1e02', u'snapDrives': [{u'baseVolumeID': u'adfabed5-451b-4f46-b22a-45f720b06110', u'domainID': u'2c4b8d45-3d05-4619-9a36-1ecd199d3056', u'volumeID': u'cc0d0772-924c-46db-8ad6-a2b0897c313f', u'imageID': u'7eeadedc-f247-4a31-840d-4de622bf3541'}, {u'baseVolumeID': u'0d960c12-3bcf-4918-896d-bd8e68b5278b', u'domainID': u'2c4b8d45-3d05-4619-9a36-1ecd199d3056', u'volumeID': u'590a6bdd-a9e2-444e-87bc-721c5f8586eb', u'imageID': u'da0e4111-6bbe-43cb-bf59-db5fbf5c3e38'}]}, 'jsonrpc': '2.0', 'method': u'VM.snapshot', 'id': u'be7912e6-ba3d-4357-8ba1-abe40825acf1'} at 0x3bf84d0> timeout=60, duration=60 at 0x3bf8050> task#=416 at 0x20d7f90>
Immediately after the above, the engine reports the VM as unresponsive.
If it's reproducible, it'd be great if you could:
1. Run libvirt with debug logs.
2. Get a dump of libvirt when this happens (with gstack, for example).
3. File a bug on libvirt.
Thanks, Y.
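For items 1 and 2, something along these lines should work — a sketch only; the debug log path and the EL7 file locations below are assumptions, so adjust them to your hosts:

# /etc/libvirt/libvirtd.conf -- enable debug logging, then restart the daemon:
#   log_level = 1
#   log_outputs = "1:file:/var/log/libvirt/libvirtd-debug.log"
systemctl restart libvirtd

# While the snapshot is hanging, dump the stacks of all libvirtd threads
# (gstack ships with the gdb package):
gstack $(pidof libvirtd) > /tmp/libvirtd-stacks.$(date +%s).txt

Note that log_level = 1 is very verbose, so revert it once you have captured a failure.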
The SPM host does not log any issues.
At the same time, the 3 hosts are fairly idle, with only one guest VM running. Gluster traffic uses a dedicated Gbit NIC on each server (on its own VLAN), while the management network is separate. The gluster traffic does not exceed 40 Mbps during the snapshot operation. I can't understand why libvirt is logging a timeout.
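(For reference, one way to watch that traffic during a snapshot — eth1 below is just a placeholder for the dedicated gluster NIC, not the actual interface name:)

# Per-interface throughput, one sample per second (sar is in the sysstat package):
sar -n DEV 1 | grep --line-buffered eth1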
Alex
On Thu, Mar 29, 2018 at 9:42 PM, Alex K <rightkicktech@gmail.com> wrote:
Any ideas on this issue? I am still trying to understand what may be causing it.
Many thanx for any assistance.
Alex
On Wed, Mar 28, 2018 at 10:06 AM, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Mar 27, 2018 at 3:38 PM, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
2018-03-27 14:34 GMT+02:00 Alex K <rightkicktech@gmail.com>:
Hi All,
Any idea on the below?
I am using oVirt Guest Tools 4.2-1.el7.centos for the VM. The Windows 2016 server VM (which is the one with the relatively big disk: 500 GB) is consistently rendered unresponsive when trying to take a snapshot. I may provide any additional logs if needed.
Adding some people to the thread
Adding more people for this part.
Alex

Thanx Yaniv, I will try to follow your advice.
Alex

Interestingly enough, this happens only to a Windows VM with two disks (60GB, 500GB). It is reproducible every time a snapshot is taken. If I snapshot only one disk, regardless of whether it is the small or the big one, the snapshot completes fairly quickly and the VM is not rendered unresponsive. It seems to be a race issue with libvirtd.
Alex
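P.S. If someone wants to try reproducing this outside of oVirt, the same two-disk snapshot can be attempted directly with virsh. A sketch only — the domain name Data-Server comes from the libvirt log earlier in the thread, but the sda/sdb disk targets are assumptions:

# Both disks in one snapshot (matching the failing case):
virsh snapshot-create-as Data-Server snap-both --disk-only --atomic \
    --diskspec sda,snapshot=external --diskspec sdb,snapshot=external

# One disk only (matching the case that works):
virsh snapshot-create-as Data-Server snap-one --disk-only --atomic \
    --diskspec sda,snapshot=external --diskspec sdb,snapshot=no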
participants (4)
- Alex K
- Sandro Bonazzola
- Yaniv Kaul
- Yedidyah Bar David