On Thu, Feb 2, 2017 at 1:11 AM, Gianluca Cecchi
<gianluca.cecchi(a)gmail.com> wrote:
On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi
<gianluca.cecchi(a)gmail.com>
wrote:
>
>
> OK. In the meantime I have applied your suggested config and restarted
> the 2 nodes.
> Let me test and see if I find any problems, also running some I/O tests.
> Thanks in the meantime,
> Gianluca
A quick test, without much success.
Inside the guest I ran this loop:
while true
do
time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
sleep 5
done
I don't think this test is related to the issues you reported earlier.
What you are testing here is how fast oVirt thin provisioning can extend your
disk when writing zeros. We don't handle that very well; each extend
needs about 4-6 seconds to complete before we refresh the LV
on the host running the VM, and this is the *best* case. In the worst
case it can take much longer.
Also, what you tested here is how fast you can write to your VM's buffer cache,
since you are not using direct I/O.
A better way to perform this test is:
time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile oflag=direct
This will give you the time to actually write data to storage.
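If you want to repeat your loop with direct I/O, a minimal variant would be
something like this (same file path as in your example; since the writes bypass
the guest page cache, the numbers reflect actual storage throughput):

while true
do
    # write 1 GiB with direct I/O, bypassing the guest page cache
    time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile oflag=direct
    sleep 5
done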
If you have a real issue with VMs pausing during writes when the VM disk has to
be extended, you can enlarge the extend chunk, which is 1GiB by default.
To use chunks of 2GiB, set:
[irs]
volume_utilization_percent = 50
volume_utilization_chunk_mb = 2048
This will extend the drive when free space is less than 1024MiB
(volume_utilization_chunk_mb * (100 - volume_utilization_percent) / 100)
If this is not enough, you can also use a lower volume_utilization_percent;
for example, this will extend the disk in 2GiB chunks when free space
is below 1536MiB:
[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048
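To make the arithmetic explicit, the free space threshold for this
configuration works out as:

    threshold = volume_utilization_chunk_mb * (100 - volume_utilization_percent) / 100
              = 2048 * (100 - 25) / 100
              = 1536 MiB

and for the previous example (50%): 2048 * (100 - 50) / 100 = 1024 MiB.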
BTW: my home directory is inside the / filesystem on the guest, which has
enough space to accommodate the 1 GiB written by the dd command:
[g.cecchi@ol65 ~]$ df -h /home/g.cecchi/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_ol65-lv_root
20G 4.9G 14G 27% /
[g.cecchi@ol65 ~]$
After about 7 rounds I get this in the messages log of the host where the VM is
running:
Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb 1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
This is interesting. We have seen these messages before, but could never
detect the flow causing them. Are you sure you see this each time you extend
your disk?
If you can reproduce this, please file a bug.
Nothing on the other host.
In the web admin Events pane:
Feb 1, 2017 11:31:44 PM VM ol65 has been paused due to no Storage space error.
Feb 1, 2017 11:31:44 PM VM ol65 has been paused.
I stopped the dd loop and after a few seconds:
Feb 1, 2017 11:32:32 PM VM ol65 has recovered from paused back to up
Multipath status for my device:
3600a0b8000299aa80000d08b55014119 dm-2 IBM ,1814 FAStT
size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdj 8:144 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48 active undef running
  `- 2:0:0:3 sdk 8:160 active undef running
In engine.log
2017-02-01 23:22:01,449 INFO
[org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-15)
[530aee87] Running command: CreateUserSessionCommand internal: false.
2017-02-01 23:22:04,011 INFO
[org.ovirt.engine.docs.utils.servlet.ContextSensitiveHelpMappingServlet]
(default task-12) [] Context-sensitive help is not installed. Manual
directory doesn't exist: /usr/share/ovirt-engine/manual
2017-02-01 23:31:43,936 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(ForkJoinPool-1-worker-10) [10c15e39] VM
'932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Up' -->
'Paused'
2017-02-01 23:31:44,087 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM ol65 has been paused.
2017-02-01 23:31:44,227 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: VM ol65 has been paused due to no
Storage space error.
2017-02-01 23:32:32,100 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(DefaultQuartzScheduler5) [1c02a371] VM
'932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Paused' -->
'Up'
2017-02-01 23:32:32,259 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler5) [1c02a371] Correlation ID: null, Call Stack: null,
Custom Event ID: -1, Message: VM ol65 has recovered from paused back to up.
In vdsm.log of the host where the VM is running I see this:
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,729::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail)
/usr/bin/taskset --cpu-list 0-7 dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox
iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,764::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail)
SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0061866 s, 166 MB/s\n'; <rc> = 0
periodic/8::INFO::2017-02-01 23:31:39,699::vm::1028::virt.vm::(extendDrivesIfNeeded)
vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume
66615920-619f-4f52-ad5a-3f062c054094 on domain 922b5269-ab56-4c4d-838f-49d33427e2ab
(apparent: 12884901888, capacity: 96636764160, allocated: 12355698688, physical: 12884901888)
periodic/8::DEBUG::2017-02-01 23:31:39,699::vm::1093::virt.vm::(__extendDriveVolume)
vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting an extension for the volume:
{'newSize': 13958643712, 'domainID': u'922b5269-ab56-4c4d-838f-49d33427e2ab', 'name': u'vda',
'volumeID': u'66615920-619f-4f52-ad5a-3f062c054094', 'imageID': u'32b0e94d-0c8e-4d3d-a4eb-448885297e93',
'internal': False, 'poolID': u'588237b8-0031-02f6-035d-000000000136'}
periodic/8::DEBUG::2017-02-01 23:31:39,700::task::597::Storage.TaskManager.Task::(_updateState)
Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::moving from state init -> state preparing
periodic/8::INFO::2017-02-01
23:31:39,700::logUtils::49::dispatcher::(wrapper) Run and protect:
sendExtendMsg(spUUID=u'588237b8-0031-02f6-035d-000000000136',
volDict={'newSize': 13958643712, 'domainID':
u'922b5269-ab56-4c4d-838f-49d33427e2ab', 'name': u'vda',
'volumeID':
u'66615920-619f-4f52-ad5a-3f062c054094', 'imageID':
u'32b0e94d-0c8e-4d3d-a4eb-448885297e93', 'internal': False,
'poolID':
u'588237b8-0031-02f6-035d-000000000136'}, newSize=13958643712,
callbackFunc=<bound method Vm.__afterVolumeExtension of <virt.vm.Vm object
at 0x2feb5d0>>)
periodic/8::DEBUG::2017-02-01
23:31:39,701::storage_mailbox::125::Storage.SPM.Messages.Extend::(__init__)
new extend msg created: domain: 922b5269-ab56-4c4d-838f-49d33427e2ab,
volume: 66615920-619f-4f52-ad5a-3f062c054094
periodic/8::INFO::2017-02-01
23:31:39,701::logUtils::52::dispatcher::(wrapper) Run and protect:
sendExtendMsg, Return response: None
periodic/8::DEBUG::2017-02-01
23:31:39,701::task::1193::Storage.TaskManager.Task::(prepare)
Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::finished: None
periodic/8::DEBUG::2017-02-01
23:31:39,701::task::597::Storage.TaskManager.Task::(_updateState)
Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::moving from state preparing ->
state finished
periodic/8::DEBUG::2017-02-01
23:31:39,701::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll)
Owner.releaseAll requests {} resources {}
periodic/8::DEBUG::2017-02-01
23:31:39,701::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll)
Owner.cancelAll requests {}
periodic/8::DEBUG::2017-02-01
23:31:39,701::task::995::Storage.TaskManager.Task::(_decref)
Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::ref 0 aborting False
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,702::storage_mailbox::428::Storage.MailBox.HsmMailMonitor::(_handleMessage)
HSM_MailMonitor - start: 64, end: 128, len: 4096, message(1/63):
"1xtnd\xab\xe2'4\xd3I\x8f\x83MLV\xabiR+\x92\x94@\x05,\x06?Z\xadRO\x9faYaf000000000000340000000000000"
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,702::storage_mailbox::375::Storage.Misc.excCmd::(_checkForMail)
/usr/bin/taskset --cpu-list 0-7 /usr/bin/dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/outbox
iflag=direct,fullblock bs=512 count=8 skip=16 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01
23:31:39,790::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail)
/usr/bin/taskset --cpu-list 0-7 dd
if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox
iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.HSMMonitor::DEBUG::2017-02-01
23:31:39,815::storage_mailbox::375::Storage.Misc.excCmd::(_checkForMail)
SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB)
copied, 0.0900633 s, 45.5 kB/s\n'; <rc> = 0
mailbox.HSMMonitor::INFO::2017-02-01
23:31:39,816::storage_mailbox::387::Storage.MailBox.HsmMailMonitor::(_sendMail)
HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd',
'of=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox',
'iflag=fullblock', 'oflag=direct', 'conv=notrunc',
'bs=512', 'seek=16']
mailbox.HSMMonitor::DEBUG::2017-02-01
23:31:39,816::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd)
/usr/bin/taskset --cpu-list 0-7 /usr/bin/dd
of=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox
iflag=fullblock oflag=direct conv=notrunc bs=512 seek=16 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01
23:31:39,844::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail)
SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB)
copied, 0.0290379 s, 35.3 MB/s\n'; <rc> = 0
periodic/7::INFO::2017-02-01
23:31:41,700::vm::1028::virt.vm::(extendDrivesIfNeeded)
vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume
66615920-619f-4f52-ad5a-3f062c054094 on domain
922b5269-ab56-4c4d-838f-49d33427e2ab (apparent: 12884901888, capacity:
96636764160, allocated: 12617187328, physical: 12884901888)
...
mailbox.SPMMonitor::DEBUG::2017-02-01
23:31:42,181::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail)
SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB)
copied, 0.30031 s, 3.4 MB/s\n'; <rc> = 0
mailbox.SPMMonitor::ERROR::2017-02-01
23:31:42,182::storage_mailbox::619::Storage.MailBox.SpmMailMonitor::(_validateMailbox)
SPM_MailMonitor: mailbox 2 checksum failed, not clearing mailbox, clearing
newMail.
jsonrpc.Executor/6::DEBUG::2017-02-01
23:31:42,587::__init__::530::jsonrpc.JsonRpcServer::(_handle_request)
Calling 'Host.getAllVmStats' in bridge with {}
jsonrpc.Executor/6::DEBUG::2017-02-01
23:31:42,589::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return
'Host.getAllVmStats' in bridge with (suppressed)
jsonrpc.Executor/6::INFO::2017-02-01
23:31:42,590::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call
Host.getAllVmStats succeeded in 0.00 seconds
libvirtEventLoop::INFO::2017-02-01
23:31:43,586::vm::4041::virt.vm::(onIOError)
vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::abnormal vm stop device
virtio-disk0 error enospc
libvirtEventLoop::INFO::2017-02-01
23:31:43,586::vm::4877::virt.vm::(_logGuestCpuStatus)
vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::CPU stopped: onIOError
periodic/6::WARNING::2017-02-01
23:31:43,698::periodic::277::virt.periodic.VmDispatcher::(__call__) could
not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on
[u'932db7c7-4121-4cbe-ad8d-09e4e99b3cdd']
mailbox.HSMMonitor::DEBUG::2017-02-01
23:31:43,790::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd)
SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB)
copied, 3.95001 s, 1.0 kB/s\n'; <rc> = 0
The disk of the VM is 90 GiB in virtual size, thin provisioned.
Output of the commands inside the guest until Ctrl+C (I interrupted the loop when I
received the VM paused error...):
[g.cecchi@ol65 ~]$ while true; do time dd if=/dev/zero bs=1024k count=1024
of=/home/g.cecchi/testfile; sleep 5; done
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.27353 s, 116 MB/s
real 0m9.353s
user 0m0.007s
sys 0m3.174s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.68965 s, 111 MB/s
real 0m9.929s
user 0m0.000s
sys 0m2.972s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.16642 s, 117 MB/s
real 0m9.412s
user 0m0.008s
sys 0m2.766s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.68209 s, 400 MB/s
real 0m2.931s
user 0m0.004s
sys 0m2.927s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 6.11601 s, 176 MB/s
real 0m9.158s
user 0m0.002s
sys 0m2.922s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.67673 s, 401 MB/s
real 0m2.921s
user 0m0.004s
sys 0m2.916s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.10537 s, 262 MB/s
real 0m6.885s
user 0m0.009s
sys 0m2.984s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.56343 s, 193 MB/s
real 0m7.866s
user 0m0.004s
sys 0m3.016s
^C
For the VM disk I get this info:
[root@ovmsrv06 vdsm]# qemu-img info
/rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/images/32b0e94d-0c8e-4d3d-a4eb-448885297e93/66615920-619f-4f52-ad5a-3f062c054094
image:
/rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/images/32b0e94d-0c8e-4d3d-a4eb-448885297e93/66615920-619f-4f52-ad5a-3f062c054094
file format: qcow2
virtual size: 90G (96636764160 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
compat: 0.10
refcount bits: 16
[root@ovmsrv06 vdsm]#
Apparently disk size 0?????
qemu-img probably cannot report the actual allocated disk size on block devices.
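If you want to see how much is actually allocated for this volume on block
storage, you can query the backing LV directly; a rough check (assuming, as
your logs suggest, that the VG is named after the storage domain UUID and the
LV after the volume UUID) would be something like:

# show the size of the LV backing the volume
# (VG name = storage domain UUID, LV name = volume UUID, both from your logs)
lvs -o lv_name,lv_size 922b5269-ab56-4c4d-838f-49d33427e2ab | grep 66615920-619f-4f52-ad5a-3f062c054094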
In the output of
cat /rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/dom_md/metadata
I get:
EOF
DOMAIN=922b5269-ab56-4c4d-838f-49d33427e2ab
CTIME=1485272737
FORMAT=COW
DISKTYPE=2
LEGALITY=LEGAL
SIZE=188743680
VOLTYPE=LEAF
DESCRIPTION={"DiskAlias":"ol65_Disk1","DiskDescription":""}
IMAGE=32b0e94d-0c8e-4d3d-a4eb-448885297e93
PUUID=00000000-0000-0000-0000-000000000000
MTIME=0
POOL_UUID=
TYPE=SPARSE
EOF