
On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
OK. In the meantime I have applied your suggested config and restarted the two nodes. Let me test and see if I find any problems, also running some I/O tests. Thanks meanwhile, Gianluca
Quick test without much success. Inside the guest I run this loop:

while true
do
  time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
  sleep 5
done

BTW: my home is inside the / filesystem of the guest, which has space to accommodate the 1 GB written by the dd command:

[g.cecchi@ol65 ~]$ df -h /home/g.cecchi/
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_ol65-lv_root   20G  4.9G   14G  27% /
[g.cecchi@ol65 ~]$

After about 7 rounds I get this in messages of the host where the VM is running:

Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!

Nothing on the other host. In the web admin events pane:

Feb 1, 2017 11:31:44 PM VM ol65 has been paused due to no Storage space error.
Feb 1, 2017 11:31:44 PM VM ol65 has been paused.
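For reference, a pause like this usually means the guest wrote into the thin disk faster than VDSM could extend the underlying LV. As a sketch (values here are illustrative, not verified on this setup), the watermark behaviour is tunable in /etc/vdsm/vdsm.conf on each host, followed by a vdsmd restart:

```ini
# /etc/vdsm/vdsm.conf -- [irs] section (illustrative values)
[irs]
# Request an extension when the free space in the volume drops below
# this percentage of utilization (default 50).
volume_utilization_percent = 50
# Extend the volume by this many MiB each time (default 1024 = 1 GiB);
# a larger chunk gives fast sequential writers more headroom before
# the next extension round-trip through the SPM mailbox.
volume_utilization_chunk_mb = 2048
```

This is a config fragment only; whether raising the chunk size is appropriate depends on how much over-allocation of the storage domain is acceptable.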
I stop the dd loop and after some seconds:

Feb 1, 2017 11:32:32 PM VM ol65 has recovered from paused back to up

Multipath status for my device:

3600a0b8000299aa80000d08b55014119 dm-2 IBM     ,1814      FAStT
size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdj 8:144 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdk 8:160 active undef running

In engine.log:

2017-02-01 23:22:01,449 INFO  [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-15) [530aee87] Running command: CreateUserSessionCommand internal: false.
2017-02-01 23:22:04,011 INFO  [org.ovirt.engine.docs.utils.servlet.ContextSensitiveHelpMappingServlet] (default task-12) [] Context-sensitive help is not installed. Manual directory doesn't exist: /usr/share/ovirt-engine/manual
2017-02-01 23:31:43,936 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [10c15e39] VM '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Up' --> 'Paused'
2017-02-01 23:31:44,087 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has been paused.
2017-02-01 23:31:44,227 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has been paused due to no Storage space error.
2017-02-01 23:32:32,100 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [1c02a371] VM '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Paused' --> 'Up'
2017-02-01 23:32:32,259 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [1c02a371] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has recovered from paused back to up.

In vdsm.log of the host where the VM is running I see this:

mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,729::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-7 dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,764::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0061866 s, 166 MB/s\n'; <rc> = 0
periodic/8::INFO::2017-02-01 23:31:39,699::vm::1028::virt.vm::(extendDrivesIfNeeded) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume 66615920-619f-4f52-ad5a-3f062c054094 on domain 922b5269-ab56-4c4d-838f-49d33427e2ab (apparent: 12884901888, capacity: 96636764160, allocated: 12355698688, physical: 12884901888)
periodic/8::DEBUG::2017-02-01 23:31:39,699::vm::1093::virt.vm::(__extendDriveVolume) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting an extension for the volume: {'newSize': 13958643712, 'domainID': u'922b5269-ab56-4c4d-838f-49d33427e2ab', 'name': u'vda', 'volumeID': u'66615920-619f-4f52-ad5a-3f062c054094', 'imageID': u'32b0e94d-0c8e-4d3d-a4eb-448885297e93', 'internal': False, 'poolID': u'588237b8-0031-02f6-035d-000000000136'}
periodic/8::DEBUG::2017-02-01 23:31:39,700::task::597::Storage.TaskManager.Task::(_updateState) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::moving from state init -> state preparing
periodic/8::INFO::2017-02-01 23:31:39,700::logUtils::49::dispatcher::(wrapper) Run and protect: sendExtendMsg(spUUID=u'588237b8-0031-02f6-035d-000000000136', volDict={'newSize': 13958643712, 'domainID': u'922b5269-ab56-4c4d-838f-49d33427e2ab', 'name': u'vda', 'volumeID': u'66615920-619f-4f52-ad5a-3f062c054094', 'imageID': u'32b0e94d-0c8e-4d3d-a4eb-448885297e93', 'internal': False, 'poolID': u'588237b8-0031-02f6-035d-000000000136'}, newSize=13958643712, callbackFunc=<bound method Vm.__afterVolumeExtension of <virt.vm.Vm object at 0x2feb5d0>>)
periodic/8::DEBUG::2017-02-01 23:31:39,701::storage_mailbox::125::Storage.SPM.Messages.Extend::(__init__) new extend msg created: domain: 922b5269-ab56-4c4d-838f-49d33427e2ab, volume: 66615920-619f-4f52-ad5a-3f062c054094
periodic/8::INFO::2017-02-01 23:31:39,701::logUtils::52::dispatcher::(wrapper) Run and protect: sendExtendMsg, Return response: None
periodic/8::DEBUG::2017-02-01 23:31:39,701::task::1193::Storage.TaskManager.Task::(prepare) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::finished: None
periodic/8::DEBUG::2017-02-01 23:31:39,701::task::597::Storage.TaskManager.Task::(_updateState) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::moving from state preparing -> state finished
periodic/8::DEBUG::2017-02-01 23:31:39,701::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
periodic/8::DEBUG::2017-02-01 23:31:39,701::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
periodic/8::DEBUG::2017-02-01 23:31:39,701::task::995::Storage.TaskManager.Task::(_decref) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::ref 0 aborting False
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,702::storage_mailbox::428::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - start: 64, end: 128, len: 4096, message(1/63): "1xtnd\xab\xe2'4\xd3I\x8f\x83MLV\xabiR+\x92\x94@\x05,\x06?Z\xadRO\x9fa Yaf000000000000340000000000000"
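The extension request numbers above are consistent with the default behaviour: the volume's apparent size is 12884901888 bytes and the requested newSize is 13958643712, i.e. exactly one 1 GiB chunk on top of the current size. A quick sanity check:

```shell
# requested newSize minus current apparent size = one extension chunk
echo $((13958643712 - 12884901888))   # 1073741824 bytes = 1 GiB

# capacity reported in the log, expressed in GiB
echo $((96636764160 / 1073741824))    # 90 -> matches the 90G virtual disk
```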
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,702::storage_mailbox::375::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-7 /usr/bin/dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/outbox iflag=direct,fullblock bs=512 count=8 skip=16 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:39,790::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-7 dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,815::storage_mailbox::375::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 0.0900633 s, 45.5 kB/s\n'; <rc> = 0
mailbox.HSMMonitor::INFO::2017-02-01 23:31:39,816::storage_mailbox::387::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=16']
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,816::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) /usr/bin/taskset --cpu-list 0-7 /usr/bin/dd of=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox iflag=fullblock oflag=direct conv=notrunc bs=512 seek=16 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:39,844::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0290379 s, 35.3 MB/s\n'; <rc> = 0
periodic/7::INFO::2017-02-01 23:31:41,700::vm::1028::virt.vm::(extendDrivesIfNeeded) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume 66615920-619f-4f52-ad5a-3f062c054094 on domain 922b5269-ab56-4c4d-838f-49d33427e2ab (apparent: 12884901888, capacity: 96636764160, allocated: 12617187328, physical: 12884901888)
...
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:42,181::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.30031 s, 3.4 MB/s\n'; <rc> = 0
mailbox.SPMMonitor::ERROR::2017-02-01 23:31:42,182::storage_mailbox::619::Storage.MailBox.SpmMailMonitor::(_validateMailbox) SPM_MailMonitor: mailbox 2 checksum failed, not clearing mailbox, clearing newMail.
jsonrpc.Executor/6::DEBUG::2017-02-01 23:31:42,587::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'Host.getAllVmStats' in bridge with {}
jsonrpc.Executor/6::DEBUG::2017-02-01 23:31:42,589::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.getAllVmStats' in bridge with (suppressed)
jsonrpc.Executor/6::INFO::2017-02-01 23:31:42,590::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call Host.getAllVmStats succeeded in 0.00 seconds
libvirtEventLoop::INFO::2017-02-01 23:31:43,586::vm::4041::virt.vm::(onIOError) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::abnormal vm stop device virtio-disk0 error enospc
libvirtEventLoop::INFO::2017-02-01 23:31:43,586::vm::4877::virt.vm::(_logGuestCpuStatus) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::CPU stopped: onIOError
periodic/6::WARNING::2017-02-01 23:31:43,698::periodic::277::virt.periodic.VmDispatcher::(__call__) could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on [u'932db7c7-4121-4cbe-ad8d-09e4e99b3cdd']
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:43,790::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 3.95001 s, 1.0 kB/s\n'; <rc> = 0

The disk of the VM is 90 GB in virtual size, thin provisioned.

Output of the commands inside the guest until Ctrl+C (I interrupted the loop when I received the VM paused error...):

[g.cecchi@ol65 ~]$ while true; do time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile; sleep 5; done
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.27353 s, 116 MB/s

real    0m9.353s
user    0m0.007s
sys     0m3.174s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.68965 s, 111 MB/s

real    0m9.929s
user    0m0.000s
sys     0m2.972s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.16642 s, 117 MB/s

real    0m9.412s
user    0m0.008s
sys     0m2.766s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.68209 s, 400 MB/s

real    0m2.931s
user    0m0.004s
sys     0m2.927s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 6.11601 s, 176 MB/s

real    0m9.158s
user    0m0.002s
sys     0m2.922s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.67673 s, 401 MB/s

real    0m2.921s
user    0m0.004s
sys     0m2.916s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.10537 s, 262 MB/s

real    0m6.885s
user    0m0.009s
sys     0m2.984s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.56343 s, 193 MB/s

real    0m7.866s
user    0m0.004s
sys     0m3.016s
^C

For the VM disk I get this info:

[root@ovmsrv06 vdsm]# qemu-img info /rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/images/32b0e94d-0c8e-4d3d-a4eb-448885297e93/66615920-619f-4f52-ad5a-3f062c054094
image: /rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/images/32b0e94d-0c8e-4d3d-a4eb-448885297e93/66615920-619f-4f52-ad5a-3f062c054094
file format: qcow2
virtual size: 90G (96636764160 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
    compat: 0.10
    refcount bits: 16
[root@ovmsrv06 vdsm]#

Apparently disk size 0?????
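If I understand qemu-img correctly, the "disk size: 0" is an artifact of querying a block device: on a regular file the disk size is derived from the file's allocated blocks, and a logical volume doesn't report that, so 0 is printed rather than the true allocation. The actual allocation is the LV size itself, which the vdsm log above already shows as "physical: 12884901888":

```shell
# physical size from the vdsm extension log, expressed in GiB
# (the LV backing the thin disk, not the qcow2 payload)
echo $((12884901888 / 1024 / 1024 / 1024))   # 12 (GiB)
```

On a live host the same figure should be visible with lvs against the storage-domain VG, e.g. lvs --units b on the volume's LV (exact invocation depends on the setup).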
In cat /rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/dom_md/metadata I get:

EOF
DOMAIN=922b5269-ab56-4c4d-838f-49d33427e2ab
CTIME=1485272737
FORMAT=COW
DISKTYPE=2
LEGALITY=LEGAL
SIZE=188743680
VOLTYPE=LEAF
DESCRIPTION={"DiskAlias":"ol65_Disk1","DiskDescription":""}
IMAGE=32b0e94d-0c8e-4d3d-a4eb-448885297e93
PUUID=00000000-0000-0000-0000-000000000000
MTIME=0
POOL_UUID=
TYPE=SPARSE
EOF
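One note on this metadata: SIZE is counted in 512-byte sectors, not bytes, so it is consistent with the 90G virtual size that qemu-img reports:

```shell
# SIZE=188743680 from the volume metadata is in 512-byte sectors
echo $((188743680 * 512))                        # 96636764160 bytes
echo $((188743680 * 512 / 1024 / 1024 / 1024))   # 90 (GiB), matching qemu-img
```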