On Wed, Feb 1, 2017 at 8:22 PM, Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:

OK. In the meantime I have applied your suggested config and restarted the 2 nodes.
Let me test and see if I find any problems, also running some I/O tests.
Thanks in the meantime,
Gianluca


Quick test without much success

Inside the guest I run this loop:
while true
do
time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile
sleep 5
done

BTW: my home is inside the / filesystem on the guest, which has enough space to accommodate the 1 GB written by the dd command:
[g.cecchi@ol65 ~]$ df -h /home/g.cecchi/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_ol65-lv_root
                       20G  4.9G   14G  27% /
[g.cecchi@ol65 ~]$

After about 7 rounds I get this in /var/log/messages on the host where the VM is running:

Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:39 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:44 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:45 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:47 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:50 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:56 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!
Feb  1 23:31:57 ovmsrv06 kernel: dd: sending ioctl 80306d02 to a partition!

Nothing on the other host.
In web admin events pane:
Feb 1, 2017 11:31:44 PM VM ol65 has been paused due to no Storage space error.
Feb 1, 2017 11:31:44 PM VM ol65 has been paused.

I stop the dd loop and after some seconds:
Feb 1, 2017 11:32:32 PM VM ol65 has recovered from paused back to up

Multipath status for my device:

3600a0b8000299aa80000d08b55014119 dm-2 IBM     ,1814      FAStT
size=4.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 0:0:1:3 sdj 8:144 active undef running
| `- 2:0:1:3 sdp 8:240 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 0:0:0:3 sdd 8:48  active undef running
  `- 2:0:0:3 sdk 8:160 active undef running

In engine.log

2017-02-01 23:22:01,449 INFO  [org.ovirt.engine.core.bll.aaa.CreateUserSessionCommand] (default task-15) [530aee87] Running command: CreateUserSessionCommand internal: false.
2017-02-01 23:22:04,011 INFO  [org.ovirt.engine.docs.utils.servlet.ContextSensitiveHelpMappingServlet] (default task-12) [] Context-sensitive help is not installed. Manual directory doesn't exist: /usr/share/ovirt-engine/manual
2017-02-01 23:31:43,936 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-10) [10c15e39] VM '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Up' --> 'Paused'
2017-02-01 23:31:44,087 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has been paused.
2017-02-01 23:31:44,227 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [10c15e39] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has been paused due to no Storage space error.
2017-02-01 23:32:32,100 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [1c02a371] VM '932db7c7-4121-4cbe-ad8d-09e4e99b3cdd'(ol65) moved from 'Paused' --> 'Up'
2017-02-01 23:32:32,259 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [1c02a371] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM ol65 has recovered from paused back to up.


In vdsm.log of the host where the VM is running I see this:

mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,729::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-7 dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:37,764::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0061866 s, 166 MB/s\n'; <rc> = 0
periodic/8::INFO::2017-02-01 23:31:39,699::vm::1028::virt.vm::(extendDrivesIfNeeded) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume 66615920-619f-4f52-ad5a-3f062c054094 on domain 922b5269-ab56-4c4d-838f-49d33427e2ab (apparent: 12884901888, capacity: 96636764160, allocated: 12355698688, physical: 12884901888)
periodic/8::DEBUG::2017-02-01 23:31:39,699::vm::1093::virt.vm::(__extendDriveVolume) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting an extension for the volume: {'newSize': 13958643712, 'domainID': u'922b5269-ab56-4c4d-838f-49d33427e2ab', 'name': u'vda', 'volumeID': u'66615920-619f-4f52-ad5a-3f062c054094', 'imageID': u'32b0e94d-0c8e-4d3d-a4eb-448885297e93', 'internal': False, 'poolID': u'588237b8-0031-02f6-035d-000000000136'}
periodic/8::DEBUG::2017-02-01 23:31:39,700::task::597::Storage.TaskManager.Task::(_updateState) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::moving from state init -> state preparing
periodic/8::INFO::2017-02-01 23:31:39,700::logUtils::49::dispatcher::(wrapper) Run and protect: sendExtendMsg(spUUID=u'588237b8-0031-02f6-035d-000000000136', volDict={'newSize': 13958643712, 'domainID': u'922b5269-ab56-4c4d-838f-49d33427e2ab', 'name': u'vda', 'volumeID': u'66615920-619f-4f52-ad5a-3f062c054094', 'imageID': u'32b0e94d-0c8e-4d3d-a4eb-448885297e93', 'internal': False, 'poolID': u'588237b8-0031-02f6-035d-000000000136'}, newSize=13958643712, callbackFunc=<bound method Vm.__afterVolumeExtension of <virt.vm.Vm object at 0x2feb5d0>>)
periodic/8::DEBUG::2017-02-01 23:31:39,701::storage_mailbox::125::Storage.SPM.Messages.Extend::(__init__) new extend msg created: domain: 922b5269-ab56-4c4d-838f-49d33427e2ab, volume: 66615920-619f-4f52-ad5a-3f062c054094
periodic/8::INFO::2017-02-01 23:31:39,701::logUtils::52::dispatcher::(wrapper) Run and protect: sendExtendMsg, Return response: None
periodic/8::DEBUG::2017-02-01 23:31:39,701::task::1193::Storage.TaskManager.Task::(prepare) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::finished: None
periodic/8::DEBUG::2017-02-01 23:31:39,701::task::597::Storage.TaskManager.Task::(_updateState) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::moving from state preparing -> state finished
periodic/8::DEBUG::2017-02-01 23:31:39,701::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
periodic/8::DEBUG::2017-02-01 23:31:39,701::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
periodic/8::DEBUG::2017-02-01 23:31:39,701::task::995::Storage.TaskManager.Task::(_decref) Task=`d5169f55-8fa6-494e-bf3c-b077b922e06c`::ref 0 aborting False
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,702::storage_mailbox::428::Storage.MailBox.HsmMailMonitor::(_handleMessage) HSM_MailMonitor - start: 64, end: 128, len: 4096, message(1/63): "1xtnd\xab\xe2'4\xd3I\x8f\x83MLV\xabiR+\x92\x94@\x05,\x06?Z\xadRO\x9fa Yaf000000000000340000000000000"
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,702::storage_mailbox::375::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-7 /usr/bin/dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/outbox iflag=direct,fullblock bs=512 count=8 skip=16 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:39,790::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) /usr/bin/taskset --cpu-list 0-7 dd if=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,815::storage_mailbox::375::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 0.0900633 s, 45.5 kB/s\n'; <rc> = 0
mailbox.HSMMonitor::INFO::2017-02-01 23:31:39,816::storage_mailbox::387::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=16']
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:39,816::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) /usr/bin/taskset --cpu-list 0-7 /usr/bin/dd of=/rhev/data-center/588237b8-0031-02f6-035d-000000000136/mastersd/dom_md/inbox iflag=fullblock oflag=direct conv=notrunc bs=512 seek=16 (cwd None)
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:39,844::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0290379 s, 35.3 MB/s\n'; <rc> = 0
periodic/7::INFO::2017-02-01 23:31:41,700::vm::1028::virt.vm::(extendDrivesIfNeeded) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::Requesting extension for volume 66615920-619f-4f52-ad5a-3f062c054094 on domain 922b5269-ab56-4c4d-838f-49d33427e2ab (apparent: 12884901888, capacity: 96636764160, allocated: 12617187328, physical: 12884901888)
...
mailbox.SPMMonitor::DEBUG::2017-02-01 23:31:42,181::storage_mailbox::733::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.30031 s, 3.4 MB/s\n'; <rc> = 0
mailbox.SPMMonitor::ERROR::2017-02-01 23:31:42,182::storage_mailbox::619::Storage.MailBox.SpmMailMonitor::(_validateMailbox) SPM_MailMonitor: mailbox 2 checksum failed, not clearing mailbox, clearing newMail.
jsonrpc.Executor/6::DEBUG::2017-02-01 23:31:42,587::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'Host.getAllVmStats' in bridge with {}
jsonrpc.Executor/6::DEBUG::2017-02-01 23:31:42,589::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.getAllVmStats' in bridge with (suppressed)
jsonrpc.Executor/6::INFO::2017-02-01 23:31:42,590::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call Host.getAllVmStats succeeded in 0.00 seconds
libvirtEventLoop::INFO::2017-02-01 23:31:43,586::vm::4041::virt.vm::(onIOError) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::abnormal vm stop device virtio-disk0 error enospc
libvirtEventLoop::INFO::2017-02-01 23:31:43,586::vm::4877::virt.vm::(_logGuestCpuStatus) vmId=`932db7c7-4121-4cbe-ad8d-09e4e99b3cdd`::CPU stopped: onIOError
periodic/6::WARNING::2017-02-01 23:31:43,698::periodic::277::virt.periodic.VmDispatcher::(__call__) could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on [u'932db7c7-4121-4cbe-ad8d-09e4e99b3cdd']
mailbox.HSMMonitor::DEBUG::2017-02-01 23:31:43,790::storage_mailbox::86::Storage.Misc.excCmd::(_mboxExecCmd) SUCCESS: <err> = '8+0 records in\n8+0 records out\n4096 bytes (4.1 kB) copied, 3.95001 s, 1.0 kB/s\n'; <rc> = 0
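
Side note: the numbers in the extension request look consistent with vdsm's default 1 GiB extension chunk (volume_utilization_chunk_mb=1024 in vdsm.conf; I'm assuming the default here, we didn't change it): newSize minus the current physical size is exactly one chunk:

```shell
# Values taken from the __extendDriveVolume log line above.
physical=12884901888   # current LV size (12 GiB)
new_size=13958643712   # requested size (13 GiB)
chunk=$((1024 * 1024 * 1024))   # 1 GiB, assumed default volume_utilization_chunk_mb

# The request asks for exactly one chunk on top of the current size.
test $((new_size - physical)) -eq "$chunk" && echo "extension = exactly 1 GiB chunk"
```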

The disk of the VM is 90 GB in virtual size, thin provisioned.

Output of the commands inside the guest until Ctrl+C (I interrupted the loop when I received the VM paused error...):

[g.cecchi@ol65 ~]$ while true; do time dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile; sleep 5; done
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.27353 s, 116 MB/s

real    0m9.353s
user    0m0.007s
sys    0m3.174s



1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.68965 s, 111 MB/s

real    0m9.929s
user    0m0.000s
sys    0m2.972s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.16642 s, 117 MB/s

real    0m9.412s
user    0m0.008s
sys    0m2.766s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.68209 s, 400 MB/s

real    0m2.931s
user    0m0.004s
sys    0m2.927s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 6.11601 s, 176 MB/s

real    0m9.158s
user    0m0.002s
sys    0m2.922s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.67673 s, 401 MB/s

real    0m2.921s
user    0m0.004s
sys    0m2.916s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.10537 s, 262 MB/s

real    0m6.885s
user    0m0.009s
sys    0m2.984s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.56343 s, 193 MB/s

real    0m7.866s
user    0m0.004s
sys    0m3.016s
^C
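
Side note on the figures above: without oflag=direct or conv=fdatasync, dd is partly measuring the guest page cache, which would explain the spread between 111 and 401 MB/s. For steadier numbers one could flush before dd reports, e.g.:

```shell
# conv=fdatasync makes dd fdatasync() the output file before printing the
# rate, so the figure reflects actual storage throughput, not cached writes.
dd if=/dev/zero bs=1024k count=1024 of=/home/g.cecchi/testfile conv=fdatasync
```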

For the VM disk I get this info:

[root@ovmsrv06 vdsm]# qemu-img info /rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/images/32b0e94d-0c8e-4d3d-a4eb-448885297e93/66615920-619f-4f52-ad5a-3f062c054094
image: /rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/images/32b0e94d-0c8e-4d3d-a4eb-448885297e93/66615920-619f-4f52-ad5a-3f062c054094
file format: qcow2
virtual size: 90G (96636764160 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
    compat: 0.10
    refcount bits: 16
[root@ovmsrv06 vdsm]#

Apparently disk size 0?????
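
As far as I understand, "disk size: 0" here is qemu-img reporting the allocated size it gets from stat() (number of allocated blocks), which is not meaningful for a block device like an LV, so I assume it can be ignored. The same effect can be seen on a sparse regular file:

```shell
# A sparse file has a large apparent size but (almost) no allocated blocks,
# which is the figure qemu-img shows as "disk size".
f=$(mktemp)
truncate -s 1G "$f"
stat -c 'apparent: %s bytes, allocated blocks: %b' "$f"
rm -f "$f"
```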

Running

cat /rhev/data-center/mnt/blockSD/922b5269-ab56-4c4d-838f-49d33427e2ab/dom_md/metadata

I get:
EOF
DOMAIN=922b5269-ab56-4c4d-838f-49d33427e2ab
CTIME=1485272737
FORMAT=COW
DISKTYPE=2
LEGALITY=LEGAL
SIZE=188743680
VOLTYPE=LEAF
DESCRIPTION={"DiskAlias":"ol65_Disk1","DiskDescription":""}
IMAGE=32b0e94d-0c8e-4d3d-a4eb-448885297e93
PUUID=00000000-0000-0000-0000-000000000000
MTIME=0
POOL_UUID=
TYPE=SPARSE
EOF