
On March 28, 2020 3:21:45 AM GMT+02:00, Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Hello, having deployed oVirt 4.3.9 single-host HCI with Gluster, I sometimes see a VM going into paused state with the error above, needing to be resumed manually (and sometimes this resume operation fails). So far it has only happened with an empty thin-provisioned disk under sudden high I/O during the initial phase of the OS install; it has not happened during normal operation (even at 600 MB/s of throughput). I suspect something related to metadata extension not being able to keep pace with the growth of the virtual disk, similar to what happens on block-based storage domains, where the LVM layer has to extend the logical volume backing the virtual disk.
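For reference, a minimal way to confirm on the host why qemu paused the guest (assuming "master01" is the libvirt domain name; virsh -r gives a read-only connection on oVirt hosts, so no SASL credentials are needed):

virsh -r domstate master01 --reason   # e.g. "paused (ioerror)"
virsh -r domblkerror master01         # which block device reported the error, if any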
My real-world reproduction of the error is during the install of an OCP 4.3.8 master node: Red Hat CoreOS boots from the network, wipes the disk and then, I think, transfers an image, so it generates high I/O right away. The VM used as the master node was created with a 120 GB thin-provisioned disk (virtio-scsi) and starts the PXE install with the disk just initialized and empty. I get this line in the events for the VM:
Mar 27, 2020, 12:35:23 AM VM master01 has been paused due to unknown storage error.
Here are the logs around the time frame above:
- engine.log https://drive.google.com/file/d/1zpNo5IgFVTAlKXHiAMTL-uvaoXSNMVRO/view?usp=s...
- vdsm.log https://drive.google.com/file/d/1v8kR0N6PdHBJ5hYzEYKl4-m7v1Lb_cYX/view?usp=s...
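For a first pass over the downloaded files, something like the following narrows them to the minutes around the event (the timestamps and search strings are assumptions based on the event above; exact message text varies between versions, so adjust as needed):

grep -iE 'master01|paus' engine.log | grep '2020-03-27 00:3'
grep -iE 'paus|i/o error|abnormal' vdsm.log | grep ' 00:35:'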
Any suggestions?
The disk of the VM is on the vmstore storage domain, and its Gluster volume settings are:
[root@ovirt tmp]# gluster volume info vmstore
Volume Name: vmstore
Type: Distribute
Volume ID: a6203d77-3b9d-49f9-94c5-9e30562959c4
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: ovirtst.mydomain.storage:/gluster_bricks/vmstore/vmstore
Options Reconfigured:
performance.low-prio-threads: 32
storage.owner-gid: 36
performance.read-ahead: off
user.cifs: off
storage.owner-uid: 36
performance.io-cache: off
performance.quick-read: off
network.ping-timeout: 30
features.shard: on
network.remote-dio: off
cluster.eager-lock: enable
performance.strict-o-direct: on
transport.address-family: inet
nfs.disable: on
[root@ovirt tmp]#
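Individual options (including defaults not listed under "Options Reconfigured") can be checked with gluster volume get; for example, the ones most often discussed for VM workloads and sharding:

gluster volume get vmstore network.remote-dio
gluster volume get vmstore performance.strict-o-direct
gluster volume get vmstore features.shard-block-size
gluster volume get vmstore all | less    # full effective configuration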
Regarding the configuration above, are there optimizations worth applying given that this is a single-host deployment? And how does it compare with the virt group of options:
[root@ovirt tmp]# cat /var/lib/glusterd/groups/virt
performance.quick-read=off
performance.read-ahead=off
performance.io-cache=off
performance.low-prio-threads=32
network.remote-dio=enable
cluster.eager-lock=enable
cluster.quorum-type=auto
cluster.server-quorum-type=server
cluster.data-self-heal-algorithm=full
cluster.locking-scheme=granular
cluster.shd-max-threads=8
cluster.shd-wait-qlength=10000
features.shard=on
user.cifs=off
cluster.choose-local=off
client.event-threads=4
server.event-threads=4
performance.client-io-threads=on
[root@ovirt tmp]#
?
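For comparison, the whole virt group can be applied in one shot with the group keyword. Note, though, that from the outputs above the current volume deliberately differs from the group on network.remote-dio (off vs enable), so applying the group would flip that option; settings not listed in the group file (e.g. performance.strict-o-direct, network.ping-timeout) are left untouched:

# applies every option listed in /var/lib/glusterd/groups/virt to the volume
gluster volume set vmstore group virt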
Thanks Gianluca
Hi Gianluca,
Is it happening on machines with preallocated disks or on machines with thin disks?
Best Regards,
Strahil Nikolov