
On March 28, 2020 3:21:45 AM GMT+02:00, Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Hello, having deployed oVirt 4.3.9 single-host HCI with Gluster, I sometimes see a VM going into paused state with the error above, needing to be resumed manually (and sometimes this resume operation fails). So far it has only happened with an empty thin-provisioned disk under sudden high I/O during the initial phase of the OS install; it has not happened during normal operation (even at 600 MB/s of throughput). I suspect something related to metadata extension not being able to keep pace with the growth of the virtual disk, similar to what happens on block-based storage domains, where the LVM layer has to extend the logical volume backing the virtual disk.
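For reference, a minimal way to confirm on the host why qemu paused the guest (assuming "master01" is the libvirt domain name; virsh -r gives a read-only connection on oVirt hosts, so no SASL credentials are needed):

virsh -r domstate master01 --reason   # e.g. "paused (ioerror)"
virsh -r domblkerror master01         # which block device reported the error, if any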
My real-world reproduction of the error is during the install of an OCP 4.3.8 master node: Red Hat CoreOS boots from the network, wipes the disk and then, I think, transfers an image, so it generates high I/O right away. The VM used as the master node was created with a 120 GB thin-provisioned disk (virtio-scsi) and starts the PXE install with the disk just initialized and empty. I get this line in the events for the VM:
Mar 27, 2020, 12:35:23 AM VM master01 has been paused due to unknown storage error.
Here are the logs around the time frame above:
- engine.log https://drive.google.com/file/d/1zpNo5IgFVTAlKXHiAMTL-uvaoXSNMVRO/view?usp=s...
- vdsm.log https://drive.google.com/file/d/1v8kR0N6PdHBJ5hYzEYKl4-m7v1Lb_cYX/view?usp=s...
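For a first pass over the downloaded files, something like the following narrows them to the minutes around the event (the timestamps and search strings are assumptions based on the event above; exact message text varies between versions, so adjust as needed):

grep -iE 'master01|paus' engine.log | grep '2020-03-27 00:3'
grep -iE 'paus|i/o error|abnormal' vdsm.log | grep ' 00:35:'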
Any suggestions?
The disk of the VM is on the vmstore storage domain, and its Gluster volume settings are:
[root@ovirt tmp]# gluster volume info vmstore
Volume Name: vmstore
Type: Distribute
Volume ID: a6203d77-3b9d-49f9-94c5-9e30562959c4
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: ovirtst.mydomain.storage:/gluster_bricks/vmstore/vmstore
Options Reconfigured:
performance.low-prio-threads: 32
storage.owner-gid: 36
performance.read-ahead: off
user.cifs: off
storage.owner-uid: 36
performance.io-cache: off
performance.quick-read: off
network.ping-timeout: 30
features.shard: on
network.remote-dio: off
cluster.eager-lock: enable
performance.strict-o-direct: on
transport.address-family: inet
nfs.disable: on
[root@ovirt tmp]#
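Individual options (including defaults not listed under "Options Reconfigured") can be checked with gluster volume get; for example, the ones most often discussed for VM workloads and sharding:

gluster volume get vmstore network.remote-dio
gluster volume get vmstore performance.strict-o-direct
gluster volume get vmstore features.shard-block-size
gluster volume get vmstore all | less    # full effective configuration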
Regarding the configuration above, are there optimizations worth applying given that this is a single-host deployment? And how does it compare with the virt group of options:
[root@ovirt tmp]# cat /var/lib/glusterd/groups/virt
performance.quick-read=off
performance.read-ahead=off
performance.io-cache=off
performance.low-prio-threads=32
network.remote-dio=enable
cluster.eager-lock=enable
cluster.quorum-type=auto
cluster.server-quorum-type=server
cluster.data-self-heal-algorithm=full
cluster.locking-scheme=granular
cluster.shd-max-threads=8
cluster.shd-wait-qlength=10000
features.shard=on
user.cifs=off
cluster.choose-local=off
client.event-threads=4
server.event-threads=4
performance.client-io-threads=on
[root@ovirt tmp]#
?
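For comparison, the whole virt group can be applied in one shot with the group keyword. Note, though, that from the outputs above the current volume deliberately differs from the group on network.remote-dio (off vs enable), so applying the group would flip that option; settings not listed in the group file (e.g. performance.strict-o-direct, network.ping-timeout) are left untouched:

# applies every option listed in /var/lib/glusterd/groups/virt to the volume
gluster volume set vmstore group virt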
Thanks Gianluca
Hi Gianluca,
Is it happening on machines with preallocated disks or on machines with thin disks?
Best Regards,
Strahil Nikolov