I tried to measure I/O using gluster volume top, but its results seem very cryptic to me
(they would need a deeper analysis and I don't have the time now).
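As a side note, gluster volume top can apparently also run a small throughput test on the
bricks instead of just counting calls; I haven't tried this form yet, so take the exact
invocation as an assumption from the docs rather than something verified here:

  # measure raw brick read/write throughput with 4 KiB blocks
  gluster volume top gv1 read-perf bs 4096 count 1024
  gluster volume top gv1 write-perf bs 4096 count 1024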
Thank you very much for your analysis. If I understood correctly, the problem is that the
consumer SSD cache is too weak to keep up even under a small number (~15) of not
particularly I/O intensive VMs, so I/O stalls because the performance is poor, and this
hangs the VMs. The VM kernel then thinks the CPU has hung and crashes.
This seems to be the case....
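That crash behavior matches the guest kernel's hung-task / soft-lockup watchdogs. As a
stopgap one could perhaps tune them inside the guests so that slow storage only stalls
the VMs instead of crashing them; this is an assumption on my part, I haven't tested it
on these VMs:

  # inside a guest: log watchdog hits instead of panicking
  sysctl -w kernel.hung_task_panic=0
  sysctl -w kernel.softlockup_panic=0
  # give slow storage more time before a task is flagged as hung
  sysctl -w kernel.hung_task_timeout_secs=300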
If possible, it would be very useful to have a sort of profiler in the Gluster environment
that surfaces evidence of issues related to the speed of the underlying storage
infrastructure, whether the problem lies in the disks or in the network. In any case, the
errors currently reported to the user are quite misleading, as they suggest a data
integrity issue ("cannot read..." or something like this).
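Something close to this may already exist in Gluster itself: as far as I can tell
(untested on my side, so consider this a pointer rather than a recipe), gluster volume
profile collects per-brick latency statistics for each file operation, which should make
a slow disk or a slow network stand out:

  gluster volume profile gv1 start
  # ...let the VMs run under load for a while...
  gluster volume profile gv1 info
  gluster volume profile gv1 stop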
Just for reference, these are the first lines of the "open" top command output
(currently I'm not experiencing problems):
[root@ovirt-node2 ~]# gluster volume top gv1 open
Brick: ovirt-node2.ovirt:/brickgv1/gv1
Current open fds: 15, Max open fds: 38, Max openfd time: 2022-09-19 07:27:20.033304 +0000
Count filename
=======================
331763 /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/inbox
66284 /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/leases
53939 /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata.new
169 /45b4f14c-8323-482f-90ab-99d8fd610018/images/910fa026-d30b-4be2-9111-3c9f4f646fde/b7d6f39a-1481-4f5c-84fd-fc43f9e14d71
[...]