Storage issues on hosted-engine gluster volume

Dear all,

After a switch failure, our three-host oVirt hyperconverged setup has strange issues with the Gluster replica 3 volume that contains the hosted-engine VM.

After a host is cleanly rebooted (though not after every reboot; it happens quite randomly), the hosted-engine VM starts on that host but is immediately paused. On the other hosts, it runs perfectly. After some digging in the documentation, I realized that this is due to a storage issue. However, the Gluster volume reports healthy, and forcing a heal does not fix the problem.

The only solution (or rather workaround) is to reset the brick on the faulty host and re-create the brick's XFS file system.

This leaves me with some questions: Why does the volume report healthy when it clearly is not? Which commands should I use to detect Gluster issues? And why does this situation happen at all?

Any suggestion is appreciated.

Regards,
Dario

--
Dario Pilori, PhD
I.N.Ri.M. - Istituto Nazionale di Ricerca Metrologica
Sistemi Informatici
Strada delle Cacce, 91 - 10135 - Torino - Italy
Ph: +39 011 3919 459
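On the question of which commands detect Gluster issues beyond a plain status check, a minimal sketch of the usual health checks follows. The volume name `engine` is an assumption based on this hosted-engine setup; substitute your real volume name.

```shell
#!/bin/sh
# Hedged sketch: Gluster health checks beyond a plain "volume status".
# VOL=engine is an assumption; substitute the real volume name.
VOL=engine
if ! command -v gluster >/dev/null 2>&1; then
    echo "gluster CLI not found; run this on a Gluster host"
else
    gluster volume status "$VOL" detail          # per-brick online state, PIDs, free space
    gluster volume heal "$VOL" info summary      # pending-heal counts per brick
    gluster volume heal "$VOL" info split-brain  # files in split-brain, if any
fi
true
```

Note that `heal info` only reports entries the self-heal daemon knows about; a brick that accepts connections but rejects writes can still look healthy here, which may be why the volume reported OK.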

On March 16, 2020 4:47:01 PM GMT+02:00, Dario Pilori <d.pilori@inrim.it> wrote:
You will need to give some info about the environment:
- Gluster version
- Gluster op-version
- Gluster bricks' file system

Have you tried to write in the gluster volume?
Anything in the gluster brick logs (/var/log/glusterfs/bricks/<mountpoint>.log)?

Best Regards,
Strahil Nikolov
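For the "have you tried to write" question, a minimal sketch of a direct-I/O write test on the FUSE mount. The mount path below is a placeholder assumption (oVirt mounts Gluster storage domains under /rhev/data-center/mnt/); `oflag=direct` with 4 KiB alignment mimics how qemu writes disk images, which is closer to the failing workload than a plain `touch`.

```shell
#!/bin/sh
# Hedged sketch: O_DIRECT write test on the mounted Gluster volume.
# MNT is a placeholder; substitute the real hosted-engine storage mount.
MNT=/rhev/data-center/mnt/example-mount
TESTFILE="$MNT/.write-test.$$"
if [ -d "$MNT" ]; then
    # 4 KiB-aligned direct write, similar to qemu's image I/O
    if dd if=/dev/zero of="$TESTFILE" bs=4096 count=1 oflag=direct 2>/dev/null; then
        echo "direct write OK"
    else
        echo "direct write FAILED - compare with the brick log errors"
    fi
    rm -f "$TESTFILE"
else
    echo "mount point $MNT not present on this host"
fi
true
```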

Hi Strahil,

Thank you for your reply.

On Tue, 2020-03-17 at 21:47 +0200, Strahil Nikolov wrote:
> You will need to give some info about the environment:
> Gluster version
> Gluster op-version
> Gluster bricks' file system

The environment is a standard oVirt Node v4.3.8 installation, with Gluster v6.7, op-version 60000, and an XFS file system.

> Have you tried to write in the gluster volume?

Actually, I did not. I will try that the next time we experience this issue.

> Anything in the gluster brick logs (/var/log/glusterfs/bricks/<mountpoint>.log)?

The log file is full of these two lines:
E [MSGID: 113072] [posix-inode-fd-ops.c:1886:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]
E [MSGID: 115067] [server-rpc-fops_v2.c:1373:server4_writev_cbk] 0-engine-server: 4365343: WRITEV 3 (c0a06f65-d09a-484f-bdfb-f01b60844e3b), client: CTX_ID:9f2606c3-1e5a-42fc-babd-5aef1f2ea999-GRAPH_ID:0-PID:31739-HOST:srv.example.com-PC_NAME:engine-client-1-RECON_NO:-6, error-xlator: engine-posix [Invalid argument]

Regards,
Dario

--
Dario Pilori, PhD
INRiM - Istituto Nazionale di Ricerca Metrologica
Sistemi Informatici
Strada delle Cacce, 91 - 10135 - Torino - Italy
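The `posix_writev ... [Invalid argument]` error means the brick's own local write was rejected with EINVAL. With direct I/O, one possible cause is an alignment mismatch between the I/O and the brick file system's sector size, which would also be consistent with a fresh `mkfs.xfs` clearing the problem. As a hedged sketch (the brick path is an assumption based on the oVirt hyperconverged default layout), comparing sector sizes across the three bricks could confirm or rule this out:

```shell
#!/bin/sh
# Hedged sketch: compare XFS sector size and device block sizes on a brick.
# BRICK is an assumption; substitute the real brick mount point.
BRICK=/gluster_bricks/engine
if command -v xfs_info >/dev/null 2>&1 && [ -d "$BRICK" ]; then
    xfs_info "$BRICK" | grep sectsz    # sector size XFS was formatted with
    DEV=$(df --output=source "$BRICK" | tail -n 1)
    blockdev --getss --getpbsz "$DEV"  # logical / physical sector size of the device
else
    echo "xfs_info not available or brick path $BRICK missing; run on a Gluster host"
fi
true
```

If one host's brick shows a different `sectsz` than the others, that host would fail direct writes that succeed elsewhere, matching the behavior described.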