Well, I'll be -- you're absolutely right, and I'm a bit embarrassed I didn't consider that before. The node that's not healing shows connections from 2 FUSE clients and 3 glustershd clients, both of which I expect.

[root@gluster1 ~]# gluster volume status ssd-san client-list
Client connections for volume ssd-san
Name                                                       count
-----                                                     ------
glustershd                                                     3
fuse                                                           2

total clients for volume ssd-san : 5
-----------------------------------------------------------------

But the secondary node, which is constantly healing, shows that it's missing a FUSE connection:

[root@gluster2 ~]# gluster volume status ssd-san client-list
Client connections for volume ssd-san
Name                                                       count
-----                                                     ------
glustershd                                                     3
fuse                                                           1

total clients for volume ssd-san : 4
-----------------------------------------------------------------
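For reference, the per-brick view can narrow down which brick lost the FUSE connection; a minimal sketch, assuming the volume name ssd-san from above (the exact output layout varies by Gluster release):

# Per-brick client detail: the brick whose client list is missing the oVirt
# host's FUSE connection is the one not receiving writes (and thus healing).
gluster volume status ssd-san clients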

I had to restart the glusterd service on node 2 twice before the FUSE client reconnected and stayed connected.
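Roughly the sequence, as a sketch (the systemd unit name glusterd is an assumption based on the usual packaging; adjust if your distribution differs):

# On the node with the missing FUSE connection (gluster2 here)
systemctl restart glusterd

# Re-check until the FUSE count matches the healthy node (fuse = 2 expected)
gluster volume status ssd-san client-list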

Thanks a ton, I really appreciate your help!

On Mon, Mar 22, 2021 at 12:37 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Healing should only be needed after maintenance (for example, patching + reboot) on one of the nodes.
Once the node is back up, the FUSE client (on any host) should reconnect to all Gluster bricks and write to all of them simultaneously.

If you see constant healing, it indicates that a client is not writing to all bricks.

Check whether there is such a client with the following command:
 'gluster volume status VOLNAME clients'
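A quick way to run that check from one host against every node in the cluster (hostnames below are only examples; the client-list form gives the per-volume counts):

# Compare the reported client counts on every node
for h in gluster1 gluster2 gluster3; do
    echo "== $h =="
    ssh root@"$h" gluster volume status VOLNAME client-list
done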


Best Regards,
Strahil Nikolov

On Mon, Mar 22, 2021 at 3:24, Ben
Sorry, just saw this -- I'm not sure I understand what you mean. In any case, the healing process does complete when I stop all of my VMs, which suggests that something about the way oVirt writes to Gluster is causing the problem in the first place.

On Sun, Mar 14, 2021 at 8:06 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Are you sure that the Gluster volume's client count is the same on all nodes?

Best Regards,
Strahil Nikolov

On Sat, Mar 13, 2021 at 23:58, Ben
Hi, I could use some help with a problem I'm having with the Gluster storage servers I use in my oVirt data center. I first noticed it when files would heal constantly after rebooting one of the Gluster nodes -- in my replica 2 + arbiter setup, the node that remained online and the arbiter would begin healing files and never finish.
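To make "never finish" concrete, this is how I watch the heal backlog; a sketch, assuming the volume is named ssd-san ('info summary' needs a reasonably recent Gluster release, plain 'info' works otherwise):

# Watch the pending-heal counts; on a healthy replica 2 + arbiter they should
# drain to zero shortly after the rebooted node rejoins.
watch -n 10 'gluster volume heal ssd-san info summary'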

I raised the issue with the helpful folks over at Gluster: https://github.com/gluster/glusterfs/issues/2226

The short version is this: after running a tcpdump and noticing malformed RPC calls to Gluster from one of my oVirt nodes, they're asking for a stack trace of whatever process performs I/O against the Gluster cluster from oVirt, to figure out what it's doing and whether the write problems could cause the indefinite healing I'm seeing. After checking the qemu PIDs, it doesn't look like they are the ones actually performing the writes -- is there a particular part of the oVirt stack I should look at to find the write operations to Gluster? I don't see anything else doing read/write on the VM image files on the Gluster mount, but I could be missing something.
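One way to narrow down which process holds the image files open on the FUSE mount and grab a stack trace from it; this is only a sketch, and the mount-point path below is an assumption based on the usual oVirt glusterSD storage-domain layout:

# List processes with files open under the Gluster FUSE mount
# (path is an example; use the actual glusterSD mount on the oVirt host)
MNT=/rhev/data-center/mnt/glusterSD/gluster1:_ssd-san
lsof +D "$MNT" 2>/dev/null

# Pick the writer's PID (illustrative: just the first one found) and capture a
# userspace stack trace for the Gluster issue; gdb briefly pauses the process,
# and debuginfo packages are needed for useful symbols.
PID=$(lsof -t +D "$MNT" | head -n 1)
gdb -p "$PID" -batch -ex 'thread apply all bt'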

NB: I'm using a traditional Gluster setup with the FUSE client, not hyperconverged.

Thanks in advance for any assistance.