Gluster volumes not healing (perhaps after host maintenance?)

I discovered that the servers I purchased did not come with 10Gbps network cards, as I thought they did, so my storage network has been running on a 1Gbps connection since I deployed the servers into the datacenter a little over a week ago. I purchased 10Gbps cards and put one of my hosts into maintenance mode yesterday, prior to replacing the daughter card. It is now back online and running fine on the 10Gbps card. All VMs seem to be working, even when I migrate them onto cha2, which is the host I did maintenance on yesterday morning. The other two hosts are still running on the 1Gbps connection, but I plan to do maintenance on them next week.

The oVirt manager shows that all 3 hosts are up, and that all of my volumes - and all of my bricks - are up. However, every time I look at the storage, the self-heal info for one of the volumes shows 10 minutes, and the self-heal info for another volume shows 50+ minutes. This morning is the first time in the last couple of days that I've paid close attention to the numbers, but I don't see them going down.

When I log into each of the hosts, I see that everything is connected in gluster. Interestingly, though, gluster on cha3 lists the peer 10.1.0.10 by its IP address rather than by its hostname (cha1). The host that I did the maintenance on is cha2.

[root@cha3-storage dwhite]# gluster peer status
Number of Peers: 2

Hostname: 10.1.0.10
Uuid: 87a4f344-321a-48b9-adfb-e3d2b56b8e7b
State: Peer in Cluster (Connected)

Hostname: cha2-storage.mgt.barredowlweb.com
Uuid: 93e12dee-c37d-43aa-a9e9-f4740b9cab14
State: Peer in Cluster (Connected)

When I run `gluster volume heal data`, I see the following:

[root@cha3-storage dwhite]# gluster volume heal data
Launching heal operation to perform index self heal on volume data has been unsuccessful:
Commit failed on cha2-storage.mgt.barredowlweb.com. Please check log file for details.

I get the same results if I run the command on cha2, for any volume:

[root@cha2-storage dwhite]# gluster volume heal data
Launching heal operation to perform index self heal on volume data has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.

[root@cha2-storage dwhite]# gluster volume heal vmstore
Launching heal operation to perform index self heal on volume vmstore has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
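To quantify whether the heal backlog is actually shrinking, the pending-heal counts can be checked directly from the Gluster CLI rather than from the engine UI. A minimal set of checks, using the volume name 'data' from above (substitute the other volume names as needed); these commands exist on recent Gluster versions:

    # Per-brick count of entries still pending heal
    gluster volume heal data statistics heal-count

    # Per-brick summary of pending, split-brain, and possibly-healing entries
    gluster volume heal data info summary

    # Full list of entries pending heal (can be long)
    gluster volume heal data info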
I see a lot of stuff like this on cha2 in /var/log/glusterfs/glustershd.log:

[2021-04-24 11:33:01.319888] I [rpc-clnt.c:1975:rpc_clnt_reconfig] 2-engine-client-0: changing port to 49153 (from 0)
[2021-04-24 11:33:01.329463] I [MSGID: 114057] [client-handshake.c:1128:select_server_supported_programs] 2-engine-client-0: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2021-04-24 11:33:01.330075] W [MSGID: 114043] [client-handshake.c:727:client_setvolume_cbk] 2-engine-client-0: failed to set the volume [{errno=2}, {error=No such file or directory}]
[2021-04-24 11:33:01.330116] W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 2-engine-client-0: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]
[2021-04-24 11:33:01.330140] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 2-engine-client-0: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
[2021-04-24 11:33:01.330155] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 2-engine-client-0: sending CHILD_CONNECTING event []
[2021-04-24 11:33:01.640480] I [rpc-clnt.c:1975:rpc_clnt_reconfig] 3-vmstore-client-0: changing port to 49154 (from 0)
The message "W [MSGID: 114007] [client-handshake.c:752:client_setvolume_cbk] 3-vmstore-client-0: failed to get from reply dict [{process-uuid}, {errno=22}, {error=Invalid argument}]" repeated 4 times between [2021-04-24 11:32:49.602164] and [2021-04-24 11:33:01.649850]
[2021-04-24 11:33:01.649867] E [MSGID: 114044] [client-handshake.c:757:client_setvolume_cbk] 3-vmstore-client-0: SETVOLUME on remote-host failed [{remote-error=Brick not found}, {errno=2}, {error=No such file or directory}]
[2021-04-24 11:33:01.649969] I [MSGID: 114051] [client-handshake.c:879:client_setvolume_cbk] 3-vmstore-client-0: sending CHILD_CONNECTING event []
[2021-04-24 11:33:01.650095] I [MSGID: 114018] [client.c:2225:client_rpc_notify] 3-vmstore-client-0: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=vmstore-client-0}]

How do I further troubleshoot?
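The repeated SETVOLUME failures with "Brick not found" suggest the self-heal daemon on cha2 is handshaking against a brick port that glusterd no longer recognizes, for example a stale brick process left over from before the maintenance. A hedged way to cross-check, comparing what glusterd advertises with what is actually listening (volume names taken from the log; exact output format varies by version):

    # The port glusterd currently advertises for each brick
    gluster volume status engine
    gluster volume status vmstore

    # The brick processes actually running on cha2 and the ports they bound
    ps aux | grep glusterfsd
    ss -tlnp | grep gluster

    # If a brick shows as not running, or the ports disagree, restarting only
    # the missing brick processes is usually done with:
    gluster volume start engine force
    gluster volume start vmstore force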

Hi David,

let's start with the DNS. Check that both nodes resolve each other (both A/AAAA & PTR records). If you set entries in /etc/hosts, check them out. Also, check the output of 'hostname -s' & 'hostname -f' on both hosts.

Best Regards,
Strahil Nikolov
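For reference, the checks suggested here could look like the following on each host. The cha1 FQDN below is a guess extrapolated from the other hostnames in this thread; substitute the real names:

    hostname -s
    hostname -f

    # Forward resolution through NSS, which consults /etc/hosts first
    getent hosts cha1-storage.mgt.barredowlweb.com
    getent hosts cha2-storage.mgt.barredowlweb.com
    getent hosts cha3-storage.mgt.barredowlweb.com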

As part of my troubleshooting earlier this morning, I gracefully shut down the ovirt-engine so that it would come up on a different host (I can't remember if I mentioned that or not). I just verified forward DNS on all 3 of the hosts. All 3 resolve each other just fine and are able to ping each other. The hostnames look good, too.

I'm fairly certain that this problem didn't exist prior to me shutting the host down and replacing the network card.

That said, I don't think I ever set up rDNS / PTR records to begin with. I don't recall reading that rDNS was a requirement, nor do I remember setting it up when I built the cluster a couple of weeks ago. Is this a requirement? I did set up forward DNS entries in /etc/hosts on each server, though.
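A quick way to tell whether reverse resolution is being served by /etc/hosts or by real PTR records (10.1.0.10 is the storage IP shown earlier in the thread; repeat for the other hosts):

    # Reverse lookup through NSS (consults /etc/hosts first)
    getent hosts 10.1.0.10

    # Reverse lookup against DNS only; an empty answer means no PTR record exists
    dig -x 10.1.0.10 +short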

A/AAAA & PTR records are pretty important.As long as you setup your /etc/hosts jn the format like this you will be OK: 10.10.10.10 host1.anysubdomain.domain host110.10.10.11 host2.anysubdomain.domain host2 Usually the hostname is defined for each peer in the /var/lib/glusterd/peers. Can you check the contents on all nodes ? Best Regards,Strahil Nikolov On Sat, Apr 24, 2021 at 21:57, David White via Users<users@ovirt.org> wrote: _______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-leave@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/CYPYALTFM7ITZZ...

I did have my /etc/hosts set up on all 3 of the oVirt hosts in the format you described, with the exception of the trailing "host1" and "host2" aliases; I only had the FQDN in there.

I had an outage of almost an hour this morning that may or may not be related to this. An "ETL Service" started, at which point a lot of things broke down and I saw a lot of storage-related errors. Everything came back on its own, though. See my other thread that I just started on that topic. As of now, there are NO indications that any of the volumes or disks are out of sync.
A/AAAA & PTR records are pretty important. As long as you setup your /etc/hosts jn the format like this you will be OK:
10.10.10.10 host1.anysubdomain.domain host1 10.10.10.11 host2.anysubdomain.domain host2
Usually the hostname is defined for each peer in the /var/lib/glusterd/peers. Can you check the contents on all nodes ?
Best Regards, Strahil Nikolov
On Sat, Apr 24, 2021 at 21:57, David White via Users <users@ovirt.org> wrote: _______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-leave@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/CYPYALTFM7ITZZ...

Hi David,

just spotted this post from a couple of weeks ago. I have the same problem (a Gluster volume not healing) since the upgrade from 7.x to 8.4, with the same exact errors in glustershd.log and the same errors if I try to heal manually. Typically I can get the volume healed by killing the specific brick processes manually and forcing a volume start (to restart the failed bricks).

Just wondering if you've made any progress on your side? I have also tried upgrading to 9.1 on one of the clusters (I have three different ones affected), but that didn't solve the issue.

Regards,
Marco
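For anyone else landing on this thread, the workaround described above is roughly the following (a sketch only; the volume name 'data' is taken from earlier in the thread and the PID is a placeholder):

    # Find the PID of the affected brick process on the node
    gluster volume status data

    # Kill the stale brick process (use the PID listed in the output above)
    kill <PID>

    # Restart only the bricks that are not running; healthy bricks are left alone
    gluster volume start data force

    # Retrigger the heal
    gluster volume heal data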
participants (3)
- David White
- Marco Fais
- Strahil Nikolov