Potential split-brain after upgrading Gluster version and rebooting one of three storage nodes

Hello everyone,

I'm running an oVirt cluster (4.3.10.4-1.el7) on a number of physical nodes with CentOS 7.9.2009, and the Hosted Engine runs as a virtual machine on one of these nodes. For storage, I'm running GlusterFS 6.7 on three separate physical storage nodes (also CentOS 7). Gluster has three volumes of type "Replicate" or "Distributed-Replicate".

I recently updated both the system packages and GlusterFS (to 6.10) on the first storage node (storage1), and now I'm seeing a potential split-brain situation for one of the three volumes when running "gluster volume heal info":

    Brick storage1:/data/glusterfs/nvme/brick1/brick
    Status: Connected
    Number of entries: 0

    Brick storage2:/data/glusterfs/nvme/brick1/brick
    /c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids
    /
    /.shard/.remove_me
    Status: Connected
    Number of entries: 3

    Brick storage3:/data/glusterfs/nvme/brick1/brick
    /c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids
    /
    /.shard/.remove_me
    Status: Connected
    Number of entries: 3

I checked the hashes, and the "dom_md/ids" file has a different md5 on every node. Running a heal on the volume doesn't do anything and the entries remain. The heal info for the other two volumes shows no entries.

The affected volume (type: Replicate) is mounted as a Storage Domain in oVirt using the path "storage1:/nvme" and is used to store the root partitions of all virtual machines, which were running at the time of the upgrade and reboot of storage1. The volume has three bricks, one on each storage node.

For the upgrade I followed the steps shown at https://docs.gluster.org/en/latest/Upgrade-Guide/generic-upgrade-procedure/: I stopped and killed all Gluster-related services, upgraded both the system and Gluster packages, and rebooted storage1.

Is this a split-brain situation, and how can I solve it? I would be very grateful for any help. Please let me know if you require any additional information.

Best regards
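
For reference, the checks described above roughly correspond to the following commands (the volume name "nvme" is an assumption based on the storage domain path "storage1:/nvme"; the md5sum was taken against the local brick on each storage node):

    # heal status and a manually triggered heal
    gluster volume heal nvme info
    gluster volume heal nvme
    # checksum of the affected file on each brick
    md5sum /data/glusterfs/nvme/brick1/brick/c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids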

> Is this a split brain situation and how can I solve this? I would be very grateful for any help.

I've seen it before. Just check on the nodes which brick contains the newest file (there is a timestamp inside it) and then rsync that file from the node with the newest version to the rest. If gluster keeps showing that the file still needs a heal, just "cat" it from the FUSE client (the mountpoint in /rhev/....).
Best Regards, Strahil Nikolov
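
A rough sketch of that rsync-and-cat approach, using the hostnames and paths from the original post. It assumes storage2 turns out to hold the newest copy and that the domain is FUSE-mounted under /rhev/data-center/mnt/glusterSD/storage1:_nvme (derived from the storage domain path "storage1:/nvme"); verify both before copying anything:

    # on storage1 and storage3, after confirming storage2 has the newest ids file
    rsync -av storage2:/data/glusterfs/nvme/brick1/brick/c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids \
        /data/glusterfs/nvme/brick1/brick/c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids

    # then read the file once through the FUSE mount so the self-heal daemon picks it up
    cat /rhev/data-center/mnt/glusterSD/storage1:_nvme/c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids > /dev/null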

Correct me if I'm wrong, but according to the docs there might be a more elegant way of doing something similar with the gluster CLI, e.g. "gluster volume heal <VOLNAME> split-brain latest-mtime <FILE>" -- although I have never tried it myself.

On Mon, Jan 11, 2021 at 1:50 PM Strahil Nikolov via Users <users@ovirt.org> wrote:
>> Is this a split brain situation and how can I solve this? I would be very grateful for any help.
>
> I've seen it before. Just check on the nodes which brick contains the newest file (there is a timestamp inside it) and then rsync that file from the node with newest version to the rest. If gluster keeps showing that the file is still needing heal - just "cat" it from the FUSE client (the mountpoint in /rhev/....).
> Best Regards,
> Strahil Nikolov
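
Filled in with the volume and file from this thread, that would be something like the following (the volume name "nvme" is inferred from the storage domain path, and the file path is given relative to the volume root, as in the heal info output):

    gluster volume heal nvme split-brain latest-mtime /c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids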

On Mon, 11.01.2021 at 14:48 -0400, Jayme wrote:
> Correct me if I'm wrong but according to the docs, there might be a more elegant way of doing something similar with gluster cli ex: gluster volume heal <VOLNAME> split-brain latest-mtime <FILE> -- although I have never tried it myself.

True... yet rsync is far simpler for most users ;)
Best Regards, Strahil Nikolov

On Mon, Jan 11, 2021, 20:51 Jayme <jaymef@gmail.com> wrote:
> Correct me if I'm wrong but according to the docs, there might be a more elegant way of doing something similar with gluster cli ex: gluster volume heal <VOLNAME> split-brain latest-mtime <FILE> -- although I have never tried it myself.
This is the usual way I resolve split brains.
> On Mon, Jan 11, 2021 at 1:50 PM Strahil Nikolov via Users <users@ovirt.org> wrote:
>>> Is this a split brain situation and how can I solve this? I would be very grateful for any help.
>>
>> I've seen it before. Just check on the nodes which brick contains the newest file (there is a timestamp inside it) and then rsync that file from the node with newest version to the rest. If gluster keeps showing that the file is still needing heal - just "cat" it from the FUSE client (the mountpoint in /rhev/....).
>> Best Regards,
>> Strahil Nikolov

> newest file (there is a timestamp inside it) and then rsync that file

They are binary files and I can't seem to find a timestamp. The file consists of 2000 lines; where would I find this timestamp, and what does it look like? Or do you mean the Linux mtime?
Upon further research I noticed that apparently it's not a split-brain:

    Brick storage1:/data/glusterfs/nvme/brick1/brick
    Status: Connected
    Total Number of entries: 0
    Number of entries in heal pending: 0
    Number of entries in split-brain: 0
    Number of entries possibly healing: 0

    Brick storage2:/data/glusterfs/nvme/brick1/brick
    Status: Connected
    Total Number of entries: 3
    Number of entries in heal pending: 3
    Number of entries in split-brain: 0
    Number of entries possibly healing: 0

    Brick storage3:/data/glusterfs/nvme/brick1/brick
    Status: Connected
    Total Number of entries: 3
    Number of entries in heal pending: 3
    Number of entries in split-brain: 0
    Number of entries possibly healing: 0

Should I run the split-brain latest-mtime or rsync command anyway once I know the most recent timestamp? Is there any other way to resolve entries that are stuck in heal pending?
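
For reference, that per-brick summary is the kind of output produced by something like the following (again assuming the volume is named "nvme"):

    gluster volume heal nvme info summary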

On Tue, 12.01.2021 at 12:13 +0000, user-5138@yandex.com wrote:
>> newest file (there is a timestamp inside it) and then rsync that file
>
> They are binary files and I can't seem to find a timestamp. The file consists of 2000 lines, where would I find this timestamp and what does it look like? Or do you mean the linux mtime?

I got confused by the name... I just checked on my cluster and it's binary. Just "stat" the file from the /rhev/data-center/mnt/glusterSD/<host>:_<volume> mount point and it should get healed.

Best Regards,
Strahil Nikolov
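
A minimal sketch of that, with the host and volume from this thread filled in (the exact glusterSD directory name is an assumption; "df -h" on a hypervisor host shows the real mount path):

    stat /rhev/data-center/mnt/glusterSD/storage1:_nvme/c32d664d-69ba-4c3f-8ea1-240133963815/dom_md/ids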

Sadly that did not fix the issue either. I found an old thread and was able to fix the issue by following the advice: https://lists.ovirt.org/pipermail/users/2016-February/038046.html https://lists.ovirt.org/pipermail/users/2016-February/038049.html Thanks for your help everyone.
participants (4):
- Alex K
- Jayme
- Strahil Nikolov
- user-5138@yandex.com