Gluster volume heals, but after 5 seconds /dom_md/ids is dirty again

I restored my engine to a gluster volume named :/engine on a three-node hyperconverged oVirt 4.3.3.1 cluster. Before restoring I checked the status of the volumes. They were clean: no heal entries, all peers connected, and "gluster volume status" looked good. Then I restored. This went well and the engine is up. But the engine gluster volume shows heal entries on node02 and node03. The engine was installed on node01. I still have to deploy the engine to the other two hosts to reach full HA, but I expect maintenance is not possible until the volume is healed.

I tried "gluster volume heal engine", also with "full" added. The heal entries disappear for a few seconds, and then /dom_md/ids pops up again; __DIRECT_IO_TEST__ joins later. The split-brain info has no entries. Is this some kind of hidden split brain? Maybe there is data on the node01 brick that was not synced to the other two nodes? I can only speculate. The Gluster docs say this should heal, but it doesn't. I have two other volumes, and those are fine; one of them contains 3 VMs that are running. I also tried shutting down the engine so no one was using the volume, then healing. Same effect: those two files always show up, but no others. Heal can always be started successfully from any of the participating nodes.

Reset the volume bricks one by one and cross fingers?

[root@node03 ~]# gluster volume heal engine info
Brick node01.infra.solutions.work:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0

Brick node02.infra.solutions.work:/gluster_bricks/engine/engine
/9f4d5ae9-e01d-4b73-8b6d-e349279e9782/dom_md/ids
/__DIRECT_IO_TEST__
Status: Connected
Number of entries: 2

Brick node03.infra.solutions.work:/gluster_bricks/engine/engine
/9f4d5ae9-e01d-4b73-8b6d-e349279e9782/dom_md/ids
/__DIRECT_IO_TEST__
Status: Connected
Number of entries: 2
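For anyone chasing the same symptom: the heal bookkeeping behind these entries can be inspected directly on the bricks. A diagnostic sketch, using the brick path from the output above:

# run on each of the three nodes, against the local brick copy of the file
getfattr -d -m . -e hex /gluster_bricks/engine/engine/9f4d5ae9-e01d-4b73-8b6d-e349279e9782/dom_md/ids
# non-zero trusted.afr.engine-client-N values mark pending changes against
# brick N; comparing all three nodes shows which copy gluster considers stale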

I see this sometimes after rebooting a server, and it usually stops happening, generally within a few hours; I’ve never tracked it down further than that. I don’t know for sure, but I assume it’s related to healing and goes away once everything syncs up. Occasionally it turns out to be a communications problem between servers (usually an update to something screws up my firewall), so whenever I see it I check my peer status and make sure all servers are talking to each other.
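Those checks are quick to run from any node (a sketch; the volume name is the one from this thread):

# every peer should show State: Peer in Cluster (Connected)
gluster peer status
# every brick and the self-heal daemon should be listed as online
gluster volume status engine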

I've been having problems with my gluster Engine volume recently as well, after updating to the latest stable 4.3.3. For the past few days I've seen a random brick in the Engine volume go down, and I have to force start it to get it working again. Right now I'm seeing that there are unsynced entries, and one node is showing "transport endpoint not connected", even though peer status is fine and all other volumes are working normally.
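For reference, the force start mentioned here is the stock gluster command; it restarts any brick processes that are down without touching the running ones:

gluster volume start engine force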

Yes, after a reboot you could have a sync issue for up to a few hours. But this issue has now persisted for 24 days. Additionally, I see errors in the glustershd.log of the two hosts that show heal info for that volume. The first node shows as OK and has no errors in its glustershd.log. The errors are like this:

[2019-05-13 15:18:40.808945] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-engine-client-1: remote operation failed. Path: <gfid:95ba9fb2-b0ae-436c-9c31-2779cf202235> (95ba9fb2-b0ae-436c-9c31-2779cf202235) [No such file or directory]
[2019-05-13 15:18:40.809113] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-engine-client-2: remote operation failed. Path: <gfid:95ba9fb2-b0ae-436c-9c31-2779cf202235> (95ba9fb2-b0ae-436c-9c31-2779cf202235) [No such file or directory]

Looks like the first node is sane and the other two are the masters, but are not so sane. :-/
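The gfid in those log lines can be mapped back to a file on the brick: every file has a hard link under the brick's hidden .glusterfs directory, pathed by the first two byte-pairs of the gfid. A diagnostic sketch using the gfid from the log above:

# run on node02/node03 against the local brick
ls -l /gluster_bricks/engine/engine/.glusterfs/95/ba/95ba9fb2-b0ae-436c-9c31-2779cf202235
# if the link exists, resolve the real file name through the shared inode
find /gluster_bricks/engine/engine -samefile /gluster_bricks/engine/engine/.glusterfs/95/ba/95ba9fb2-b0ae-436c-9c31-2779cf202235

A missing link here would fit the "No such file or directory" in the log.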

What version of gluster are you running at the moment?

I encountered serious issues with 5.3-5.5 (crashing bricks, multiple brick processes for the same brick causing disconnects and excessive heals). I had better luck with 5.6, although it’s not clear to me if the duplicate brick process issue is still present in that version. I finally jumped to 6 which has been more stable for me. I’d recommend upgrading at least to 5.6 if not going right to 6.1.
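The duplicate brick process symptom is easy to check for, in case that is what is happening here (a sketch; healthy is exactly one glusterfsd per brick path):

# list brick processes for the engine volume on each node
ps ax | grep '[g]lusterfsd' | grep engine
# two processes claiming the same brick path means the duplicate-brick bug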
On May 13, 2019, at 10:30 AM, Andreas Elvers <andreas.elvers+ovirtforum@solutions.work> wrote:
I'm running glusterfs-5.5-1.el7.x86_64, the one that comes with oVirt Node 4.3.3.1.

Please note that I am running a hyperconverged Node NG setup. I understand that upgrading single components is not really possible with oVirt Node NG, and could probably break the datacenter upgrade path. Could you point out some references for your suggestions? Docs, bug reports, or the like?

oVirt just pulls in the gluster5 repos; if you upgrade now, you should get gluster 5.6 on your nodes. If you're running them on CentOS, you can install centos-release-gluster6 to go to gluster 6. oVirt Node NG is a different story, as you mention, but I believe you can still run an update on it to get the latest gluster version? Those recommendations are based on my personal experience, but see also:
https://bugzilla.redhat.com/show_bug.cgi?id=1683602
https://bugzilla.redhat.com/show_bug.cgi?id=1677319
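On plain CentOS hosts, that upgrade path would look roughly like this (a sketch, not applicable to Node NG images; run one node at a time and let heals finish in between):

# switch to the Storage SIG gluster 6 repo
yum install centos-release-gluster6
# pull in the newer gluster packages
yum update 'glusterfs*'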

I use Node NG as well. I just updated to 4.3.3 two days ago and I'm on Gluster 5.5. Yum update on the host node yields no updates available.

Ah, must be more fixed than I thought. I don’t have a NodeNG setup to examine, so I’m afraid I won’t have many more suggestions.

There are plans for Node NG 4.3.5 to switch to Gluster v6. So far I still have unsynced entries with Gluster 5.5. I tried to reset every brick, but that left me with the same heal info for that volume. I'm currently re-installing a node that has heal info on the gluster volume; let's see what happens after that. See: https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/IVTYCFKCQ5P2WZ4...
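For completeness, the brick reset mentioned above follows this two-step pattern (a sketch using node02's brick from earlier in the thread; giving the same source and destination reuses the existing brick path):

# take the brick offline
gluster volume reset-brick engine node02.infra.solutions.work:/gluster_bricks/engine/engine start
# bring it back and let self-heal repopulate it from the good copies
gluster volume reset-brick engine node02.infra.solutions.work:/gluster_bricks/engine/engine node02.infra.solutions.work:/gluster_bricks/engine/engine commit force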
participants (3)
- Andreas Elvers
- Darrell Budic
- Jayme