Hi Ravishankar,
This doesn't look like split-brain...
[root@ovirtnode1 ~]# gluster volume heal engine info
Brick ovirtnode1:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
Brick ovirtnode3:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
Brick ovirtnode4:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
[root@ovirtnode1 ~]# gluster volume heal engine info split-brain
Brick ovirtnode1:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0
Brick ovirtnode3:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0
Brick ovirtnode4:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0
[root@ovirtnode1 ~]# gluster volume info engine
Volume Name: engine
Type: Replicate
Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ovirtnode1:/gluster_bricks/engine/engine
Brick2: ovirtnode3:/gluster_bricks/engine/engine
Brick3: ovirtnode4:/gluster_bricks/engine/engine
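Since heal info reports zero entries while the mount log complains about
a gfid mismatch on hosted-engine.lockspace, one way to double-check is to
compare the trusted.gfid xattr of that file directly on each brick. A
sketch, using standard getfattr and the path from the log lines below
(run on all three nodes):
[root@ovirtnode1 ~]# getfattr -d -m . -e hex \
    /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
If the trusted.gfid values differ between the bricks, that matches the
gfid split-brain the logs report.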
Thanks,
Hanson
On 07/02/2018 07:09 AM, Ravishankar N wrote:
On 07/02/2018 02:15 PM, Krutika Dhananjay wrote:
> Hi,
>
> So it seems some of the files in the volume have mismatching gfids. I
> see the following logs from 15th June, ~8pm EDT:
>
> <snip>
> ...
> ...
> [2018-06-16 04:00:10.264690] E [MSGID: 108008]
> [afr-self-heal-common.c:335:afr_gfid_split_brain_source]
> 0-engine-replicate-0: Gfid mismatch detected for
> <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
> 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
> ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
You can use
https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/
(see section 3, "Resolution of split-brain using gluster CLI").
Nit: the doc says at the beginning that gfid split-brain cannot be
fixed automatically, but newer releases do support it, so the methods
in section 3 should work for resolving gfid split-brains.
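For example, once all bricks are up, picking one brick's copy of the
affected file as the source would look roughly like this (a sketch based
on the CLI method in that doc; the brick and file path are taken from
the logs below and may need adjusting):
# gluster volume heal engine split-brain source-brick \
    ovirtnode1:/gluster_bricks/engine/engine \
    /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
The doc also covers the latest-mtime and bigger-file policies if you
prefer not to pick a brick by hand.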
> [2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4411: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> [2018-06-16 04:00:11.522600] E [MSGID: 108008]
> [afr-self-heal-common.c:212:afr_gfid_split_brain_source]
> 0-engine-replicate-0: All the bricks should be up to resolve the gfid
> split barin
This is a concern. For the commands to work, all 3 bricks must be online.
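A quick way to confirm that all three bricks and the self-heal daemons
are currently online is the standard status command:
# gluster volume status engine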
Thanks,
Ravi
> [2018-06-16 04:00:11.522632] E [MSGID: 108008]
> [afr-self-heal-common.c:335:afr_gfid_split_brain_source]
> 0-engine-replicate-0: Gfid mismatch detected for
> <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
> 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
> ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
> [2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4493: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> [2018-06-16 04:00:12.864393] E [MSGID: 108008]
> [afr-self-heal-common.c:212:afr_gfid_split_brain_source]
> 0-engine-replicate-0: All the bricks should be up to resolve the gfid
> split barin
> [2018-06-16 04:00:12.864426] E [MSGID: 108008]
> [afr-self-heal-common.c:335:afr_gfid_split_brain_source]
> 0-engine-replicate-0: Gfid mismatch detected for
> <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
> 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
> ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
> [2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4575: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> [2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4657: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> [2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4739: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> [2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4821: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> [2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4906: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> [2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk]
> 0-glusterfs-fuse: 4987: LOOKUP()
> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
> => -1 (Input/output error)
> ...
> ...
> </snip>
>
> Adding Ravi, who works on the replicate component, to help resolve the
> mismatches.
>
> -Krutika
>
>
> On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay <kdhananj@redhat.com> wrote:
>
> Hi,
>
> Sorry, I was out sick on Friday. I am looking into the logs. Will
> get back to you in some time.
>
> -Krutika
>
> On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
>
> Hi Krutika,
>
> Did you need any other logs?
>
>
> Thanks,
>
> Hanson
>
>
> On 06/27/2018 02:04 PM, Hanson Turner wrote:
>>
>> Hi Krutika,
>>
>> Looking at the flood of alert emails, it looks like it started at
>> 8:04 PM EDT on Jun 15, 2018.
>>
>> From my memory, I think the cluster was working fine until
>> sometime that night. Somewhere between midnight and the next
>> (Saturday) morning, the engine crashed and all VMs stopped.
>>
>> I do have nightly backups, taken with the engine-backup command.
>> Looks like my last valid backup is from 2018-06-15.
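>> (For reference, the nightly job runs something along these lines;
>> the paths here are only illustrative:)
>> # engine-backup --mode=backup --scope=all \
>>     --file=/backup/engine-backup-$(date +%F).tar.gz \
>>     --log=/backup/engine-backup-$(date +%F).log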
>>
>> I've included all the logs I think might be of use. Please
>> forgive the use of 7zip; the raw logs come to 50 MB, which is
>> over my attachment limit.
>>
>> I think the gist of what happened is that we had a downed node
>> for a period of time. Earlier that day, the node was brought
>> back into service. Later that night or early the next
>> morning, the engine was gone and hopping from node to node.
>>
>> I have tried to mount the engine's HDD file to see if I
>> could fix it. There are a few corrupted partitions, and
>> those are XFS-formatted. Trying to mount gives errors about
>> needing repair; trying to repair gives errors about needing
>> something cleaned first. I cannot remember exactly what it
>> was, but it wanted me to run a command ending in -L to clear
>> out the logs. I said no way and have left the engine VM
>> powered down, and the cluster in global maintenance.
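>> (If it helps to inspect the image again without risking more damage,
>> XFS can be mounted read-only while skipping log replay; the image path
>> and partition number below are hypothetical:)
>> # losetup -fP --show /path/to/engine-disk.img   # maps partitions as /dev/loopNpM
>> # mount -o ro,norecovery /dev/loop0p2 /mnt/inspect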
>>
>> I can see no sign of the vm booting, (ie no networking)
>> except for what I've described earlier in the VNC session.
>>
>>
>> Thanks,
>>
>> Hanson
>>
>>
>>
>> On 06/27/2018 12:04 PM, Krutika Dhananjay wrote:
>>> Yeah, complete logs would help. Also let me know when you
>>> saw this issue - date and approximate time (do specify the
>>> timezone as well).
>>>
>>> -Krutika
>>>
>>> On Wed, Jun 27, 2018 at 7:00 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
>>>
>>> # more rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net\:_...
>>> [2018-06-24 07:39:12.161323] I
>>> [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs:
>>> No change in volfile,continuing
>>>
>>> # more gluster_bricks-engine-engine.log
>>> [2018-06-24 07:39:14.194222] I
>>> [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs:
>>> No change in volfile,continuing
>>> [2018-06-24 19:58:28.608469] E [MSGID: 101063]
>>> [event-epoll.c:551:event_dispatch_epoll_handler]
>>> 0-epoll: stale fd found on idx=12, gen=1, events=1,
>>> slot->gen=3
>>> [2018-06-25 14:24:19.716822] I
>>> [addr.c:55:compare_addr_and_update]
>>> 0-/gluster_bricks/engine/engine: allowed = "*",
>>> received addr = "192.168.0.57"
>>> [2018-06-25 14:24:19.716868] I [MSGID: 115029]
>>> [server-handshake.c:793:server_setvolume]
>>> 0-engine-server: accepted client from
>>> CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>>> (version: 4.0.2)
>>> [2018-06-25 14:45:35.061350] I [MSGID: 115036]
>>> [server.c:527:server_rpc_notify] 0-engine-server:
>>> disconnecting connection from
>>> CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>>> [2018-06-25 14:45:35.061415] I [MSGID: 115013]
>>> [server-helpers.c:289:do_fd_cleanup] 0-engine-server:
>>> fd cleanup on
>>> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/images/82cde976-0650-4db9-9487-e2b52ffe25ee/e53806d9-3de5-4b26-aadc-157d745a9e0a
>>> [2018-06-25 14:45:35.062290] I [MSGID: 101055]
>>> [client_t.c:443:gf_client_unref] 0-engine-server:
>>> Shutting down connection
>>> CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>>> [2018-06-25 14:46:34.284195] I [MSGID: 115036]
>>> [server.c:527:server_rpc_notify] 0-engine-server:
>>> disconnecting connection from
>>> CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>> [2018-06-25 14:46:34.284546] I [MSGID: 101055]
>>> [client_t.c:443:gf_client_unref] 0-engine-server:
>>> Shutting down connection
>>> CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>>
>>>
>>> # gluster volume info engine
>>>
>>> Volume Name: engine
>>> Type: Replicate
>>> Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x 3 = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Brick2: ovirtnode3.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Brick3: ovirtnode4.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>>> Options Reconfigured:
>>> performance.strict-write-ordering: off
>>> server.event-threads: 4
>>> client.event-threads: 4
>>> features.shard-block-size: 512MB
>>> cluster.granular-entry-heal: enable
>>> performance.strict-o-direct: off
>>> network.ping-timeout: 30
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> user.cifs: off
>>> features.shard: on
>>> cluster.shd-wait-qlength: 10000
>>> cluster.shd-max-threads: 8
>>> cluster.locking-scheme: granular
>>> cluster.data-self-heal-algorithm: full
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> cluster.eager-lock: enable
>>> network.remote-dio: off
>>> performance.low-prio-threads: 32
>>> performance.io-cache: off
>>> performance.read-ahead: off
>>> performance.quick-read: off
>>> transport.address-family: inet
>>> nfs.disable: on
>>> performance.client-io-threads: off
>>>
>>> # gluster --version
>>> glusterfs 3.12.9
>>> Repository revision: git://git.gluster.org/glusterfs.git
>>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>>> It is licensed to you under your choice of the GNU Lesser
>>> General Public License, version 3 or any later version
>>> (LGPLv3
>>> or later), or the GNU General Public License, version 2
>>> (GPLv2),
>>> in all cases as published by the Free Software Foundation.
>>>
>>> Let me know if you want log further back, I can attach
>>> and send directly to you.
>>>
>>> Thanks,
>>>
>>> Hanson
>>>
>>>
>>>
>>> On 06/26/2018 12:30 AM, Krutika Dhananjay wrote:
>>>> Could you share the gluster mount and brick logs?
>>>> You'll find them under /var/log/glusterfs.
>>>> Also, what's the version of gluster you're using?
>>>> Also, output of `gluster volume info <ENGINE_VOLNAME>`?
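>>>> (Something along these lines should grab both the mount and brick
>>>> logs from each node; these are the default gluster log locations:)
>>>> # tar czf gluster-logs-$(hostname).tar.gz /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log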
>>>>
>>>> -Krutika
>>>>
>>>> On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose <sabose@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
>>>>
>>>> Hi Benny,
>>>>
>>>> Who should I be reaching out to for help with
>>>> a gluster based hosted engine corruption?
>>>>
>>>>
>>>>
>>>> Krutika, could you help?
>>>>
>>>>
>>>>
>>>> --== Host 1 status ==--
>>>>
>>>> conf_on_shared_storage : True
>>>> Status up-to-date : True
>>>> Hostname : ovirtnode1.abcxyzdomains.net
>>>> Host ID : 1
>>>> Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
>>>> Score : 3400
>>>> stopped : False
>>>> Local maintenance : False
>>>> crc32 : 92254a68
>>>> local_conf_timestamp : 115910
>>>> Host timestamp : 115910
>>>> Extra metadata (valid at timestamp):
>>>> metadata_parse_version=1
>>>> metadata_feature_version=1
>>>> timestamp=115910 (Mon Jun 18 09:43:20 2018)
>>>> host-id=1
>>>> score=3400
>>>> vm_conf_refresh_time=115910 (Mon Jun 18
>>>> 09:43:20 2018)
>>>> conf_on_shared_storage=True
>>>> maintenance=False
>>>> state=GlobalMaintenance
>>>> stopped=False
>>>>
>>>>
>>>> When I VNC into my HE, all I get is:
>>>> Probing EDD (edd=off to disable)... ok
>>>>
>>>>
>>>> So, that's why it's failing the liveliness
>>>> check... I cannot get the screen on the HE to
>>>> change, short of Ctrl-Alt-Del, which will reboot
>>>> the HE.
>>>> I do have backups for the HE that are/were run
>>>> on a nightly basis.
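>>>> (Besides VNC, the serial console sometimes shows more of the boot
>>>> output; this is the standard hosted-engine CLI, run on the host that
>>>> currently holds the VM:)
>>>> # hosted-engine --console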
>>>>
>>>> If the cluster were left alone, the HE VM would
>>>> bounce from machine to machine trying to boot.
>>>> This is why the cluster is in maintenance mode.
>>>> One of the nodes was down for a period of time
>>>> and was brought back. Sometime through the night,
>>>> around when the automated backup kicks in, the
>>>> HE started bouncing around. I got nearly 1000
>>>> emails.
>>>>
>>>> This seems to be the same error (but may not
>>>> be the same cause) as listed here:
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>>>>
>>>> Thanks,
>>>>
>>>> Hanson
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list -- users@ovirt.org
>>>> To unsubscribe send an email to users-leave@ovirt.org
>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/3NLA2URX3KN...
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
>