Hi,
So it seems some of the files in the volume have mismatching gfids. I
see the following logs from 15th June, ~8pm EDT:
<snip>
...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
(See section 3, "Resolution of split-brain using gluster CLI", in the
split-brain doc.)
Nit: the doc says at the beginning that gfid split-brain cannot be
fixed automatically, but newer releases do support it, so the methods
in section 3 should work to resolve gfid split-brains.
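For reference, the CLI methods in that section boil down to picking a
good copy of each mismatching file. A rough sketch using the volume
name, file path and brick names seen later in this thread (which
policy to use, or which brick holds the good copy, is a per-file
decision, so treat these purely as templates):

# gluster volume heal engine split-brain latest-mtime /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace

or, to explicitly pick the brick whose copy should win:

# gluster volume heal engine split-brain source-brick ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace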
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4411: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid
split barin
This is a concern. For the commands to work, all 3 bricks must be
online.
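A quick way to confirm that before retrying (standard gluster CLI,
with the volume name used in this thread):

# gluster volume status engine
# gluster volume heal engine info

All three bricks should show "Y" in the Online column of the status
output; heal info lists the entries that are still pending heal.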
Thanks,
Ravi
[2018-06-16 04:00:11.522632] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4493: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid
split barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4575: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4657: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4739: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4821: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4906: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4987: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
...
...
</snip>
Adding Ravi, who works on the replicate component, to help resolve the
mismatches.
-Krutika
On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay
<kdhananj@redhat.com> wrote:
Hi,
Sorry, I was out sick on Friday. I am looking into the logs. Will
get back to you in some time.
-Krutika
On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner
<hanson@andrewswireless.net> wrote:
Hi Krutika,
Did you need any other logs?
Thanks,
Hanson
On 06/27/2018 02:04 PM, Hanson Turner wrote:
>
> Hi Krutika,
>
> Looking at the flood of alert emails, it looks like it started at
> 8:04PM EDT on Jun 15 2018.
>
> From my memory, I think the cluster was working fine until
> sometime that night. Somewhere between midnight and the next
> (Saturday) morning, the engine crashed and all VMs stopped.
>
> I do have nightly backups, taken with the
> engine-backup command. Looks like my last valid backup was
> 2018-06-15.
>
> I've included all logs I think might be of use. Please
> forgive the use of 7zip; the raw logs came to 50 MB, which is
> greater than my attachment limit.
>
> I think the gist of what happened is that we had a downed node
> for a period of time. Earlier that day, the node was brought
> back into service. Later that night or early the next
> morning, the engine was gone and hopping from node to node.
>
> I have tried to mount the engine's hdd file to see if I could
> fix it. There are a few corrupted partitions, and those are
> xfs formatted. Trying to mount gives me errors about needing
> repair, and trying to repair gives me errors about something
> needing to be cleaned first. I cannot remember exactly what it
> was, but it wanted me to run a command ending in -L to clear
> out the logs. I said no way and have left the engine vm in a
> powered-down state, as well as the cluster in global maintenance.
>
> I can see no sign of the vm booting (i.e. no networking),
> except for what I've described earlier in the VNC session.
>
>
> Thanks,
>
> Hanson
>
>
>
> On 06/27/2018 12:04 PM, Krutika Dhananjay wrote:
>> Yeah, complete logs would help. Also let me know when you
>> saw this issue - date and approx. time (do specify the
>> timezone as well).
>>
>> -Krutika
>>
>> On Wed, Jun 27, 2018 at 7:00 PM, Hanson Turner
>> <hanson@andrewswireless.net> wrote:
>>
>> # more
>> rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net\:_...
>> [2018-06-24 07:39:12.161323] I
>> [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs:
>> No change in volfile,continuing
>>
>> # more gluster_bricks-engine-engine.log
>> [2018-06-24 07:39:14.194222] I
>> [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs:
>> No change in volfile,continuing
>> [2018-06-24 19:58:28.608469] E [MSGID: 101063]
>> [event-epoll.c:551:event_dispatch_epoll_handler]
>> 0-epoll: stale fd found on idx=12, gen=1, events=1,
>> slot->gen=3
>> [2018-06-25 14:24:19.716822] I
>> [addr.c:55:compare_addr_and_update]
>> 0-/gluster_bricks/engine/engine: allowed = "*", received
>> addr = "192.168.0.57"
>> [2018-06-25 14:24:19.716868] I [MSGID: 115029]
>> [server-handshake.c:793:server_setvolume]
>> 0-engine-server: accepted client from
>>
CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>> (version: 4.0.2)
>> [2018-06-25 14:45:35.061350] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-engine-server:
>> disconnecting connection from
>>
CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>> [2018-06-25 14:45:35.061415] I [MSGID: 115013]
>> [server-helpers.c:289:do_fd_cleanup] 0-engine-server: fd
>> cleanup on
>>
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/images/82cde976-0650-4db9-9487-e2b52ffe25ee/e53806d9-3de5-4b26-aadc-157d745a9e0a
>> [2018-06-25 14:45:35.062290] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-engine-server:
>> Shutting down connection
>>
CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>> [2018-06-25 14:46:34.284195] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-engine-server:
>> disconnecting connection from
>>
CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>> [2018-06-25 14:46:34.284546] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-engine-server:
>> Shutting down connection
>>
CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>
>>
>> # gluster volume info engine
>>
>> Volume Name: engine
>> Type: Replicate
>> Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 3 = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1:
>> ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Brick2:
>> ovirtnode3.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Brick3:
>> ovirtnode4.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Options Reconfigured:
>> performance.strict-write-ordering: off
>> server.event-threads: 4
>> client.event-threads: 4
>> features.shard-block-size: 512MB
>> cluster.granular-entry-heal: enable
>> performance.strict-o-direct: off
>> network.ping-timeout: 30
>> storage.owner-gid: 36
>> storage.owner-uid: 36
>> user.cifs: off
>> features.shard: on
>> cluster.shd-wait-qlength: 10000
>> cluster.shd-max-threads: 8
>> cluster.locking-scheme: granular
>> cluster.data-self-heal-algorithm: full
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> cluster.eager-lock: enable
>> network.remote-dio: off
>> performance.low-prio-threads: 32
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> transport.address-family: inet
>> nfs.disable: on
>> performance.client-io-threads: off
>>
>> # gluster --version
>> glusterfs 3.12.9
>> Repository revision: git://git.gluster.org/glusterfs.git
>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> It is licensed to you under your choice of the GNU Lesser
>> General Public License, version 3 or any later version (LGPLv3
>> or later), or the GNU General Public License, version 2 (GPLv2),
>> in all cases as published by the Free Software Foundation.
>>
>> Let me know if you want logs further back; I can attach
>> and send them directly to you.
>>
>> Thanks,
>>
>> Hanson
>>
>>
>>
>> On 06/26/2018 12:30 AM, Krutika Dhananjay wrote:
>>> Could you share the gluster mount and brick logs?
>>> You'll find them under /var/log/glusterfs.
>>> Also, what's the version of gluster you're using?
>>> Also, output of `gluster volume info <ENGINE_VOLNAME>`?
>>>
>>> -Krutika
>>>
>>> On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose
>>> <sabose@redhat.com> wrote:
>>>
>>>
>>>
>>> On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner
>>> <hanson@andrewswireless.net> wrote:
>>>
>>> Hi Benny,
>>>
>>> Who should I be reaching out to for help with a
>>> gluster based hosted engine corruption?
>>>
>>>
>>>
>>> Krutika, could you help?
>>>
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> conf_on_shared_storage : True
>>> Status up-to-date : True
>>> Hostname : ovirtnode1.abcxyzdomains.net
>>> Host ID : 1
>>> Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 92254a68
>>> local_conf_timestamp : 115910
>>> Host timestamp : 115910
>>> Extra metadata (valid at timestamp):
>>> metadata_parse_version=1
>>> metadata_feature_version=1
>>> timestamp=115910 (Mon Jun 18 09:43:20 2018)
>>> host-id=1
>>> score=3400
>>> vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
>>> conf_on_shared_storage=True
>>> maintenance=False
>>> state=GlobalMaintenance
>>> stopped=False
>>>
>>>
>>> When I VNC into my HE, all I get is:
>>> Probing EDD (edd=off to disable)... ok
>>>
>>>
>>> So, that's why it's failing the liveliness
>>> check... I cannot get the screen on the HE to
>>> change short of Ctrl-Alt-Del, which will reboot
>>> the HE.
>>> I do have backups for the HE that are/were run
>>> on a nightly basis.
>>>
>>> If the cluster was left alone, the HE vm would
>>> bounce from machine to machine trying to boot.
>>> This is why the cluster is in maintenance mode.
>>> One of the nodes was down for a period of time
>>> and was brought back. Sometime through the night,
>>> around when the automated backup kicks in, the
>>> HE started bouncing around. Got nearly 1000 emails.
>>>
>>> This seems to be the same error (but may not be
>>> the same cause) as listed here:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>>>
>>> Thanks,
>>>
>>> Hanson
>>>
>>>
>>>
>>>
>>>
>>
>>
>