Hi,
So it seems some of the files in the volume have mismatched gfids. I see
the following logs from June 15th, ~8 PM EDT:
<snip>
...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4411: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid split
barin
[2018-06-16 04:00:11.522632] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4493: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid split
barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4575: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4657: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4739: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4821: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4906: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4987: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
...
...
</snip>
Adding Ravi, who works on the replicate component, to help resolve the mismatches.
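In the meantime, one quick check (just a sketch; the path below is assembled from
your brick path and the file path seen in the mount log, so adjust it if your
layout differs) is to compare the trusted.gfid xattr of the file directly on each
brick, as root:

# getfattr -n trusted.gfid -e hex /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace

Running this on all three nodes will show which brick reports a gfid different
from the other two.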
-Krutika
On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay <kdhananj(a)redhat.com>
wrote:
Hi,
Sorry, I was out sick on Friday. I am looking into the logs. Will get back
to you in some time.
-Krutika
On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner <hanson(a)andrewswireless.net
> wrote:
> Hi Krutika,
>
> Did you need any other logs?
>
>
> Thanks,
>
> Hanson
>
> On 06/27/2018 02:04 PM, Hanson Turner wrote:
>
> Hi Krutika,
>
> Looking at the flood of alert emails, it looks like it started at 8:04 PM EDT
> on Jun 15, 2018.
>
> From memory, I think the cluster was working fine until sometime that
> night. Somewhere between midnight and the next (Saturday) morning, the
> engine crashed and all VMs stopped.
>
> I do have nightly backups taken with the engine-backup command. It looks
> like my last valid backup is from 2018-06-15.
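>
> For reference, the nightly run is essentially a plain engine-backup call,
> something along these lines (the file paths here are placeholders, not the
> real ones):
>
> # engine-backup --mode=backup --scope=all --file=/backup/engine.backup --log=/backup/engine-backup.log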
>
> I've included all the logs I think might be of use. Please forgive the use of
> 7zip; the raw logs come to 50 MB, which is over my attachment limit.
>
> I think the gist of what happened is that we had a downed node for a period
> of time. Earlier that day, the node was brought back into service. Later
> that night or early the next morning, the engine was gone and hopping from
> node to node.
>
> I have tried to mount the engine's HDD file to see if I could fix it.
> There are a few corrupted partitions, and those are XFS formatted. Trying
> to mount complains that the filesystem needs repair; trying to repair
> complains that something needs to be cleaned first. I cannot remember
> exactly what it was, but it wanted me to run a command ending in -L to
> clear out the log. I said no way, and have left the engine VM powered down
> and the cluster in global maintenance.
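>
> If it would help, instead of repairing I can attach the disk image read-only
> and mount the XFS partition with norecovery, so nothing touches the log.
> Roughly the following (the image path and loop partition are placeholders):
>
> # losetup --find --show --read-only --partscan /path/to/engine-disk-image
> # mount -o ro,norecovery /dev/loopXpN /mnt/inspect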
>
> I can see no sign of the VM booting (i.e. no networking), except for what
> I've described earlier in the VNC session.
>
>
> Thanks,
>
> Hanson
>
>
>
> On 06/27/2018 12:04 PM, Krutika Dhananjay wrote:
>
> Yeah, complete logs would help. Also let me know when you saw this issue -
> date and approximate time (please specify the timezone as well).
>
> -Krutika
>
> On Wed, Jun 27, 2018 at 7:00 PM, Hanson Turner <
> hanson(a)andrewswireless.net> wrote:
>
>> #more rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net\
>> :_engine.log
>> [2018-06-24 07:39:12.161323] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk]
>> 0-glusterfs: No change in volfile,continuing
>>
>> # more gluster_bricks-engine-engine.log
>> [2018-06-24 07:39:14.194222] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk]
>> 0-glusterfs: No change in volfile,continuing
>> [2018-06-24 19:58:28.608469] E [MSGID: 101063]
>> [event-epoll.c:551:event_dispatch_epoll_handler] 0-epoll: stale fd
>> found on idx=12, gen=1, events=1, slot->gen=3
>> [2018-06-25 14:24:19.716822] I [addr.c:55:compare_addr_and_update]
>> 0-/gluster_bricks/engine/engine: allowed = "*", received addr =
>> "192.168.0.57"
>> [2018-06-25 14:24:19.716868] I [MSGID: 115029]
>> [server-handshake.c:793:server_setvolume] 0-engine-server: accepted
>> client from CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9
>> 901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0 (version: 4.0.2)
>> [2018-06-25 14:45:35.061350] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-engine-server: disconnecting
>> connection from CTX_ID:79b9d5b7-0bbb-4d67-87cf
>> -11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engin
>> e-client-0-RECON_NO:-0
>> [2018-06-25 14:45:35.061415] I [MSGID: 115013]
>> [server-helpers.c:289:do_fd_cleanup] 0-engine-server: fd cleanup on
>> /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/images/82cde976-0650-4
>> db9-9487-e2b52ffe25ee/e53806d9-3de5-4b26-aadc-157d745a9e0a
>> [2018-06-25 14:45:35.062290] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down
>> connection CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9
>> 901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>> [2018-06-25 14:46:34.284195] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-engine-server: disconnecting
>> connection from CTX_ID:13e88614-31e8-4618-9f7f
>> -067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:eng
>> ine-client-0-RECON_NO:-0
>> [2018-06-25 14:46:34.284546] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down
>> connection CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2
>> 615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>
>>
>> # gluster volume info engine
>>
>> Volume Name: engine
>> Type: Replicate
>> Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 3 = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Brick2: ovirtnode3.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Brick3: ovirtnode4.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Options Reconfigured:
>> performance.strict-write-ordering: off
>> server.event-threads: 4
>> client.event-threads: 4
>> features.shard-block-size: 512MB
>> cluster.granular-entry-heal: enable
>> performance.strict-o-direct: off
>> network.ping-timeout: 30
>> storage.owner-gid: 36
>> storage.owner-uid: 36
>> user.cifs: off
>> features.shard: on
>> cluster.shd-wait-qlength: 10000
>> cluster.shd-max-threads: 8
>> cluster.locking-scheme: granular
>> cluster.data-self-heal-algorithm: full
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> cluster.eager-lock: enable
>> network.remote-dio: off
>> performance.low-prio-threads: 32
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> transport.address-family: inet
>> nfs.disable: on
>> performance.client-io-threads: off
>>
>> # gluster --version
>> glusterfs 3.12.9
>> Repository revision: git://git.gluster.org/glusterfs.git
>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> It is licensed to you under your choice of the GNU Lesser
>> General Public License, version 3 or any later version (LGPLv3
>> or later), or the GNU General Public License, version 2 (GPLv2),
>> in all cases as published by the Free Software Foundation.
>>
>> Let me know if you want logs further back; I can attach and send them
>> directly to you.
>>
>> Thanks,
>>
>> Hanson
>>
>>
>>
>> On 06/26/2018 12:30 AM, Krutika Dhananjay wrote:
>>
>> Could you share the gluster mount and brick logs? You'll find them
>> under /var/log/glusterfs.
>> Also, what version of gluster are you using, and what's the output of
>> `gluster volume info <ENGINE_VOLNAME>`?
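>>
>> If it helps, something like this should gather all of it in one go (assuming
>> the volume is named 'engine'; adjust if it isn't):
>>
>> # gluster --version
>> # gluster volume info engine
>> # tar czf /tmp/glusterfs-logs.tar.gz /var/log/glusterfs/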
>>
>> -Krutika
>>
>> On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose <sabose(a)redhat.com> wrote:
>>
>>>
>>>
>>> On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <
>>> hanson(a)andrewswireless.net> wrote:
>>>
>>>> Hi Benny,
>>>>
>>>> Who should I be reaching out to for help with a gluster-based hosted
>>>> engine corruption?
>>>>
>>>
>>>
>>> Krutika, could you help?
>>>
>>>
>>>>
>>>> --== Host 1 status ==--
>>>>
>>>> conf_on_shared_storage : True
>>>> Status up-to-date : True
>>>> Hostname : ovirtnode1.abcxyzdomains.net
>>>> Host ID : 1
>>>> Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
>>>> Score : 3400
>>>> stopped : False
>>>> Local maintenance : False
>>>> crc32 : 92254a68
>>>> local_conf_timestamp : 115910
>>>> Host timestamp : 115910
>>>> Extra metadata (valid at timestamp):
>>>> metadata_parse_version=1
>>>> metadata_feature_version=1
>>>> timestamp=115910 (Mon Jun 18 09:43:20 2018)
>>>> host-id=1
>>>> score=3400
>>>> vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
>>>> conf_on_shared_storage=True
>>>> maintenance=False
>>>> state=GlobalMaintenance
>>>> stopped=False
>>>>
>>>>
>>>> When I VNC into my HE, all I get is:
>>>> Probing EDD (edd=off to disable)... ok
>>>>
>>>>
>>>> So, that's why it's failing the liveliness check... I cannot get the
>>>> screen on the HE to change short of Ctrl-Alt-Del, which will reboot the HE.
>>>> I do have backups for the HE that are/were run on a nightly basis.
>>>>
>>>> If the cluster was left alone, the HE VM would bounce from machine to
>>>> machine trying to boot. This is why the cluster is in maintenance mode.
>>>> One of the nodes was down for a period of time and was brought back;
>>>> sometime through the night, around when the automated backup kicks off,
>>>> the HE started bouncing around. I got nearly 1000 emails.
>>>>
>>>> This seems to be the same error (but may not be the same cause) as
>>>> listed here: https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>>>>
>>>> Thanks,
>>>>
>>>> Hanson
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list -- users(a)ovirt.org
>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives: https://lists.ovirt.org/archives/list/users(a)ovirt.org/message/3NLA2URX3KN44FGFUVV4N5EJBPICABHH/
>>>>
>>>>
>>>
>>
>>
>
>
>