Hi,
"[2018-06-16 04:00:10.264690] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0."
Are the gfids actually different on the bricks as this message says? If
yes, then the commands shared earlier should have fixed it.
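For reference, one way to compare the gfids directly on the bricks is to read
the trusted.gfid xattr of the file on each node. This is only a sketch, with
the brick path taken from the gluster volume info output further down this
thread:

# Run on each of the three nodes against its local brick; the hex value
# printed is the gfid without dashes, so it can be compared with the
# gfids in the log message above.
getfattr -n trusted.gfid -e hex \
    /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace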
-Ravi
On 07/02/2018 02:15 PM, Krutika Dhananjay wrote:
Hi,
So it seems some of the files in the volume have mismatching gfids. I
see the following logs from 15th June, ~8pm EDT:
<snip>
...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4411: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid
split barin
[2018-06-16 04:00:11.522632] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4493: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid
split barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
<gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4575: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4657: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4739: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4821: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4906: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4987: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
=> -1 (Input/output error)
...
...
</snip>
Adding Ravi, who works on the replicate component, to help resolve the
mismatches.
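For reference, the usual manual fix for a gfid split-brain, as described in
the gluster split-brain documentation, is roughly the sketch below. It assumes
all three bricks are up and that you have already confirmed which brick holds
the bad copy (for example by comparing the trusted.gfid xattr of the file on
each brick); the choice of ovirtnode1 and the gfid ef21a706-... below is only
an illustration, not a conclusion from these logs.

# On the node whose brick holds the BAD copy only - verify this first,
# since these commands delete that brick's copy of the file:
rm /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
rm /gluster_bricks/engine/engine/.glusterfs/ef/21/ef21a706-41cf-4519-8659-87ecde4bbfbf
# Then trigger a heal so the good copy is replicated back onto that brick:
gluster volume heal engine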
-Krutika
On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay <kdhananj@redhat.com> wrote:
Hi,
Sorry, I was out sick on Friday. I am looking into the logs. Will
get back to you in some time.
-Krutika
On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
Hi Krutika,
Did you need any other logs?
Thanks,
Hanson
On 06/27/2018 02:04 PM, Hanson Turner wrote:
>
> Hi Krutika,
>
> Looking at the email spams, it looks like it started at
> 8:04PM EDT on Jun 15 2018.
>
> From my memory, I think the cluster was working fine until
> sometime that night. Somewhere between midnight and the next
> (Saturday) morning, the engine crashed and all VMs stopped.
>
> I do have nightly backups that ran every night, using the
> engine-backup command. Looks like my last valid backup was
> 2018-06-15.
>
> I've included all logs I think might be of use. Please
> forgive the use of 7zip, as the raw logs took 50 MB, which is
> greater than my attachment limit.
>
> I think the gist of what happened is that we had a downed node
> for a period of time. Earlier that day, the node was brought
> back into service. Later that night or early the next
> morning, the engine was gone and hopping from node to node.
>
> I have tried to mount the engine's HDD file to see if I could
> fix it. There are a few corrupted partitions, and those are
> XFS formatted. Trying to mount gives me errors saying the
> filesystem needs repair, and trying to repair gives me errors
> about something needing to be cleaned first. I cannot remember
> exactly what it was, but it wanted me to run a command that
> ended in -L to clear out the logs. I said no way and have left
> the engine VM in a powered-down state, as well as the cluster
> in global maintenance.
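For what it's worth, the command being referred to above is presumably
xfs_repair with its -L option, which zeroes the XFS log. A rough sketch of
that repair path, run against a copy of the disk image rather than the
original, might look like this (the image name, loop device, and partition
number are placeholders):

# Attach a copy of the engine disk image and scan its partitions
losetup -fP --show engine-disk-copy.img    # prints e.g. /dev/loop0
# A plain repair refuses to run while the XFS log is dirty and points at -L
xfs_repair /dev/loop0p2
# Last resort: -L zeroes the log and can discard recent metadata changes
xfs_repair -L /dev/loop0p2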
>
> I can see no sign of the VM booting (i.e. no networking),
> except for what I've described earlier in the VNC session.
>
>
> Thanks,
>
> Hanson
>
>
>
> On 06/27/2018 12:04 PM, Krutika Dhananjay wrote:
>> Yeah, complete logs would help. Also let me know when you
>> saw this issue - date and approx. time (do specify the
>> timezone as well).
>>
>> -Krutika
>>
>> On Wed, Jun 27, 2018 at 7:00 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
>>
>> # more rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net\:_...
>> [2018-06-24 07:39:12.161323] I
>> [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs:
>> No change in volfile,continuing
>>
>> # more gluster_bricks-engine-engine.log
>> [2018-06-24 07:39:14.194222] I
>> [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs:
>> No change in volfile,continuing
>> [2018-06-24 19:58:28.608469] E [MSGID: 101063]
>> [event-epoll.c:551:event_dispatch_epoll_handler]
>> 0-epoll: stale fd found on idx=12, gen=1, events=1,
>> slot->gen=3
>> [2018-06-25 14:24:19.716822] I
>> [addr.c:55:compare_addr_and_update]
>> 0-/gluster_bricks/engine/engine: allowed = "*", received
>> addr = "192.168.0.57"
>> [2018-06-25 14:24:19.716868] I [MSGID: 115029]
>> [server-handshake.c:793:server_setvolume]
>> 0-engine-server: accepted client from
>>
CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>> (version: 4.0.2)
>> [2018-06-25 14:45:35.061350] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-engine-server:
>> disconnecting connection from
>>
CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>> [2018-06-25 14:45:35.061415] I [MSGID: 115013]
>> [server-helpers.c:289:do_fd_cleanup] 0-engine-server: fd
>> cleanup on
>>
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/images/82cde976-0650-4db9-9487-e2b52ffe25ee/e53806d9-3de5-4b26-aadc-157d745a9e0a
>> [2018-06-25 14:45:35.062290] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-engine-server:
>> Shutting down connection
>>
CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
>> [2018-06-25 14:46:34.284195] I [MSGID: 115036]
>> [server.c:527:server_rpc_notify] 0-engine-server:
>> disconnecting connection from
>>
CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>> [2018-06-25 14:46:34.284546] I [MSGID: 101055]
>> [client_t.c:443:gf_client_unref] 0-engine-server:
>> Shutting down connection
>>
CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
>>
>>
>> # gluster volume info engine
>>
>> Volume Name: engine
>> Type: Replicate
>> Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 3 = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1:
>> ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Brick2:
>> ovirtnode3.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Brick3:
>> ovirtnode4.core.abcxyzdomains.net:/gluster_bricks/engine/engine
>> Options Reconfigured:
>> performance.strict-write-ordering: off
>> server.event-threads: 4
>> client.event-threads: 4
>> features.shard-block-size: 512MB
>> cluster.granular-entry-heal: enable
>> performance.strict-o-direct: off
>> network.ping-timeout: 30
>> storage.owner-gid: 36
>> storage.owner-uid: 36
>> user.cifs: off
>> features.shard: on
>> cluster.shd-wait-qlength: 10000
>> cluster.shd-max-threads: 8
>> cluster.locking-scheme: granular
>> cluster.data-self-heal-algorithm: full
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> cluster.eager-lock: enable
>> network.remote-dio: off
>> performance.low-prio-threads: 32
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> transport.address-family: inet
>> nfs.disable: on
>> performance.client-io-threads: off
>>
>> # gluster --version
>> glusterfs 3.12.9
>> Repository revision: git://git.gluster.org/glusterfs.git
>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> It is licensed to you under your choice of the GNU Lesser
>> General Public License, version 3 or any later version (LGPLv3
>> or later), or the GNU General Public License, version 2 (GPLv2),
>> in all cases as published by the Free Software Foundation.
>>
>> Let me know if you want logs further back; I can attach
>> and send them directly to you.
>>
>> Thanks,
>>
>> Hanson
>>
>>
>>
>> On 06/26/2018 12:30 AM, Krutika Dhananjay wrote:
>>> Could you share the gluster mount and brick logs?
>>> You'll find them under /var/log/glusterfs.
>>> Also, what's the version of gluster you're using?
>>> Also, output of `gluster volume info <ENGINE_VOLNAME>`?
>>>
>>> -Krutika
>>>
>>> On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose <sabose@redhat.com> wrote:
>>>
>>>
>>>
>>> On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
>>>
>>> Hi Benny,
>>>
>>> Who should I be reaching out to for help with a
>>> gluster based hosted engine corruption?
>>>
>>>
>>>
>>> Krutika, could you help?
>>>
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> conf_on_shared_storage : True
>>> Status up-to-date : True
>>> Hostname : ovirtnode1.abcxyzdomains.net
>>> Host ID : 1
>>> Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 92254a68
>>> local_conf_timestamp : 115910
>>> Host timestamp : 115910
>>> Extra metadata (valid at timestamp):
>>> metadata_parse_version=1
>>> metadata_feature_version=1
>>> timestamp=115910 (Mon Jun 18 09:43:20 2018)
>>> host-id=1
>>> score=3400
>>> vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
>>> conf_on_shared_storage=True
>>> maintenance=False
>>> state=GlobalMaintenance
>>> stopped=False
>>>
>>>
>>> When I VNC into my HE, all I get is:
>>> Probing EDD (edd=off to disable)... ok
>>>
>>>
>>> So, that's why it's failing the liveliness
>>> check... I cannot get the screen on the HE to
>>> change short of ctrl-alt-del, which will reboot
>>> the HE.
>>> I do have backups for the HE that are/were run
>>> on a nightly basis.
>>>
>>> If the cluster was left alone, the HE VM would
>>> bounce from machine to machine trying to boot.
>>> This is why the cluster is in maintenance mode.
>>> One of the nodes was down for a period of time
>>> and was brought back into service. Sometime
>>> through the night, around when the automated
>>> backup kicks off, the HE started bouncing
>>> around. I got nearly 1000 emails.
>>>
>>> This seems to be the same error (but may not be
>>> the same cause) as listed here:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>>>
>>> Thanks,
>>>
>>> Hanson
>>>
>>>
>>>
>>>
>>
>>
>