
Hi,

So it seems some of the files in the volume have mismatching gfids. I see the following logs from 15th June, ~8pm EDT:

<snip>
...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-engine-replicate-0: Gfid mismatch detected for <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>, 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4411: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008] [afr-self-heal-common.c:212:afr_gfid_split_brain_source] 0-engine-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-06-16 04:00:11.522632] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-engine-replicate-0: Gfid mismatch detected for <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>, 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4493: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008] [afr-self-heal-common.c:212:afr_gfid_split_brain_source] 0-engine-replicate-0: All the bricks should be up to resolve the gfid split barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008] [afr-self-heal-common.c:335:afr_gfid_split_brain_source] 0-engine-replicate-0: Gfid mismatch detected for <gfid:941edf0c-d363-488e-a333-d12320f96480>/hosted-engine.lockspace>, 6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4575: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4657: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4739: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4821: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4906: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 4987: LOOKUP() /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace => -1 (Input/output error)
...
...
</snip>

Adding Ravi, who works on the replicate component, to help resolve the mismatches.

-Krutika

On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay <kdhananj@redhat.com> wrote:
Hi,
Sorry, I was out sick on Friday. I am looking into the logs. Will get back to you in some time.
-Krutika
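For anyone wanting to confirm the mismatch by hand: the gfids the self-heal log compares live in the trusted.gfid xattr of the file on each brick, and the usual manual fix for a gfid split-brain is to remove the stale copy (and its .glusterfs hard link) from the bad brick, then let self-heal recreate it. This is only a sketch, using the brick paths from the `gluster volume info` output further down; which copy is actually stale still has to be established, and the choice of Brick1/ef21a706-... below is purely illustrative:

# On each node, compare trusted.gfid for the affected file (run as root):
getfattr -d -m . -e hex /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
# On the brick holding the stale copy ONLY (illustration: Brick1, gfid ef21a706-...):
rm /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
rm /gluster_bricks/engine/engine/.glusterfs/ef/21/ef21a706-41cf-4519-8659-87ecde4bbfbf
# With all three bricks up (the log above insists on this) and the cluster
# still in global maintenance, trigger a heal:
gluster volume heal engine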
On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
Hi Krutika,
Did you need any other logs?
Thanks,
Hanson
On 06/27/2018 02:04 PM, Hanson Turner wrote:
Hi Krutika,
Looking at the flood of alert emails, it looks like it started at 8:04 PM EDT on Jun 15 2018.
From my memory, the cluster was working fine until sometime that night. Somewhere between midnight and the next (Saturday) morning, the engine crashed and all VMs stopped.
I do have backups that ran every night, using the engine-backup command. It looks like my last valid backup was from 2018-06-15.
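For reference, a nightly run of the sort described here typically looks something like the following (the destination paths are illustrative):

# Nightly backup of the engine database and configuration:
engine-backup --mode=backup --scope=all --file=/backup/engine-backup-$(date +%F).tar.gz --log=/backup/engine-backup-$(date +%F).log

A matching restore would be `engine-backup --mode=restore --file=...`, with the provisioning flags depending on the setup.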
I've included all logs I think might be of use. Please forgive the use of 7zip, as the raw logs take 50 MB, which is more than my attachment limit.
I think the gist of what happened is that we had a downed node for a period of time. Earlier that day, the node was brought back into service. Later that night or early the next morning, the engine was gone and hopping from node to node.
I have tried to mount the engine's hdd file to see if I could fix it. There are a few corrupted partitions, and those are xfs-formatted. Trying to mount complains that the filesystem needs repair; trying to repair complains that something needs to be cleaned first. I cannot remember exactly what it was, but it wanted me to run a command ending in -L to clear out the logs. I said no way, and have left the engine VM in a powered-down state, and the cluster in global maintenance.
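For the record, that -L prompt is almost certainly xfs_repair refusing to touch a dirty journal. A sketch of the usual sequence, with illustrative image and device names, working on a copy of the disk image rather than the original (xfs_repair -L zeroes the log and can lose whatever was in it, so declining was reasonable):

# Always work on a copy of the engine disk image, never the only copy.
cp /path/to/engine-disk.img /root/engine-disk-copy.img
# Attach the copy with partition scanning; prints the device, e.g. /dev/loop0.
losetup -fP --show /root/engine-disk-copy.img
# Dry run first: -n reports problems without modifying anything.
xfs_repair -n /dev/loop0p2
# A clean mount/umount replays the journal, which often clears the complaint:
mount /dev/loop0p2 /mnt && umount /mnt
# Only if mounting fails and you accept losing unreplayed journal entries:
xfs_repair -L /dev/loop0p2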
I can see no sign of the VM booting (i.e., no networking), apart from what I've described earlier in the VNC session.
Thanks,
Hanson
On 06/27/2018 12:04 PM, Krutika Dhananjay wrote:
Yeah, complete logs would help. Also let me know when you saw this issue: date and approximate time (do specify the timezone as well).
-Krutika
On Wed, Jun 27, 2018 at 7:00 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
# more rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net\:_engine.log
[2018-06-24 07:39:12.161323] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
# more gluster_bricks-engine-engine.log
[2018-06-24 07:39:14.194222] I [glusterfsd-mgmt.c:1888:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2018-06-24 19:58:28.608469] E [MSGID: 101063] [event-epoll.c:551:event_dispatch_epoll_handler] 0-epoll: stale fd found on idx=12, gen=1, events=1, slot->gen=3
[2018-06-25 14:24:19.716822] I [addr.c:55:compare_addr_and_update] 0-/gluster_bricks/engine/engine: allowed = "*", received addr = "192.168.0.57"
[2018-06-25 14:24:19.716868] I [MSGID: 115029] [server-handshake.c:793:server_setvolume] 0-engine-server: accepted client from CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0 (version: 4.0.2)
[2018-06-25 14:45:35.061350] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-engine-server: disconnecting connection from CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
[2018-06-25 14:45:35.061415] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-engine-server: fd cleanup on /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/images/82cde976-0650-4db9-9487-e2b52ffe25ee/e53806d9-3de5-4b26-aadc-157d745a9e0a
[2018-06-25 14:45:35.062290] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down connection CTX_ID:79b9d5b7-0bbb-4d67-87cf-11e27dfb6c1d-GRAPH_ID:0-PID:9901-HOST:sp3Kali-PC_NAME:engine-client-0-RECON_NO:-0
[2018-06-25 14:46:34.284195] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-engine-server: disconnecting connection from CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
[2018-06-25 14:46:34.284546] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-engine-server: Shutting down connection CTX_ID:13e88614-31e8-4618-9f7f-067750f5971e-GRAPH_ID:0-PID:2615-HOST:workbench-PC_NAME:engine-client-0-RECON_NO:-0
# gluster volume info engine
Volume Name: engine
Type: Replicate
Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ovirtnode1.core.abcxyzdomains.net:/gluster_bricks/engine/engine
Brick2: ovirtnode3.core.abcxyzdomains.net:/gluster_bricks/engine/engine
Brick3: ovirtnode4.core.abcxyzdomains.net:/gluster_bricks/engine/engine
Options Reconfigured:
performance.strict-write-ordering: off
server.event-threads: 4
client.event-threads: 4
features.shard-block-size: 512MB
cluster.granular-entry-heal: enable
performance.strict-o-direct: off
network.ping-timeout: 30
storage.owner-gid: 36
storage.owner-uid: 36
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
# gluster --version
glusterfs 3.12.9
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.
Let me know if you want logs further back; I can attach and send them directly to you.
Thanks,
Hanson
On 06/26/2018 12:30 AM, Krutika Dhananjay wrote:
Could you share the gluster mount and brick logs? You'll find them under /var/log/glusterfs. Also, what version of gluster are you using, and what's the output of `gluster volume info <ENGINE_VOLNAME>`?
-Krutika
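For anyone hunting for the same files: the FUSE mount log is named after the mount path, and the brick logs live under bricks/. A sketch using the filenames that appear elsewhere in this thread (yours will vary with hostnames and brick paths):

# FUSE mount log for the engine volume:
less /var/log/glusterfs/rhev-data-center-mnt-glusterSD-ovirtnode1.abcxyzdomains.net:_engine.log
# Brick log:
less /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log
# Bundle the whole directory for an attachment:
tar czf /tmp/gluster-logs.tar.gz /var/log/glusterfs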
On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose <sabose@redhat.com> wrote:
On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <hanson@andrewswireless.net> wrote:
Hi Benny,
Who should I be reaching out to for help with a gluster based hosted engine corruption?
Krutika, could you help?
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : ovirtnode1.abcxyzdomains.net
Host ID                : 1
Engine status          : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : 92254a68
local_conf_timestamp   : 115910
Host timestamp         : 115910
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=115910 (Mon Jun 18 09:43:20 2018)
    host-id=1
    score=3400
    vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
    conf_on_shared_storage=True
    maintenance=False
    state=GlobalMaintenance
    stopped=False
When I VNC into my HE, all I get is:

Probing EDD (edd=off to disable)... ok
So, that's why it's failing the liveliness check... I cannot get the screen on the HE to change, short of ctrl-alt-del, which will reboot the HE. I do have backups for the HE that were run on a nightly basis.
If the cluster was left alone, the HE VM would bounce from machine to machine trying to boot; this is why the cluster is in maintenance mode. One of the nodes was down for a period of time and then brought back. Sometime through the night (which is when the automated backup kicks in), the HE started bouncing around. I got nearly 1000 emails.
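For anyone in the same spot: the status dump above is from `hosted-engine --vm-status`, and putting the cluster in global maintenance (as was done here) is what stops the HA agents from bouncing the engine VM around. Illustrative invocations, run on any hosted-engine host:

# Check HA state on all hosts:
hosted-engine --vm-status
# Stop the HA agents from restarting/migrating the engine VM:
hosted-engine --set-maintenance --mode=global
# Later, resume normal HA operation:
hosted-engine --set-maintenance --mode=none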
This seems to be the same error (but may not be the same cause) as listed here: https://bugzilla.redhat.com/show_bug.cgi?id=1569827
Thanks,
Hanson
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/3NLA2URX3KN44FGFUVV4N5EJBPICABHH/