gluster self-heal takes cluster offline

15 Mar 2018

      Hi all:

I'm trying to understand why/how (and most importantly, how to fix) an
substantial issue I had last night.  This happened one other time, but I
didn't know/understand all the parts associated with it until last night.

I have a 3 node hyperconverged (self-hosted engine, Gluster on each node)
cluster.  Gluster is Replica 2 + arbitrar.  Current network configuration
is 2x GigE on load balance ("LAG Group" on switch), plus one GigE from each
server on a separate vlan, intended for Gluster (but not used).  Server
hardware is Dell R610's, each server as an SSD in it.  Server 1 and 2 have
the full replica, server 3 is the arbitrar.

I put server 2 into maintence so I can work on the hardware, including turn
it off and such.  In the course of the work, I found that I needed to
reconfigure the SSD's partitioning somewhat, and it resulted in wiping the
data partition (storing VM images).  I figure, its no big deal, gluster
will rebuild that in short order.  I did take care of the extended attr
settings and the like, and when I booted it up, gluster came up as expected
and began rebuilding the disk.

The problem is that suddenly my entire cluster got very sluggish.  The
entine was marking nodes and VMs failed and unfaling them throughout the
system, fairly randomly.  It didn't matter what node the engine or VM was
on.  At one point, it power cycled server 1 for "non-responsive" (even
though everything was running on it, and the gluster rebuild was working on
it).  As a result of this, about 6 VMs were killed and my entire gluster
system went down hard (suspending all remaining VMs and the engine), as
there were no remaining full copies of the data.  After several minutes
(these are Dell servers, after all...), server 1 came back up, and gluster
resumed the rebuild, and came online on the cluster.  I had to manually
(virtsh command) unpause the engine, and then struggle through trying to
get critical VMs back up.  Everything was super slow, and load averages on
the servers were often seen in excess of 80 (these are 8 core / 16 thread
boxes).  Actual CPU usage (reported by top) was rarely above 40% (inclusive
of all CPUs) for any one server. Glusterfs was often seen using 180%-350%
of a CPU on server 1 and 2.

I ended up putting the cluster in global HA maintence mode and disabling
power fencing on the nodes until the process finished.  It appeared on at
least two occasions a functional node was marked bad and had the fencing
not been disabled, a node would have rebooted, just further exacerbating
the problem.

Its clear that the gluster rebuild overloaded things and caused the
problem.  I don't know why the load was so high (even IOWait was low), but
load averages were definately tied to the glusterfs cpu utilization %.   At
no point did I have any problems pinging any machine (host or VM) unless
the engine decided it was dead and killed it.

Why did my system bite it so hard with the rebuild?  I baby'ed it along
until the rebuild was complete, after which it returned to normal operation.

As of this event, all networking (host/engine management, gluster, and VM
network) were on the same vlan.  I'd love to move things off, but so far
any attempt to do so breaks my cluster.  How can I move my management
interfaces to a separate VLAN/IP Space?  I also want to move Gluster to its
own private space, but it seems if I change anything in the peers file, the
entire gluster cluster goes down.  The dedicated gluster network is listed
as a secondary hostname for all peers already.

Will the above network reconfigurations be enough?  I got the impression
that the issue may not have been purely network based, but possibly server
IO overload.  Is this likely / right?

I appreciate input.  I don't think gluster's recovery is supposed to do as
much damage as it did the last two or three times any healing was required.

Thanks!
--Jim

Jim Kusznir

Sahina Bose

Darrell Budic

tags

participants (3)