[ovirt-users] gluster self-heal takes cluster offline

Sahina Bose sabose at redhat.com
Fri Mar 23 06:26:01 UTC 2018


On Fri, Mar 16, 2018 at 2:45 AM, Jim Kusznir <jim at palousetech.com> wrote:

> Hi all:
>
> I'm trying to understand why/how (and most importantly, how to fix) a
> substantial issue I had last night.  This happened one other time, but I
> didn't know/understand all the parts associated with it until last night.
>
> I have a 3 node hyperconverged (self-hosted engine, Gluster on each node)
> cluster.  Gluster is replica 2 + arbiter.  Current network configuration
> is 2x GigE on load balance ("LAG Group" on switch), plus one GigE from each
> server on a separate vlan, intended for Gluster (but not used).  Server
> hardware is Dell R610's; each server has an SSD in it.  Servers 1 and 2 have
> the full replica, server 3 is the arbiter.
>
> I put server 2 into maintenance so I could work on the hardware, including
> turning it off and such.  In the course of the work, I found that I needed
> to reconfigure the SSD's partitioning somewhat, and that resulted in wiping
> the data partition (storing VM images).  I figured it was no big deal;
> gluster would rebuild it in short order.  I did take care of the extended
> attribute settings and the like, and when I booted it up, gluster came up
> as expected and began rebuilding the disk.
>

How big was the data on this partition? What was the shard size set on the
gluster volume?
Out of curiosity, how long did it take to heal and come back to operational?
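Both can be read back from the CLI; a quick sketch, assuming the volume is
named `data` (a placeholder):

```shell
# Configured shard block size (oVirt deployments commonly use 512MB;
# the gluster default is 64MB)
gluster volume get data features.shard-block-size

# Entries still pending heal, listed per brick
gluster volume heal data info

# A rolling count is easier to watch during a long heal
gluster volume heal data statistics heal-count
```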


> The problem is that suddenly my entire cluster got very sluggish.  The
> engine was marking nodes and VMs failed and unfailing them throughout the
> system, fairly randomly.  It didn't matter what node the engine or VM was
> on.  At one point, it power cycled server 1 for "non-responsive" (even
> though everything was running on it, and the gluster rebuild was working on
> it).  As a result of this, about 6 VMs were killed and my entire gluster
> system went down hard (suspending all remaining VMs and the engine), as
> there were no remaining full copies of the data.  After several minutes
> (these are Dell servers, after all...), server 1 came back up, and gluster
> resumed the rebuild, and came online on the cluster.  I had to manually
> (virsh command) unpause the engine, and then struggle through trying to
> get critical VMs back up.  Everything was super slow, and load averages on
> the servers were often seen in excess of 80 (these are 8 core / 16 thread
> boxes).  Actual CPU usage (reported by top) was rarely above 40% (inclusive
> of all CPUs) for any one server. Glusterfs was often seen using 180%-350%
> of a CPU on server 1 and 2.
>
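For reference, unpausing the engine by hand is usually done through virsh
against the local libvirt socket; a sketch, assuming the default VM name
`HostedEngine` (on oVirt hosts a read-write connection asks for the vdsm
SASL credentials):

```shell
# Confirm the engine VM is present and paused on this host (read-only)
virsh -r -c qemu:///system list --all

# Resume the paused engine VM
virsh -c qemu:///system resume HostedEngine
```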
> I ended up putting the cluster in global HA maintenance mode and disabling
> power fencing on the nodes until the process finished.  On at least two
> occasions a functional node was marked bad; had fencing not been disabled,
> that node would have been rebooted, further exacerbating the problem.
>
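Global maintenance was the right call here; for reference, it can be toggled
from any HA host:

```shell
# Stop the HA agents from restarting or migrating the engine while the
# heal runs
hosted-engine --set-maintenance --mode=global

# Return to normal HA behaviour once the heal has completed
hosted-engine --set-maintenance --mode=none
```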
> It's clear that the gluster rebuild overloaded things and caused the
> problem.  I don't know why the load was so high (even IOWait was low), but
> load averages were definitely tied to the glusterfs CPU utilization %.   At
> no point did I have any problems pinging any machine (host or VM) unless
> the engine decided it was dead and killed it.
>
> Why did my system bite it so hard with the rebuild?  I babied it along
> until the rebuild was complete, after which it returned to normal operation.
>
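The self-heal daemon can be throttled so a rebuild does not starve running
VMs; a sketch of the knobs usually tuned on hyperconverged setups, assuming
the volume is named `data` (values are illustrative, not a recommendation
for this cluster):

```shell
# Limit parallel self-heal threads per brick (default is 1; raising it
# speeds healing but increases load, so keep it low on busy clusters)
gluster volume set data cluster.shd-max-threads 1

# With sharding enabled, full-file heal of each small shard is usually
# cheaper than the diff algorithm's checksumming
gluster volume set data cluster.data-self-heal-algorithm full
```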
> As of this event, all networking (host/engine management, gluster, and VM
> network) were on the same vlan.  I'd love to move things off, but so far
> any attempt to do so breaks my cluster.  How can I move my management
> interfaces to a separate VLAN/IP Space?  I also want to move Gluster to its
> own private space, but it seems if I change anything in the peers file, the
> entire gluster cluster goes down.  The dedicated gluster network is listed
> as a secondary hostname for all peers already.
>
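Editing the files under /var/lib/glusterd/peers by hand is indeed unsafe;
the supported way to attach an extra address to an existing peer is another
`peer probe` with the new hostname. A sketch, with placeholder hostnames:

```shell
# Probing an already-known peer by its storage-network name adds that
# name as an alternate address for the peer (run from any other node)
gluster peer probe server2-gluster.storage.lan

# Verify both hostnames are now listed for the peer
gluster peer status
```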
> Will the above network reconfigurations be enough?  I got the impression
> that the issue may not have been purely network based, but possibly server
> IO overload.  Is this likely / right?
>
> I appreciate input.  I don't think gluster's recovery is supposed to do as
> much damage as it did the last two or three times any healing was required.
>
> Thanks!
> --Jim
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>
>