Hi all:
I'm trying to understand why and how a substantial issue I had last night happened and, most importantly, how to fix it. This happened one other time, but I didn't understand all the parts involved until last night.
I have a 3 node hyperconverged (self-hosted engine, Gluster on each node) cluster. Gluster is Replica 2 + arbiter. Current network configuration is 2x GigE on load balance ("LAG Group" on switch), plus one GigE from each server on a separate vlan, intended for Gluster (but not used). Server hardware is Dell R610s, and each server has an SSD in it. Servers 1 and 2 have the full replica; server 3 is the arbiter.
I put server 2 into maintenance so I could work on the hardware, including turning it off and such. In the course of the work, I found that I needed to reconfigure the SSD's partitioning somewhat, and that resulted in wiping the data partition (which stores VM images). I figured it was no big deal, since gluster would rebuild that in short order. I did take care of the extended attr settings and the like, and when I booted it up, gluster came up as expected and began rebuilding the disk.
The problem is that my entire cluster suddenly got very sluggish. The engine was marking nodes and VMs as failed and then un-failing them throughout the system, fairly randomly. It didn't matter which node the engine or VM was on. At one point, it power cycled server 1 for being "non-responsive" (even though everything was running on it, and the gluster rebuild was working on it). As a result, about 6 VMs were killed and my entire gluster system went down hard (suspending all remaining VMs and the engine), as there were no remaining full copies of the data. After several minutes (these are Dell servers, after all...), server 1 came back up, gluster resumed the rebuild, and the node came back online in the cluster. I had to manually unpause the engine (with the virsh command) and then struggle through getting critical VMs back up.
Everything was super slow, and load averages on the servers often exceeded 80 (these are 8 core / 16 thread boxes). Actual CPU usage (as reported by top) was rarely above 40% (inclusive of all CPUs) for any one server. The glusterfs process was often seen using 180%-350% of a CPU on servers 1 and 2.
I ended up putting the cluster in global HA maintenance mode and disabling power fencing on the nodes until the process finished. On at least two occasions it appeared that a functional node was marked bad; had fencing not been disabled, that node would have been rebooted, further exacerbating the problem.
It's clear that the gluster rebuild overloaded things and caused the problem. I don't know why the load was so high (even IOWait was low), but load averages were definitely tied to the glusterfs CPU utilization %. At no point did I have any problems pinging any machine (host or VM) unless the engine decided it was dead and killed it.
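To put the load figures in perspective, here is a rough sketch of the arithmetic in Python. The helper names are my own (hypothetical, not part of oVirt or Gluster), and the sample line is made up to resemble what /proc/loadavg looks like. As I understand it, the Linux load average counts tasks in uninterruptible sleep as well as runnable ones, so a high load with modest CPU usage is possible; whether that is what happened here is only a guess.

```python
from typing import Tuple


def parse_loadavg(line: str) -> Tuple[float, float, float]:
    """Return the 1, 5 and 15 minute load averages from a /proc/loadavg-style line."""
    one, five, fifteen = line.split()[:3]
    return float(one), float(five), float(fifteen)


def load_per_thread(load: float, threads: int) -> float:
    """Average number of runnable-or-blocked tasks per hardware thread."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return load / threads


if __name__ == "__main__":
    # Made-up sample line; the 80 matches the load I saw, the rest is illustrative.
    sample = "80.00 75.50 60.25 3/1200 12345"
    one_min, _, _ = parse_loadavg(sample)
    # 16 hardware threads per box (8 cores / 16 threads).
    print(load_per_thread(one_min, 16))
```

A load of 80 on a 16 thread box works out to 5 tasks per thread, which is far beyond what 40% CPU usage alone would suggest.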
Why did my system bite it so hard with the rebuild? I babied it along until the rebuild was complete, after which it returned to normal operation.
As of this event, all networking (host/engine management, gluster, and VM network) was on the same vlan. I'd love to move things off, but so far any attempt to do so breaks my cluster. How can I move my management interfaces to a separate VLAN/IP space? I also want to move Gluster to its own private space, but it seems that if I change anything in the peers file, the entire gluster cluster goes down. The dedicated gluster network is already listed as a secondary hostname for all peers.
Will the above network reconfigurations be enough? I got the impression that the issue may not have been purely network-based, but possibly server IO overload. Is this likely / right?
I appreciate input. I don't think gluster's recovery is supposed to do as much damage as it did the last two or three times any healing was required.
Thanks!
--Jim