
What version of ovirt and gluster? Sounds like something I just saw with
gluster 3.12.x, are you using libgfapi or just fuse mounts?

From: Sahina Bose <sabose@redhat.com>
Subject: Re: [ovirt-users] gluster self-heal takes cluster offline
Date: March 23, 2018 at 1:26:01 AM CDT
To: Jim Kusznir
Cc: Ravishankar Narayanankutty; users

On Fri, Mar 16, 2018 at 2:45 AM, Jim Kusznir <jim@palousetech.com> wrote:

> Hi all:
>
> I'm trying to understand why/how (and most importantly, how to fix) a
> substantial issue I had last night. This happened one other time, but I
> didn't know/understand all the parts associated with it until last night.
>
> I have a 3-node hyperconverged (self-hosted engine, Gluster on each node)
> cluster. Gluster is Replica 2 + arbiter. The current network
> configuration is 2x GigE in load balance ("LAG group" on the switch),
> plus one GigE from each server on a separate VLAN, intended for Gluster
> (but not used). Server hardware is Dell R610s; each server has an SSD in
> it. Servers 1 and 2 have the full replica, server 3 is the arbiter.
>
> I put server 2 into maintenance so I could work on the hardware,
> including turning it off and such. In the course of the work, I found
> that I needed to reconfigure the SSD's partitioning somewhat, and it
> resulted in wiping the data partition (storing VM images). I figured it
> was no big deal; gluster would rebuild that in short order. I did take
> care of the extended attribute settings and the like, and when I booted
> it up, gluster came up as expected and began rebuilding the disk.

How big was the data on this partition? What was the shard size set on the
gluster volume?
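(For reference, both of those can be read straight off the volume. A minimal
check, assuming the volume is named "data"; substitute the real volume name:

  # shard settings (shard-block-size only matters if sharding is enabled)
  gluster volume get data features.shard
  gluster volume get data features.shard-block-size

  # entries still pending heal, per brick
  gluster volume heal data info

With sharding enabled, only the shards that changed need healing rather than
whole multi-GB image files, which usually shortens the answer to the "how
long" question considerably.)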
Out of curiosity, how long did it take to heal and come back to operational?

> The problem is that suddenly my entire cluster got very sluggish. The
> engine was marking nodes and VMs failed and un-failing them throughout
> the system, fairly randomly. It didn't matter what node the engine or VM
> was on. At one point, it power cycled server 1 for "non-responsive" (even
> though everything was running on it, and the gluster rebuild was working
> on it). As a result of this, about 6 VMs were killed and my entire
> gluster system went down hard (suspending all remaining VMs and the
> engine), as there were no remaining full copies of the data. After
> several minutes (these are Dell servers, after all...), server 1 came
> back up, gluster resumed the rebuild, and it came back online in the
> cluster. I had to manually (virsh command) unpause the engine, and then
> struggle through trying to get critical VMs back up. Everything was super
> slow, and load averages on the servers were often seen in excess of 80
> (these are 8 core / 16 thread boxes). Actual CPU usage (reported by top)
> was rarely above 40% (inclusive of all CPUs) for any one server.
> Glusterfs was often seen using 180%-350% of a CPU on servers 1 and 2.
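(For reference, checking on and resuming a paused hosted engine from a host
shell looks roughly like this; the domain name "HostedEngine" is the usual
default but should be verified locally, and non-read-only virsh operations on
an oVirt node need the vdsm libvirt SASL credentials:

  # read-only listing works without credentials
  virsh -r list --all

  # resuming requires authenticating to libvirt; the hosted engine domain
  # is typically named "HostedEngine" (confirm against the list output)
  virsh resume HostedEngine
)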
> I ended up putting the cluster in global HA maintenance mode and
> disabling power fencing on the nodes until the process finished. It
> appeared that on at least two occasions a functional node was marked bad,
> and had fencing not been disabled, a node would have rebooted, just
> further exacerbating the problem.
>
> It's clear that the gluster rebuild overloaded things and caused the
> problem. I don't know why the load was so high (even IOWait was low), but
> load averages were definitely tied to the glusterfs CPU utilization %. At
> no point did I have any problems pinging any machine (host or VM) unless
> the engine decided it was dead and killed it.
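(The global HA maintenance step mentioned above can be toggled from any
hosted-engine host; a minimal sketch below. The power-management/fencing
toggle itself is per host in the Administration Portal (Edit Host, Power
Management) rather than on the command line.

  hosted-engine --vm-status                      # shows whether global maintenance is active
  hosted-engine --set-maintenance --mode=global  # HA agents stop restarting/migrating the engine VM
  # ... do the disruptive work ...
  hosted-engine --set-maintenance --mode=none
)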
> Why did my system bite it so hard with the rebuild? I babied it along
> until the rebuild was complete, after which it returned to normal
> operation.
>
> As of this event, all networking (host/engine management, gluster, and VM
> network) was on the same VLAN. I'd love to move things off, but so far
> any attempt to do so breaks my cluster. How can I move my management
> interfaces to a separate VLAN/IP space? I also want to move Gluster to
> its own private space, but it seems if I change anything in the peers
> file, the entire gluster cluster goes down. The dedicated gluster network
> is listed as a secondary hostname for all peers already.
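(A quick way to see whether the dedicated hostnames are actually known to
gluster, and whether the bricks reference them; "data" is again just a
placeholder volume name. Gluster traffic only moves to the storage VLAN if
the brick definitions, or the name resolution for them, point at the
storage-network addresses, and hand-editing the files under /var/lib/glusterd
is what tends to take the whole peer group down.

  gluster pool list
  gluster peer status          # each peer lists alternate hostnames under "Other names:"
  gluster volume info data | grep -i brick
)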
> Will the above network reconfigurations be enough? I got the impression
> that the issue may not have been purely network based, but possibly
> server IO overload. Is this likely / right?
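(If it is heal-induced overload rather than the network, the self-heal daemon
can be throttled per volume. These tunables exist in gluster 3.12; the values
below are only illustrative, and the oVirt/gdeploy "virt" profile often raises
them, so check the current settings first:

  gluster volume get data cluster.shd-max-threads
  gluster volume get data cluster.shd-wait-qlength

  # fewer parallel heals means less CPU/IO pressure, at the cost of a longer heal
  gluster volume set data cluster.shd-max-threads 1
)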
> I appreciate input. I don't think gluster's recovery is supposed to do as
> much damage as it did the last two or three times any healing was
> required.
>
> Thanks!
> --Jim

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users