Found (and caused) my problem.

I'd been evaluating different settings for (default settings shown):

cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024

and had forgotten to reset them after testing. I had them at max-threads 8 and qlength 10000.

It worked, in that the cluster healed in approximately half the time, and it was a total failure, in that my cluster experienced IO pauses and at least one abnormal VM shutdown.

I have 6-core processors in these boxes, and it looks like I just overloaded them to the point that normal IO wasn't getting serviced because the self-heal was getting too much priority. I've reverted to the defaults for these, and things are now behaving normally, with no pauses during healing at all.
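For reference, the current values can be checked and put back to their defaults per volume; a rough sketch, with the volume name as a placeholder:

  # show the current values (defaults are 1 and 1024)
  gluster volume get <volname> cluster.shd-max-threads
  gluster volume get <volname> cluster.shd-wait-qlength

  # return them to the defaults
  gluster volume reset <volname> cluster.shd-max-threads
  gluster volume reset <volname> cluster.shd-wait-qlength

Setting them explicitly with "gluster volume set <volname> cluster.shd-max-threads 1" should accomplish the same thing.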
<span style="margin: -1.3px 0.0px 0.0px 0.0px" class=""><font face="Helvetica" size="4" color="#000000" style="font: 13.0px Helvetica; color: #000000" class=""><b class="">Subject:</b> Re: [ovirt-users] Ovirt vm's paused due to storage error</font></span><br class="">
<span style="margin: -1.3px 0.0px 0.0px 0.0px" class=""><font face="Helvetica" size="4" color="#000000" style="font: 13.0px Helvetica; color: #000000" class=""><b class="">Date:</b> March 22, 2018 at 1:23:29 PM CDT</font></span><br class="">
<span style="margin: -1.3px 0.0px 0.0px 0.0px" class=""><font face="Helvetica" size="4" color="#000000" style="font: 13.0px Helvetica; color: #000000" class=""><b class="">To:</b> users</font></span><br class="">
<br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=utf-8" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">I’ve also encounter something similar on my setup, ovirt 3.1.9 with a gluster 3.12.3 storage cluster. All the storage domains in question are setup as gluster volumes & sharded, and I’ve enabled libgfapi support in the engine. It’s happened primarily to VMs that haven’t been restarted to switch to gfapi yet (still have fuse mounts for these), but one or two VMs that have been switched to gfapi mounts as well.<div class=""><br class=""></div><div class="">I started updating the storage cluster to gluster 3.12.6 yesterday and got more annoying/bad behavior as well. Many VMs that were “high disk use” VMs experienced hangs, but not as storage related pauses. Instead, they hang and their watchdogs eventually reported CPU hangs. All did eventually resume normal operation, but it was annoying, to be sure. The Ovirt Engine also lost contact with all of my VMs (unknown status, ? in GUI), even though it still had contact with the hosts. My gluster cluster reported no errors, volume status was normal, and all peers and bricks were connected. Didn’t see anything in the gluster logs that indicated problems, but there were reports of failed heals that eventually went away. </div><div class=""><br class=""></div><div class="">Seems like something in vdsm and/or libgfapi isn’t handling the gfapi mounts well during healing and the related locks, but I can’t tell what it is. I’ve got two more servers in the cluster to upgrade to 3.12.6 yet, and I’ll keep an eye on more logs while I’m doing it, will report on it after I get more info.</div><div class=""><br class=""><div class=""><blockquote type="cite" class=""></blockquote> -Darrell<br class=""><blockquote type="cite" class=""><hr style="border:none;border-top:solid #B5C4DF 1.0pt;padding:0 0 0 0;margin:10px 0 5px 0;" class=""><span style="margin: -1.3px 0.0px 0.0px 0.0px" id="RwhHeaderAttributes" class=""><font face="Helvetica" size="4" style="font-style: normal; font-variant-caps: normal; font-weight: normal; font-stretch: normal; font-size: 13px; line-height: normal; font-family: Helvetica;" class=""><b class="">From:</b> Sahina Bose <<a href="mailto:sabose@redhat.com" class="">sabose@redhat.com</a>></font></span><br class="">
<span style="margin: -1.3px 0.0px 0.0px 0.0px" class=""><font face="Helvetica" size="4" style="font-style: normal; font-variant-caps: normal; font-weight: normal; font-stretch: normal; font-size: 13px; line-height: normal; font-family: Helvetica;" class=""><b class="">Subject:</b> Re: [ovirt-users] Ovirt vm's paused due to storage error</font></span><br class="">
<span style="margin: -1.3px 0.0px 0.0px 0.0px" class=""><font face="Helvetica" size="4" style="font-style: normal; font-variant-caps: normal; font-weight: normal; font-stretch: normal; font-size: 13px; line-height: normal; font-family: Helvetica;" class=""><b class="">Date:</b> March 22, 2018 at 4:56:13 AM CDT</font></span><br class="">
<span style="margin: -1.3px 0.0px 0.0px 0.0px" class=""><font face="Helvetica" size="4" style="font-style: normal; font-variant-caps: normal; font-weight: normal; font-stretch: normal; font-size: 13px; line-height: normal; font-family: Helvetica;" class=""><b class="">To:</b> Endre Karlson</font></span><br class="">
<span style="margin: -1.3px 0.0px 0.0px 0.0px" class=""><font face="Helvetica" size="4" style="font-style: normal; font-variant-caps: normal; font-weight: normal; font-stretch: normal; font-size: 13px; line-height: normal; font-family: Helvetica;" class=""><b class="">Cc:</b> users</font></span><br class="">
<br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="">Can you provide "gluster volume info" and the mount logs of the data volume (I assume that this hosts the vdisks for the VM's with storage error).<br class=""><br class=""></div>Also vdsm.log at the corresponding time.<br class=""></div><div class="gmail_extra"><br class=""><div class="gmail_quote">On Fri, Mar 16, 2018 at 3:45 AM, Endre Karlson <span dir="ltr" class=""><<a href="mailto:endre.karlson@gmail.com" target="_blank" class="">endre.karlson@gmail.com</a>></span> wrote:<br class=""><blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex" class="gmail_quote"><div dir="ltr" class="">Hi, this is is here again and we are getting several vm's going into storage error in our 4 node cluster running on centos 7.4 with gluster and ovirt 4.2.1.<div class=""><br class=""></div><div class="">Gluster version: 3.12.6<br class=""></div><div class=""><br class=""></div><div class="">volume status</div><div class=""><div class="">[root@ovirt3 ~]# gluster volume status</div><div class="">Status of volume: data</div><div class="">Gluster process TCP Port RDMA Port Online Pid</div><div class="">------------------------------<wbr class="">------------------------------<wbr class="">------------------</div><div class="">Brick ovirt0:/gluster/brick3/data 49152 0 Y 9102 </div><div class="">Brick ovirt2:/gluster/brick3/data 49152 0 Y 28063</div><div class="">Brick ovirt3:/gluster/brick3/data 49152 0 Y 28379</div><div class="">Brick ovirt0:/gluster/brick4/data 49153 0 Y 9111 </div><div class="">Brick ovirt2:/gluster/brick4/data 49153 0 Y 28069</div><div class="">Brick ovirt3:/gluster/brick4/data 49153 0 Y 28388</div><div class="">Brick ovirt0:/gluster/brick5/data 49154 0 Y 9120 </div><div class="">Brick ovirt2:/gluster/brick5/data 49154 0 Y 28075</div><div class="">Brick ovirt3:/gluster/brick5/data 49154 0 Y 28397</div><div class="">Brick ovirt0:/gluster/brick6/data 49155 0 Y 9129 </div><div class="">Brick ovirt2:/gluster/brick6_1/data 49155 0 Y 28081</div><div class="">Brick ovirt3:/gluster/brick6/data 49155 0 Y 28404</div><div class="">Brick ovirt0:/gluster/brick7/data 49156 0 Y 9138 </div><div class="">Brick ovirt2:/gluster/brick7/data 49156 0 Y 28089</div><div class="">Brick ovirt3:/gluster/brick7/data 49156 0 Y 28411</div><div class="">Brick ovirt0:/gluster/brick8/data 49157 0 Y 9145 </div><div class="">Brick ovirt2:/gluster/brick8/data 49157 0 Y 28095</div><div class="">Brick ovirt3:/gluster/brick8/data 49157 0 Y 28418</div><div class="">Brick ovirt1:/gluster/brick3/data 49152 0 Y 23139</div><div class="">Brick ovirt1:/gluster/brick4/data 49153 0 Y 23145</div><div class="">Brick ovirt1:/gluster/brick5/data 49154 0 Y 23152</div><div class="">Brick ovirt1:/gluster/brick6/data 49155 0 Y 23159</div><div class="">Brick ovirt1:/gluster/brick7/data 49156 0 Y 23166</div><div class="">Brick ovirt1:/gluster/brick8/data 49157 0 Y 23173</div><div class="">Self-heal Daemon on localhost N/A N/A Y 7757 </div><div class="">Bitrot Daemon on localhost N/A N/A Y 7766 </div><div class="">Scrubber Daemon on localhost N/A N/A Y 7785 </div><div class="">Self-heal Daemon on ovirt2 N/A N/A Y 8205 </div><div class="">Bitrot Daemon on ovirt2 N/A N/A Y 8216 </div><div class="">Scrubber Daemon on ovirt2 N/A N/A Y 8227 </div><div class="">Self-heal Daemon on ovirt0 N/A N/A Y 32665</div><div class="">Bitrot Daemon on ovirt0 N/A N/A Y 32674</div><div class="">Scrubber Daemon on ovirt0 N/A N/A Y 32712</div><div 
class="">Self-heal Daemon on ovirt1 N/A N/A Y 31759</div><div class="">Bitrot Daemon on ovirt1 N/A N/A Y 31768</div><div class="">Scrubber Daemon on ovirt1 N/A N/A Y 31790</div><div class=""> </div><div class="">Task Status of Volume data</div><div class="">------------------------------<wbr class="">------------------------------<wbr class="">------------------</div><div class="">Task : Rebalance </div><div class="">ID : 62942ba3-db9e-4604-aa03-<wbr class="">4970767f4d67</div><div class="">Status : completed </div><div class=""> </div><div class="">Status of volume: engine</div><div class="">Gluster process TCP Port RDMA Port Online Pid</div><div class="">------------------------------<wbr class="">------------------------------<wbr class="">------------------</div><div class="">Brick ovirt0:/gluster/brick1/engine 49158 0 Y 9155 </div><div class="">Brick ovirt2:/gluster/brick1/engine 49158 0 Y 28107</div><div class="">Brick ovirt3:/gluster/brick1/engine 49158 0 Y 28427</div><div class="">Self-heal Daemon on localhost N/A N/A Y 7757 </div><div class="">Self-heal Daemon on ovirt1 N/A N/A Y 31759</div><div class="">Self-heal Daemon on ovirt0 N/A N/A Y 32665</div><div class="">Self-heal Daemon on ovirt2 N/A N/A Y 8205 </div><div class=""> </div><div class="">Task Status of Volume engine</div><div class="">------------------------------<wbr class="">------------------------------<wbr class="">------------------</div><div class="">There are no active volume tasks</div><div class=""> </div><div class="">Status of volume: iso</div><div class="">Gluster process TCP Port RDMA Port Online Pid</div><div class="">------------------------------<wbr class="">------------------------------<wbr class="">------------------</div><div class="">Brick ovirt0:/gluster/brick2/iso 49159 0 Y 9164 </div><div class="">Brick ovirt2:/gluster/brick2/iso 49159 0 Y 28116</div><div class="">Brick ovirt3:/gluster/brick2/iso 49159 0 Y 28436</div><div class="">NFS Server on localhost 2049 0 Y 7746 </div><div class="">Self-heal Daemon on localhost N/A N/A Y 7757 </div><div class="">NFS Server on ovirt1 2049 0 Y 31748</div><div class="">Self-heal Daemon on ovirt1 N/A N/A Y 31759</div><div class="">NFS Server on ovirt0 2049 0 Y 32656</div><div class="">Self-heal Daemon on ovirt0 N/A N/A Y 32665</div><div class="">NFS Server on ovirt2 2049 0 Y 8194 </div><div class="">Self-heal Daemon on ovirt2 N/A N/A Y 8205 </div><div class=""> </div><div class="">Task Status of Volume iso</div><div class="">------------------------------<wbr class="">------------------------------<wbr class="">------------------</div><div class="">There are no active volume tasks</div></div><div class=""><br class=""></div></div>
<br class="">______________________________<wbr class="">_________________<br class="">
Users mailing list<br class="">
<a href="mailto:Users@ovirt.org" class="">Users@ovirt.org</a><br class="">
<a href="http://lists.ovirt.org/mailman/listinfo/users" rel="noreferrer" target="_blank" class="">http://lists.ovirt.org/<wbr class="">mailman/listinfo/users</a><br class="">
<br class=""></blockquote></div><br class=""></div>
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users