
I've also encountered something similar on my setup: oVirt 3.1.9 with a Gluster 3.12.3 storage cluster. All the storage domains in question are set up as sharded Gluster volumes, and I've enabled libgfapi support in the engine. It's happened primarily to VMs that haven't been restarted to switch to gfapi yet (these still have fuse mounts), but also to one or two VMs that have already been switched to gfapi mounts.

I started updating the storage cluster to Gluster 3.12.6 yesterday and ran into more annoying/bad behavior as well. Many "high disk use" VMs experienced hangs, but not as storage-related pauses; instead they hung and their watchdogs eventually reported CPU hangs. All did eventually resume normal operation, but it was annoying, to be sure. The oVirt Engine also lost contact with all of my VMs (unknown status, "?" in the GUI), even though it still had contact with the hosts. My Gluster cluster reported no errors, volume status was normal, and all peers and bricks were connected. I didn't see anything in the Gluster logs that indicated problems, but there were reports of failed heals that eventually went away.

Seems like something in vdsm and/or libgfapi isn't handling the gfapi mounts well during healing and the related locks, but I can't tell what it is. I've got two more servers in the cluster to upgrade to 3.12.6 yet; I'll keep an eye on more logs while I'm doing it and report back after I get more info.

-Darrell
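[Editor's note: for anyone following along, a minimal sketch of the checks described above. The VM name "my-vm" and the --cver value are placeholders for your own setup, and gfapi only takes effect for a VM after it is powered off and started again.]

    # Enable libgfapi disk access in the engine; VMs pick it up on their next cold restart
    engine-config -s LibgfApiSupported=true --cver=4.2
    systemctl restart ovirt-engine

    # On the host, see whether a running VM is on gfapi or still on the fuse mount:
    #   <disk type='network' ...> with protocol='gluster'    -> gfapi
    #   <disk type='file' ...> with a /rhev/data-center path -> fuse
    virsh -r dumpxml my-vm | grep -E "disk type|protocol|source"

    # Watch for pending heals on the data volume while upgrading bricks
    gluster volume heal data info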
From: Sahina Bose <sabose@redhat.com>
Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
Date: March 22, 2018 at 4:56:13 AM CDT
To: Endre Karlson
Cc: users

Can you provide "gluster volume info" and the mount logs of the data volume (I assume that this hosts the vdisks for the VMs with storage error)?

Also vdsm.log at the corresponding time.

On Fri, Mar 16, 2018 at 3:45 AM, Endre Karlson <endre.karlson@gmail.com> wrote:

Hi, this issue is here again and we are getting several VMs going into storage error in our 4-node cluster running on CentOS 7.4 with Gluster and oVirt 4.2.1.

Gluster version: 3.12.6

volume status:

[root@ovirt3 ~]# gluster volume status
Status of volume: data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick3/data           49152     0          Y       9102
Brick ovirt2:/gluster/brick3/data           49152     0          Y       28063
Brick ovirt3:/gluster/brick3/data           49152     0          Y       28379
Brick ovirt0:/gluster/brick4/data           49153     0          Y       9111
Brick ovirt2:/gluster/brick4/data           49153     0          Y       28069
Brick ovirt3:/gluster/brick4/data           49153     0          Y       28388
Brick ovirt0:/gluster/brick5/data           49154     0          Y       9120
Brick ovirt2:/gluster/brick5/data           49154     0          Y       28075
Brick ovirt3:/gluster/brick5/data           49154     0          Y       28397
Brick ovirt0:/gluster/brick6/data           49155     0          Y       9129
Brick ovirt2:/gluster/brick6_1/data         49155     0          Y       28081
Brick ovirt3:/gluster/brick6/data           49155     0          Y       28404
Brick ovirt0:/gluster/brick7/data           49156     0          Y       9138
Brick ovirt2:/gluster/brick7/data           49156     0          Y       28089
Brick ovirt3:/gluster/brick7/data           49156     0          Y       28411
Brick ovirt0:/gluster/brick8/data           49157     0          Y       9145
Brick ovirt2:/gluster/brick8/data           49157     0          Y       28095
Brick ovirt3:/gluster/brick8/data           49157     0          Y       28418
Brick ovirt1:/gluster/brick3/data           49152     0          Y       23139
Brick ovirt1:/gluster/brick4/data           49153     0          Y       23145
Brick ovirt1:/gluster/brick5/data           49154     0          Y       23152
Brick ovirt1:/gluster/brick6/data           49155     0          Y       23159
Brick ovirt1:/gluster/brick7/data           49156     0          Y       23166
Brick ovirt1:/gluster/brick8/data           49157     0          Y       23173
Self-heal Daemon on localhost               N/A       N/A        Y       7757
Bitrot Daemon on localhost                  N/A       N/A        Y       7766
Scrubber Daemon on localhost                N/A       N/A        Y       7785
Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205
Bitrot Daemon on ovirt2                     N/A       N/A        Y       8216
Scrubber Daemon on ovirt2                   N/A       N/A        Y       8227
Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
Bitrot Daemon on ovirt0                     N/A       N/A        Y       32674
Scrubber Daemon on ovirt0                   N/A       N/A        Y       32712
Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
Bitrot Daemon on ovirt1                     N/A       N/A        Y       31768
Scrubber Daemon on ovirt1                   N/A       N/A        Y       31790

Task Status of Volume data
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 62942ba3-db9e-4604-aa03-4970767f4d67
Status               : completed

Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick1/engine         49158     0          Y       9155
Brick ovirt2:/gluster/brick1/engine         49158     0          Y       28107
Brick ovirt3:/gluster/brick1/engine         49158     0          Y       28427
Self-heal Daemon on localhost               N/A       N/A        Y       7757
Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205

Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: iso
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick2/iso            49159     0          Y       9164
Brick ovirt2:/gluster/brick2/iso            49159     0          Y       28116
Brick ovirt3:/gluster/brick2/iso            49159     0          Y       28436
NFS Server on localhost                     2049      0          Y       7746
Self-heal Daemon on localhost               N/A       N/A        Y       7757
NFS Server on ovirt1                        2049      0          Y       31748
Self-heal Daemon on ovirt1                  N/A       N/A        Y       31759
NFS Server on ovirt0                        2049      0          Y       32656
Self-heal Daemon on ovirt0                  N/A       N/A        Y       32665
NFS Server on ovirt2                        2049      0          Y       8194
Self-heal Daemon on ovirt2                  N/A       N/A        Y       8205

Task Status of Volume iso
------------------------------------------------------------------------------
There are no active volume tasks

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
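[Editor's note: for anyone gathering the data Sahina asked for above, a rough sketch of where it typically lives on a hyperconverged host. These are the usual default paths; the mount-log filename is derived from the storage server and volume name, so the glob below is an assumption to adjust for your setup.]

    # Volume configuration for the data volume
    gluster volume info data

    # Fuse mount log for the data storage domain on the host
    ls /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*_data.log

    # vdsm log on the host that ran the paused VM; also check rotated
    # copies covering the timestamp of the storage error
    ls /var/log/vdsm/vdsm.log*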