I've also encountered something similar on my setup: ovirt 3.1.9 with a Gluster 3.12.3 storage cluster. All the storage domains in question are set up as sharded Gluster volumes, and I've enabled libgfapi support in the engine. It has happened primarily to VMs that haven't been restarted to switch to gfapi yet (these still have FUSE mounts), but also to one or two VMs that have already been switched to gfapi mounts.
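(In case it helps compare setups: the way I've been checking which access method a given VM is actually using is the libvirt disk definition on the host; the VM name below is just a placeholder, adjust to taste.)

# gfapi disks show up as <disk type='network'> with protocol='gluster';
# FUSE-backed disks show a plain file path under /rhev/data-center/mnt/glusterSD/
virsh -r dumpxml <vm-name> | grep -A3 '<disk'

# on the engine host, confirm libgfapi support is actually enabled
engine-config -g LibgfApiSupported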
I started updating the storage cluster to Gluster 3.12.6 yesterday and ran into more annoying/bad behavior as well. Many of the "high disk use" VMs experienced hangs, but not as storage-related pauses. Instead, they hung and their watchdogs eventually reported CPU hangs. All of them did eventually resume normal operation, but it was annoying, to be sure. The oVirt Engine also lost contact with all of my VMs (unknown status, ? in the GUI), even though it still had contact with the hosts. My Gluster cluster reported no errors, volume status was normal, and all peers and bricks were connected. I didn't see anything in the Gluster logs that indicated problems, but there were reports of failed heals that eventually went away.
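(For what it's worth, between node upgrades I've been checking heal state with the standard gluster CLI before moving on to the next server; nothing setup-specific here, "data" is just the volume name:)

gluster volume heal data info              # entries still pending heal, per brick
gluster volume heal data info split-brain  # should be empty on a healthy volume
gluster volume status data                 # confirm all bricks and self-heal daemons are online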
It seems like something in vdsm and/or libgfapi isn't handling the gfapi mounts well during healing and the related locks, but I can't tell what it is. I've still got two more servers in the cluster to upgrade to 3.12.6, and I'll keep an eye on more logs while I'm doing it; I'll report back once I have more info.

-Darrell
From: Sahina Bose <sabose@redhat.com>
Subject: Re: [ovirt-users] Ovirt vm's paused due to storage error
Date: March 22, 2018 at 4:56:13 AM CDT
To: Endre Karlson
Cc: users
Can you provide "gluster volume info" and the mount logs of the data volume (I assume this is the volume that hosts the vdisks for the VMs with the storage error)?

Also vdsm.log at the corresponding time.
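(For reference, on a typical oVirt host the requested logs usually live at the paths below; the exact mount-log filename depends on the storage domain's mount point, so treat these as a starting point rather than exact paths:)

gluster volume info data                                   # run on any gluster peer
/var/log/vdsm/vdsm.log                                     # on the host that ran the paused VMs
/var/log/glusterfs/rhev-data-center-mnt-glusterSD-<server>:_<volume>.log   # FUSE mount log on the host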
On Fri, Mar 16, 2018 at 3:45 AM, Endre Karlson <endre.karlson@gmail.com> wrote:
Hi, this is here again and we are getting several VMs going into storage error in our 4-node cluster running on CentOS 7.4 with Gluster and oVirt 4.2.1.

Gluster version: 3.12.6

volume status
[root@ovirt3 ~]# gluster volume status
Status of volume: data
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick3/data             49152     0          Y       9102
Brick ovirt2:/gluster/brick3/data             49152     0          Y       28063
Brick ovirt3:/gluster/brick3/data             49152     0          Y       28379
Brick ovirt0:/gluster/brick4/data             49153     0          Y       9111
Brick ovirt2:/gluster/brick4/data             49153     0          Y       28069
Brick ovirt3:/gluster/brick4/data             49153     0          Y       28388
Brick ovirt0:/gluster/brick5/data             49154     0          Y       9120
Brick ovirt2:/gluster/brick5/data             49154     0          Y       28075
Brick ovirt3:/gluster/brick5/data             49154     0          Y       28397
Brick ovirt0:/gluster/brick6/data             49155     0          Y       9129
Brick ovirt2:/gluster/brick6_1/data           49155     0          Y       28081
Brick ovirt3:/gluster/brick6/data             49155     0          Y       28404
Brick ovirt0:/gluster/brick7/data             49156     0          Y       9138
Brick ovirt2:/gluster/brick7/data             49156     0          Y       28089
Brick ovirt3:/gluster/brick7/data             49156     0          Y       28411
Brick ovirt0:/gluster/brick8/data             49157     0          Y       9145
Brick ovirt2:/gluster/brick8/data             49157     0          Y       28095
Brick ovirt3:/gluster/brick8/data             49157     0          Y       28418
Brick ovirt1:/gluster/brick3/data             49152     0          Y       23139
Brick ovirt1:/gluster/brick4/data             49153     0          Y       23145
Brick ovirt1:/gluster/brick5/data             49154     0          Y       23152
Brick ovirt1:/gluster/brick6/data             49155     0          Y       23159
Brick ovirt1:/gluster/brick7/data             49156     0          Y       23166
Brick ovirt1:/gluster/brick8/data             49157     0          Y       23173
Self-heal Daemon on localhost                 N/A       N/A        Y       7757
Bitrot Daemon on localhost                    N/A       N/A        Y       7766
Scrubber Daemon on localhost                  N/A       N/A        Y       7785
Self-heal Daemon on ovirt2                    N/A       N/A        Y       8205
Bitrot Daemon on ovirt2                       N/A       N/A        Y       8216
Scrubber Daemon on ovirt2                     N/A       N/A        Y       8227
Self-heal Daemon on ovirt0                    N/A       N/A        Y       32665
Bitrot Daemon on ovirt0                       N/A       N/A        Y       32674
Scrubber Daemon on ovirt0                     N/A       N/A        Y       32712
Self-heal Daemon on ovirt1                    N/A       N/A        Y       31759
Bitrot Daemon on ovirt1                       N/A       N/A        Y       31768
Scrubber Daemon on ovirt1                     N/A       N/A        Y       31790

Task Status of Volume data
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 62942ba3-db9e-4604-aa03-4970767f4d67
Status               : completed

Status of volume: engine
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick1/engine           49158     0          Y       9155
Brick ovirt2:/gluster/brick1/engine           49158     0          Y       28107
Brick ovirt3:/gluster/brick1/engine           49158     0          Y       28427
Self-heal Daemon on localhost                 N/A       N/A        Y       7757
Self-heal Daemon on ovirt1                    N/A       N/A        Y       31759
Self-heal Daemon on ovirt0                    N/A       N/A        Y       32665
Self-heal Daemon on ovirt2                    N/A       N/A        Y       8205

Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: iso
Gluster process                               TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirt0:/gluster/brick2/iso              49159     0          Y       9164
Brick ovirt2:/gluster/brick2/iso              49159     0          Y       28116
Brick ovirt3:/gluster/brick2/iso              49159     0          Y       28436
NFS Server on localhost                       2049      0          Y       7746
Self-heal Daemon on localhost                 N/A       N/A        Y       7757
NFS Server on ovirt1                          2049      0          Y       31748
Self-heal Daemon on ovirt1                    N/A       N/A        Y       31759
NFS Server on ovirt0                          2049      0          Y       32656
Self-heal Daemon on ovirt0                    N/A       N/A        Y       32665
NFS Server on ovirt2                          2049      0          Y       8194
Self-heal Daemon on ovirt2                    N/A       N/A        Y       8205

Task Status of Volume iso
------------------------------------------------------------------------------
There are no active volume tasks

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users