
Hi Johan,

Can you please run something like this on the SPM node?

while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done

When it happens again, please stop the script and report the maximum number it logged and the time at which it happened.

Also, please check whether "process_pool_max_slots_per_domain" is defined in /etc/vdsm/vdsm.conf, and if so, what its value is. (If it's not defined there, the default is 10.)

Thanks!
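P.S. A quick way to check that config value (just a sketch, assuming the default config location):

grep -i '^process_pool_max_slots_per_domain' /etc/vdsm/vdsm.conf || echo "not set, default of 10 applies"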
From: "Johan Kooijman" <mail@johankooijman.com> To: "Meital Bourvine" <mbourvin@redhat.com> Cc: "users" <users@ovirt.org> Sent: Tuesday, February 18, 2014 2:55:11 PM Subject: Re: [Users] Nodes lose storage at random
To follow up on this: the setup has only ~80 VMs active right now. The two bug reports don't apply to this setup; the issues occur at random, even when there's no activity (no VMs being created or deleted), and there are only 4 directories in /rhev/data-center/mnt/.
On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <mail@johankooijman.com> wrote:
Meital,
I'm running the latest stable oVirt, 3.3.3, on CentOS 6.5. For my nodes I use the CentOS 6 node ISO, "oVirt Node - 3.0.1 - 1.0.2.el6".
I have no way of reproducing it just yet. I can confirm that it's happening on all nodes in the cluster, and every time a node goes offline, this error pops up.
Could the fact that lockd and statd were not running on the NFS host cause this error? Is there a known workaround?
On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbourvin@redhat.com> wrote:
Hi Johan,
Please take a look at this error (from vdsm.log):
Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
And then you can see after a few seconds:
MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
Meaning that vdsm was restarted.
Which oVirt version are you using?
I see that there are a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1], [2].
Can you think of any reproduction steps that might be causing this issue?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
[2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
From: "Johan Kooijman" < mail@johankooijman.com >
To: "users" < users@ovirt.org >
Sent: Tuesday, February 18, 2014 1:32:56 PM
Subject: [Users] Nodes lose storage at random
Hi All,
We're seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).
Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment; just one, or two at a time.
We've set up extra tooling to verify the storage performance at those moments and its availability to other systems. The storage is always online; it's just the nodes that don't think so.
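(For illustration only, not our exact tooling: the idea is to probe the export from an independent client and log read latency, roughly like this.)

# Probe the NFS export from a separate client and log read latency.
# Illustrative sketch: the mount point and probe file are placeholders.
while true; do
    start=$(date +%s.%N)
    dd if=/mnt/nfs-probe/testfile of=/dev/null bs=1M count=1 iflag=direct 2>/dev/null && status=ok || status=FAIL
    end=$(date +%s.%N)
    echo "$(date '+%F %T') $status $(echo "$end - $start" | bc)s" >> /tmp/nfs_probe.log
    sleep 5
done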
The engine tells me this:
2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
The export and data domain live over NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS. That domain doesn't have any issue at all.
Attached are the log files for the relevant time period, for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.
Any clues on where to begin searching? The NFS server shows no issues, and there's nothing in its logs. I did notice that the statd and lockd daemons were not running, but I wonder whether that could have anything to do with the issue.
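(If it turns out to matter: on FreeBSD I assume they'd be enabled with something like the following in /etc/rc.conf and then started by hand, but I haven't changed anything yet.)

# /etc/rc.conf (assumed knobs for NFSv3 locking on FreeBSD)
rpc_statd_enable="YES"
rpc_lockd_enable="YES"

# start the daemons without a reboot
service statd start
service lockd start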
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
mail@johankooijman.com
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail@johankooijman.com
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail@johankooijman.com