Hi Johan,

Can you please run something like this on the SPM node?

while true; do echo "$(date) $(ps ax | grep -i '[r]emotefilehandler' | wc -l)" >> /tmp/handler_num.txt; sleep 1; done

(The bracketed pattern keeps grep from counting its own process in the total.)
When it happens again, please stop the script and write here the maximum number and the time it happened.
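To save reading the file by hand, something like this will pull out the maximum afterwards - a sketch, assuming each line ends with the count as its last field, which is what the loop above produces:

```shell
#!/bin/sh
# Find the highest handler count and the line (timestamp) where it occurred.
# Sample data stands in for /tmp/handler_num.txt here, purely for illustration.
log=/tmp/handler_num.txt
printf '%s\n' \
  'Tue Feb 18 14:00:01 UTC 2014 9' \
  'Tue Feb 18 14:00:02 UTC 2014 14' \
  'Tue Feb 18 14:00:03 UTC 2014 11' > "$log"

# $NF is the last field (the count); track the running maximum and its line.
awk '{ if ($NF+0 > max) { max = $NF; when = $0 } }
     END { print "max:", max; print "at: ", when }' "$log"
```

On the node you would of course skip the sample-data step and run the awk line against the real file.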
Also, please check whether "process_pool_max_slots_per_domain" is defined in /etc/vdsm/vdsm.conf, and if so, what's the value? (If it's not defined there, the default is 10.)
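A quick way to check, sketched against a throwaway sample file - on the node you would point it at /etc/vdsm/vdsm.conf instead (the [irs] section shown is only an assumption for the sample):

```shell
#!/bin/sh
# Report the configured process_pool_max_slots_per_domain, falling back to
# the documented default of 10 when the key is absent.
conf=$(mktemp)
printf '[irs]\nprocess_pool_max_slots_per_domain = 10\n' > "$conf"

# Split on '=', strip whitespace from the value side.
value=$(awk -F= '/^[[:space:]]*process_pool_max_slots_per_domain/ \
  { gsub(/[[:space:]]/, "", $2); print $2 }' "$conf")
echo "process_pool_max_slots_per_domain: ${value:-not set (default 10)}"

rm -f "$conf"
```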
Thanks!
----- Original Message -----
From: "Johan Kooijman" <mail(a)johankooijman.com>
To: "Meital Bourvine" <mbourvin(a)redhat.com>
Cc: "users" <users(a)ovirt.org>
Sent: Tuesday, February 18, 2014 2:55:11 PM
Subject: Re: [Users] Nodes lose storage at random
To follow up on this: the setup has only ~80 VMs active right now. The two bug reports are not in scope for this setup; the issues occur at random, even when there's no activity (create/delete VMs), and there are only 4 directories in /rhev/data-center/mnt/.

On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <mail(a)johankooijman.com> wrote:
> Meital,
>
> I'm running the latest stable oVirt, 3.3.3 on CentOS 6.5. For my nodes I use
> the node ISO CentOS 6 "oVirt Node - 3.0.1 - 1.0.2.el6".
>
> I have no way of reproducing just yet. I can confirm that it's happening on
> all nodes in the cluster. And every time a node goes offline, this error
> pops up.
>
> Could the fact that lockd & statd were not running on the NFS host cause
> this error? Is there a workaround available that we know of?
>
> On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbourvin(a)redhat.com> wrote:
> > Hi Johan,
> >
> > Please take a look at this error (from vdsm.log):
> >
> > Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
> > Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
> > Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
> > Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
> > Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
> > Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
> > Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
> > Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
> >
> > And then you can see after a few seconds:
> >
> > MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
> >
> > Meaning that vdsm was restarted.
> >
> > Which oVirt version are you using?
> >
> > I see that there are a few old bugs that describe the same behaviour, but
> > with different reproduction steps, for example [1], [2].
> >
> > Can you think of any reproduction steps that might be causing this issue?
> >
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
> > [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
> > > From: "Johan Kooijman" <mail(a)johankooijman.com>
> > > To: "users" <users(a)ovirt.org>
> > > Sent: Tuesday, February 18, 2014 1:32:56 PM
> > > Subject: [Users] Nodes lose storage at random
> > > Hi All,
> > >
> > > We're seeing some weird issues in our oVirt setup. We have 4 nodes
> > > connected and an NFS (v3) filestore (FreeBSD/ZFS).
> > >
> > > Once in a while, seemingly at random, a node loses its connection to
> > > storage and recovers it a minute later. The other nodes usually don't
> > > lose their storage at that moment. Just one, or two at a time.
> > >
> > > We've set up extra tooling to verify the storage performance at those
> > > moments and the availability for other systems. It's always online;
> > > the nodes just don't think so.
> > > The engine tells me this:
> > >
> > > 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
> > > 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
> > > 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
> > > 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
> > > The export and data domain live over NFS. There's another domain, ISO,
> > > that lives on the engine machine, also shared over NFS. That domain
> > > doesn't have any issue at all.
> > >
> > > Attached are the logfiles for the relevant time period for both the
> > > engine server and the node. The node, by the way, is a deployment of
> > > the node ISO, not a full-blown installation.
> > >
> > > Any clues on where to begin searching? The NFS server shows no issues
> > > nor anything in the logs. I did notice that the statd and lockd daemons
> > > were not running, but I wonder if that can have anything to do with the
> > > issue.
> > > --
> > > Met vriendelijke groeten / With kind regards,
> > > Johan Kooijman
> > > mail(a)johankooijman.com
> > >
> > > _______________________________________________
> > > Users mailing list
> > > Users(a)ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/users
> --
> Met vriendelijke groeten / With kind regards,
> Johan Kooijman
> T +31(0) 6 43 44 45 27
> F +31(0) 162 82 00 01
> E mail(a)johankooijman.com
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail(a)johankooijman.com