Hi Johan,
Please take a look at this error (from vdsm.log):
Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
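For context, that error usually means that every handler in vdsm's fixed-size pool of file-operation workers was busy, typically blocked on unresponsive storage, so new requests were rejected instead of hanging. Here is a minimal sketch of the pattern in Python; it is purely illustrative, not vdsm's actual code, and the pool size and timeout are made up:

# Illustrative sketch only, not vdsm's actual implementation: file operations
# run through a fixed-size pool of handlers so that a hung NFS mount cannot
# block the main process; once every handler is stuck, new requests fail fast.
import os
import threading


class NoFreeHandlersError(Exception):
    """Raised when all handlers in the pool are busy."""


class FileHandlerPool:
    def __init__(self, size, acquire_timeout):
        self._slots = threading.Semaphore(size)
        self._timeout = acquire_timeout

    def run(self, func, *args, **kwargs):
        # Try to grab a free handler; if all of them are blocked on slow or
        # unresponsive storage, give up quickly instead of hanging the caller.
        if not self._slots.acquire(timeout=self._timeout):
            raise NoFreeHandlersError("No free file handlers in pool")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


if __name__ == "__main__":
    pool = FileHandlerPool(size=10, acquire_timeout=5)  # made-up numbers
    print(pool.run(os.listdir, "/tmp"))

Once the NFS export stalls, every handler ends up stuck in a blocking call, the pool empties, and calls such as the getVolumeSize above fail with code 100 until the handlers recover or vdsm is restarted.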
A few seconds later you can see:
MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
Meaning that vdsm was restarted.
Which oVirt version are you using?
I see a few old bugs that describe the same behaviour but with different reproduction steps, for example [1] and [2].
Can you think of any reproduction steps that might be triggering this issue?
[1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
[2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
----- Original Message -----
From: "Johan Kooijman" <mail(a)johankooijman.com>
To: "users" <users(a)ovirt.org>
Sent: Tuesday, February 18, 2014 1:32:56 PM
Subject: [Users] Nodes lose storage at random
Hi All,
We're seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).
Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment; just one, or two at a time.
We've set up extra tooling to verify the storage performance at those moments and its availability to other systems. It's always online; the nodes just don't think so.
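For reference, here is a minimal sketch of the kind of probe such tooling might use. This is an assumption about the approach, not the actual scripts, and the mount point and threshold are placeholders:

# A sketch of the kind of probe such tooling might use (an assumption, not the
# actual scripts). It stats the export from another machine and reports when a
# response is slow or fails. MOUNT_POINT and THRESHOLD are placeholders.
import os
import time

MOUNT_POINT = "/mnt/storage-check"   # hypothetical mount of the same NFS export
THRESHOLD = 5.0                      # seconds before a response counts as slow

while True:
    start = time.monotonic()
    try:
        os.statvfs(MOUNT_POINT)      # cheap metadata call against the export
        elapsed = time.monotonic() - start
        if elapsed > THRESHOLD:
            print("slow storage response: %.1fs" % elapsed)
    except OSError as exc:
        print("storage probe failed: %s" % exc)
    time.sleep(10)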
The engine tells me this:
2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
The export and data domains live over NFS. There's another domain, ISO, that lives on the engine machine and is also shared over NFS. That domain doesn't have any issues at all.
Attached are the log files for the relevant time period for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.
Any clues on where to begin searching? The NFS server shows no issues, nor anything in its logs. I did notice that the statd and lockd daemons were not running, but I wonder whether that could have anything to do with the issue.
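A quick way to check whether they matter would be to verify that the NFSv3 locking services are registered with the portmapper on the server, for example with a small script like this sketch (it assumes rpcinfo is installed, and the hostname is a placeholder):

# Sketch: check whether the NFSv3 locking services (status for statd,
# nlockmgr for lockd) are registered with the portmapper on a host.
# Assumes rpcinfo is installed; the hostname below is a placeholder.
import subprocess


def nfs_lock_services(host):
    out = subprocess.run(
        ["rpcinfo", "-p", host],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        "statd (status)": "status" in out,
        "lockd (nlockmgr)": "nlockmgr" in out,
    }


if __name__ == "__main__":
    print(nfs_lock_services("nfs-server.example.com"))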
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
mail(a)johankooijman.com
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users