Hi Johan,

Can you please run something like this on the SPM node?

while true; do echo `date; ps ax | grep -i '[r]emotefilehandler' | wc -l` >> /tmp/handler_num.txt; sleep 1; done

(The [r] in the pattern keeps grep from counting itself.)

When it happens again, please stop the script and write here the maximum number and the time it happened.

Also, please check whether "process_pool_max_slots_per_domain" is defined in /etc/vdsm/vdsm.conf, and if so, what the value is (if it's not defined there, the default is 10).
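
Something like this should pull the maximum back out of the file (a rough sketch; it assumes the one-line-per-second format above, with the count as the last field):

# print the line with the highest handler count
awk '$NF > max { max = $NF; line = $0 } END { print line }' /tmp/handler_num.txt

# show the configured value, if any
grep process_pool_max_slots_per_domain /etc/vdsm/vdsm.conf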
<span style="color:rgb(0,0,0);white-space:pre-wrap"><br></span></div></div><div class="gmail_extra"><br><div><br></div><div class="gmail_quote">On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <span dir="ltr"><<a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Meital,<div><br></div><div>I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I use the node iso CentOS 6 "oVirt Node - 3.0.1 - 1.0.2.el6".</div>

I have no way of reproducing it just yet. I can confirm that it's happening on all nodes in the cluster, and every time a node goes offline this error pops up.

Could the fact that lockd & statd were not running on the NFS host cause this error? Is there a workaround available that we know of?
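
For what it's worth, something like this from one of the nodes should show whether they're registered on the filer (a rough check; it assumes rpcbind on the NFS host answers remote queries, and <filer> is a placeholder for its address):

# look for nlockmgr (lockd) and status (statd) registrations
rpcinfo -p <filer> | egrep 'nlockmgr|status'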
</div><div class="gmail_extra"><div><div class="h5"><br><div><br></div><div class="gmail_quote">On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <span dir="ltr"><<a href="mailto:mbourvin@redhat.com" target="_blank">mbourvin@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="font-size:12pt;font-family:times new roman,new york,times,serif"><pre>Hi Johan,<br><div><br></div>Please take a look at this error (from vdsm.log):<br>
<div><br></div>Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool

And then you can see after a few seconds:

MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)

Meaning that vdsm was restarted.
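
A rough way to count how often that happens, assuming the default vdsm log location on the node:

# each hit is one vdsm start
grep 'I am the actual vdsm' /var/log/vdsm/vdsm.log*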

Which oVirt version are you using?
I see a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1], [2].
Can you think of any reproduction steps that might be causing this issue?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
[2] https://bugzilla.redhat.com/show_bug.cgi?id=853011

----- Original Message -----
<b>From: </b>"Johan Kooijman" <<a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>><br><b>To: </b>"users" <<a href="mailto:users@ovirt.org" target="_blank">users@ovirt.org</a>><br>
Sent: Tuesday, February 18, 2014 1:32:56 PM
Subject: [Users] Nodes lose storage at random

Hi All,

We're seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).

Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment; it's just one, or two, at a time.

We've set up extra tooling to verify the storage performance at those moments and its availability to other systems. It's always online; the nodes just don't think so.

The engine tells me this:

2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).

The export and data domains live over NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS; that domain doesn't have any issues at all.

Attached are the log files for the relevant time period, for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.

Any clues on where to begin searching? The NFS server shows no issues, nor anything in its logs. I did notice that the statd and lockd daemons were not running, but I wonder whether that could have anything to do with the issue.

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

mail@johankooijman.com

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail@johankooijman.com

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail@johankooijman.com