<div dir="ltr">Meital,<div><br></div><div>It&#39;s been 4 days since the last crash, but 5 minutes ago one of the nodes had the same issue. I&#39;ve been running the script on the SPM as you suggested. At the time the node went down, the SPM had 29 remoteFileHandler processes, no more than before or after the crash. </div>

<div><br></div><div>I&#39;m not sure what to make of this piece of information.</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine <span dir="ltr">&lt;<a href="mailto:mbourvin@redhat.com" target="_blank">mbourvin@redhat.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="font-size:12pt;font-family:times new roman,new york,times,serif"><div>Hi Johan,<br></div><div><br></div>

<div>Can you please run something like this on the SPM node?<br></div><div>while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` &gt;&gt; /tmp/handler_num.txt; sleep 1; done</div><div><br></div><div>When it happens again, please stop the script and post the maximum number here, along with the time at which it occurred.<br>

</div><div><br></div><div>Also, please check if &quot;process_pool_max_slots_per_domain&quot; is defined in /etc/vdsm/vdsm.conf, and if so, what&#39;s the value? (if it&#39;s not defined there, the default is 10)<br></div>
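<div><br></div><div>A minimal sketch of the two checks requested above, assuming a POSIX shell with ps, grep, awk and sed available. The function names are made up for illustration and are not part of vdsm. Note that the one-liner above typically counts the grep process itself as well, so the sketch uses a bracket pattern to avoid that; the config lookup ignores section headers and simply takes the last matching line.</div>
<pre>
```shell
#!/bin/sh
# Sketch only: the function names are illustrative, not part of vdsm.

# Count processes whose command line mentions remoteFileHandler.
# The [r] bracket keeps grep from matching its own command line.
count_handlers() {
    ps ax | grep -ic '[r]emotefilehandler'
}

# Append one "date time count" line per second until interrupted.
log_handlers() {
    while true; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') $(count_handlers)" >> "$1"
        sleep 1
    done
}

# Print the first log line carrying the highest count (last field).
peak_count() {
    awk 'BEGIN { max = -1 }
         $NF + 0 > max { max = $NF + 0; line = $0 }
         END { if (max >= 0) print line }' "$1"
}

# Print process_pool_max_slots_per_domain from the given config file,
# or 10 (the default quoted above) when the option is not set.
configured_slots() {
    value=$(sed -n 's/^[[:space:]]*process_pool_max_slots_per_domain[[:space:]]*=[[:space:]]*//p' "$1" 2>/dev/null | tail -n 1)
    echo "${value:-10}"
}
```
</pre>
<div>For example, run log_handlers /tmp/handler_num.txt until the next crash, stop it with Ctrl-C, and then run peak_count /tmp/handler_num.txt and configured_slots /etc/vdsm/vdsm.conf to get the peak and the configured limit.</div>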

<div><br></div><div>Thanks!<br></div><div><br></div><div><br></div><hr><blockquote style="padding-left:5px;font-size:12pt;font-style:normal;margin-left:5px;font-family:Helvetica,Arial,sans-serif;text-decoration:none;font-weight:normal;border-left:2px solid #1010ff">

<b>From: </b>&quot;Johan Kooijman&quot; &lt;<a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>&gt;<br><b>To: </b>&quot;Meital Bourvine&quot; &lt;<a href="mailto:mbourvin@redhat.com" target="_blank">mbourvin@redhat.com</a>&gt;<br>

<b>Cc: </b>&quot;users&quot; &lt;<a href="mailto:users@ovirt.org" target="_blank">users@ovirt.org</a>&gt;<br><b>Sent: </b>Tuesday, February 18, 2014 2:55:11 PM<br><b>Subject: </b>Re: [Users] Nodes lose storage at random<div>

<div class="h5"><br><div><br></div><div dir="ltr">To follow up on this: the setup has only ~80 VMs active right now. The 2 bug reports don&#39;t apply to this setup: the issues occur at random, even when there&#39;s no activity (creating/deleting VMs), and there are only 4 directories in /<span style="white-space:pre-wrap">rhev/data-center/mnt/.</span><div>


<span style="white-space:pre-wrap"><br></span></div></div><div class="gmail_extra"><br><div><br></div><div class="gmail_quote">On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <span dir="ltr">&lt;<a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Meital,<div><br></div><div>I&#39;m running the latest stable oVirt, 3.3.3, on CentOS 6.5. For my nodes I use the CentOS 6 node ISO &quot;oVirt Node - 3.0.1 - 1.0.2.el6&quot;.</div>


<div><br></div><div>I have no way of reproducing it yet. I can confirm that it&#39;s happening on all nodes in the cluster, and every time a node goes offline, this error pops up.</div>

<div><br></div><div>Could the fact that lockd &amp; statd were not running on the NFS host cause this error? Is there a workaround available that we know of?</div>

</div><div class="gmail_extra"><div><div><br><div><br></div><div class="gmail_quote">On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <span dir="ltr">&lt;<a href="mailto:mbourvin@redhat.com" target="_blank">mbourvin@redhat.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div style="font-size:12pt;font-family:times new roman,new york,times,serif"><pre>Hi Johan,<br><div><br></div>Please take a look at this error (from vdsm.log):<br>


<div><br></div>Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -&gt; state preparing

Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID=&#39;e9f70496-f181-4c9b-9ecb-d7f780772b04&#39;, spUUID=&#39;59980e09-b329-4254-b66e-790abd69e194&#39;, imgUUID=&#39;d50ecfbb-dc98-40cf-9b19-4bd402952aeb&#39;, volUUID=&#39;68fefe24-0346-4d0d-b377-ddd7be7be29c&#39;, options=None)

Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error

Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b (&#39;e9f70496-f181-4c9b-9ecb-d7f780772b04&#39;, &#39;59980e09-b329-4254-b66e-790abd69e194&#39;, &#39;d50ecfbb-dc98-40cf-9b19-4bd402952aeb&#39;, &#39;68fefe24-0346-4d0d-b377-ddd7be7be29c&#39;) {} failed - stopping task

Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)

Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True

Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u&#39;No free file handlers in pool&#39; - code 100

Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool<br></pre><div><br></div><pre><br></pre><pre>

And then you can see after a few seconds:<br></pre><pre>MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)<br>


<div><br></div><br>Meaning that vdsm was restarted.<br><div><br></div>Which oVirt version are you using?<br>I see that there are a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1], [2].<br>


Can you think of any reproduction steps that might be causing this issue?<br><div><br></div><br>[1] <a href="https://bugzilla.redhat.com/show_bug.cgi?id=948210" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=948210</a><br>


[2] <a href="https://bugzilla.redhat.com/show_bug.cgi?id=853011" target="_blank">https://bugzilla.redhat.com/show_bug.cgi?id=853011</a><br></pre><div><br></div><div><br></div><hr><blockquote style="padding-left:5px;font-size:12pt;font-style:normal;margin-left:5px;font-family:Helvetica,Arial,sans-serif;text-decoration:none;font-weight:normal;border-left:2px solid #1010ff">


<b>From: </b>&quot;Johan Kooijman&quot; &lt;<a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>&gt;<br><b>To: </b>&quot;users&quot; &lt;<a href="mailto:users@ovirt.org" target="_blank">users@ovirt.org</a>&gt;<br>


<b>Sent: </b>Tuesday, February 18, 2014 1:32:56 PM<br><b>Subject: </b>[Users] Nodes lose storage at random<div><div><br><div><br></div><div dir="ltr">Hi All,<div><br></div><div>We&#39;re seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).</div>


<div><br></div><div>Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don&#39;t lose their storage at that moment; it&#39;s just one or two at a time. </div>


<div><br></div><div>We&#39;ve set up extra tooling to verify the storage performance at those moments and its availability for other systems. It&#39;s always online; the nodes just don&#39;t think so. </div><div><br></div>


<div>The engine tells me this:</div><div><br></div><div><div>2014-02-18 11:48:03,598 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5</div>


<div>2014-02-18 11:48:18,909 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5</div><div>2014-02-18 11:48:45,021 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.</div>


<div>2014-02-18 11:48:45,070 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).</div>


<div><br></div><div>The export and data domains live on NFS. There&#39;s another domain, ISO, that lives on the engine machine and is also shared over NFS. That domain doesn&#39;t have any issues at all.  </div><div><br></div>


<div>

Attached are the logfiles for the relevant time period for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.</div><div><br></div><div>Any clues on where to begin searching? The NFS server shows no issues, and there is nothing in its logs. I did notice that the statd and lockd daemons were not running, but I wonder if that can have anything to do with the issue.</div>


<div><br></div>-- <br>Met vriendelijke groeten / With kind regards,<br>Johan Kooijman<br><div><br></div><a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>

</div></div>

<br></div></div>_______________________________________________<br>Users mailing list<br><a href="mailto:Users@ovirt.org" target="_blank">Users@ovirt.org</a><br><a href="http://lists.ovirt.org/mailman/listinfo/users" target="_blank">http://lists.ovirt.org/mailman/listinfo/users</a><br>


</blockquote><div><br></div></div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Met vriendelijke groeten / With kind regards,<br>Johan Kooijman<br><div><br></div></div></div>T <a href="tel:%2B31%280%29%206%2043%2044%2045%2027" target="_blank">+31(0) 6 43 44 45 27</a><br>


F <a href="tel:%2B31%280%29%20162%2082%2000%2001" target="_blank">+31(0) 162 82 00 01</a><br>

E <a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>

</div>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>Met vriendelijke groeten / With kind regards,<br>Johan Kooijman<br><div><br></div>T <a href="tel:%2B31%280%29%206%2043%2044%2045%2027" value="+31643444527" target="_blank">+31(0) 6 43 44 45 27</a><br>

F <a href="tel:%2B31%280%29%20162%2082%2000%2001" value="+31162820001" target="_blank">+31(0) 162 82 00 01</a><br>E <a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>

</div>

</div></div></blockquote><div><br></div></div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Met vriendelijke groeten / With kind regards,<br>Johan Kooijman<br><br>T +31(0) 6 43 44 45 27<br>F +31(0) 162 82 00 01<br>

E <a href="mailto:mail@johankooijman.com">mail@johankooijman.com</a>

</div>