<html><body><div style="font-family: times new roman, new york, times, serif; font-size: 12pt; color: #000000"><pre>Hi Johan,<br><div><br></div>Please take a look at this error (from vdsm.log):<br><div><br></div>Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -&gt; state preparing

Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)

Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error

Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task

Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)

Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True

Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100

Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool<br></pre><div><br></div><pre><br></pre><pre>A few seconds later you can see:<br></pre><pre>MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)<br><div><br></div><br>This means that vdsm was restarted.<br><div><br></div>Which oVirt version are you using?<br>I see a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1] and [2].<br>Can you think of any steps that might trigger this issue?<br><div><br></div><br>[1] https://bugzilla.redhat.com/show_bug.cgi?id=948210<br>[2] https://bugzilla.redhat.com/show_bug.cgi?id=853011<br></pre><div><br></div><div><br></div><hr id="zwchr"><blockquote style="border-left:2px solid #1010FF;margin-left:5px;padding-left:5px;color:#000;font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt;"><b>From: </b>"Johan Kooijman" &lt;mail@johankooijman.com&gt;<br><b>To: </b>"users" &lt;users@ovirt.org&gt;<br><b>Sent: </b>Tuesday, February 18, 2014 1:32:56 PM<br><b>Subject: </b>[Users] Nodes lose storage at random<br><div><br></div><div dir="ltr">Hi All,<div><br></div><div>We're seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).</div><div><br></div><div>Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment. It's just one or two at a time.&nbsp;</div>

<div><br></div><div>We've set up extra tooling to verify the storage performance at those moments and its availability to other systems. The storage is always online; only the nodes think it isn't.&nbsp;</div><div><br></div>

<div>The engine tells me this:</div><div><br></div><div><div>2014-02-18 11:48:03,598 WARN &nbsp;[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5</div>

<div>2014-02-18 11:48:18,909 WARN &nbsp;[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5</div><div>2014-02-18 11:48:45,021 WARN &nbsp;[org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.</div>

<div>2014-02-18 11:48:45,070 INFO &nbsp;[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).</div>

<div><br></div><div>The export and data domain live over NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS. That domain doesn't have any issue at all. &nbsp;</div><div><br></div><div>

Attached are the log files for the relevant time period for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.</div><div><br></div><div>Any clues on where to begin searching? The NFS server shows no issues, and there is nothing in its logs. I did notice that the statd and lockd daemons were not running, and I wonder whether that could have anything to do with the issue.</div>

<div><br></div>-- <br>Met vriendelijke groeten / With kind regards,<br>Johan Kooijman<br><div><br></div><a href="mailto:mail@johankooijman.com" target="_blank">mail@johankooijman.com</a>

</div></div>

<br></blockquote><div><br></div></div></body></html>