[Users] Nodes lose storage at random

Meital Bourvine mbourvin at redhat.com
Tue Feb 18 06:57:37 EST 2014


Hi Johan, 

Please take a look at this error (from vdsm.log): 

Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool 

And then you can see after a few seconds: 
MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) 

Meaning that vdsm was restarted. 
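A quick way to correlate the pool exhaustion with the restart is to scan the log programmatically. A minimal sketch (assuming the vdsm.log line format shown above; the marker strings are taken from the excerpts, not from any vdsm API):

```python
import re

# Marker strings as they appear in the vdsm.log excerpts above.
POOL_ERR = "No free file handlers in pool"
RESTART = "I am the actual vdsm"

def summarize(log_text):
    """Return (pool_exhaustion_count, restart_timestamps) for a vdsm.log excerpt."""
    pool_errors = 0
    restarts = []
    for line in log_text.splitlines():
        if POOL_ERR in line:
            pool_errors += 1
        if RESTART in line:
            # Pull the timestamp, e.g. "2014-02-18 10:48:45" (sub-second part dropped).
            m = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", line)
            restarts.append(m.group(0) if m else None)
    return pool_errors, restarts
```

Running this over the relevant window should show a burst of pool-exhaustion aborts immediately before each restart timestamp, which is the pattern in the snippets above.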

Which oVirt version are you using? 
I see a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1] and [2]. 
Can you think of any reproduction steps that might be causing this issue? 

[1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 
[2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 

----- Original Message -----

> From: "Johan Kooijman" <mail at johankooijman.com>
> To: "users" <users at ovirt.org>
> Sent: Tuesday, February 18, 2014 1:32:56 PM
> Subject: [Users] Nodes lose storage at random

> Hi All,

> We're seeing some weird issues in our ovirt setup. We have 4 nodes connected
> and an NFS (v3) filestore (FreeBSD/ZFS).

> Once in a while, seemingly at random, a node loses its connection to
> storage and recovers it a minute later. The other nodes usually don't lose
> their storage at that moment. Just one, or two at a time.

> We've set up extra tooling to verify the storage performance at those moments
> and its availability to other systems. It's always online; the nodes just
> don't think so.

> The engine tells me this:

> 2014-02-18 11:48:03,598 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in
> problem. vds: hv5
> 2014-02-18 11:48:18,909 WARN
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in
> problem. vds: hv5
> 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager]
> (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds =
> 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
> 2014-02-18 11:48:45,070 INFO
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call
> Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS.
> Setting Data Center status to Non Responsive (On host hv5, Error: Network
> error during communication with the Host.).

> The export and data domain live over NFS. There's another domain, ISO, that
> lives on the engine machine, also shared over NFS. That domain doesn't have
> any issue at all.

> Attached are the logfiles for the relevant time period for both the engine
> server and the node. The node, by the way, is a deployment of the node ISO,
> not a full-blown installation.

> Any clues on where to begin searching? The NFS server shows no issues, nor
> anything in its logs. I did notice that the statd and lockd daemons were not
> running, but I wonder whether that could have anything to do with the issue.

> --
> Met vriendelijke groeten / With kind regards,
> Johan Kooijman

> mail at johankooijman.com

> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

