Meital,
I'm running the latest stable oVirt, 3.3.3, on CentOS 6.5. For my nodes I
use the CentOS 6 node ISO, "oVirt Node - 3.0.1 - 1.0.2.el6".
I have no way of reproducing it just yet. I can confirm that it's happening
on all nodes in the cluster, and every time a node goes offline this error
pops up.
Could the fact that lockd and statd were not running on the NFS host cause
this error? Is there a known workaround?
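In case it helps, this is roughly how I'm verifying it from a node (a quick
sketch, not polished tooling; the hostname is a placeholder and it assumes
rpcinfo is available). For NFSv3 locking to work, the server should register
both the lock manager (nlockmgr, i.e. lockd) and the status monitor (status,
i.e. statd) with its portmapper:

#!/usr/bin/env python
# Rough sketch: ask the NFS host's portmapper which RPC services it
# registers. 'nlockmgr' (lockd) and 'status' (statd) should both appear.
import subprocess

NFS_HOST = "storage.example.com"  # placeholder -- the FreeBSD NFS host

def registered_services(host):
    # 'rpcinfo -p <host>' prints one line per registered RPC program;
    # the fifth column is the service name.
    out = subprocess.check_output(["rpcinfo", "-p", host],
                                  universal_newlines=True)
    services = set()
    for line in out.splitlines()[1:]:  # skip the header line
        parts = line.split()
        if len(parts) >= 5:
            services.add(parts[4])
    return services

if __name__ == "__main__":
    found = registered_services(NFS_HOST)
    for name in ("nlockmgr", "status"):
        print("%s: %s" % (name, "registered" if name in found else "MISSING"))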
On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbourvin(a)redhat.com> wrote:
Hi Johan,
Please take a look at this error (from vdsm.log):
Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
And then, a few seconds later, you can see:
MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
Meaning that vdsm was restarted.
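To illustrate what that error means (a toy sketch only, not vdsm's actual
code): as far as I understand, vdsm serves these storage calls through a
fixed-size pool of file handlers so a hung NFS mount can't block the whole
daemon. Something like:

import threading

class HandlerPool(object):
    """Toy model of a bounded pool of file handlers (illustration only)."""

    def __init__(self, size, grab_timeout=5.0):
        self._free = threading.Semaphore(size)  # counts idle handlers
        self._grab_timeout = grab_timeout

    def run(self, io_call, *args):
        # Wait briefly for a free handler instead of blocking forever.
        if not self._free.acquire(timeout=self._grab_timeout):
            # Every handler is stuck in unresponsive storage I/O.
            raise RuntimeError("No free file handlers in pool")
        try:
            return io_call(*args)  # e.g. stat() a volume on the NFS mount
        finally:
            self._free.release()

When the NFS server stops answering, each io_call blocks indefinitely; once
the pool is drained, every new request (like your getVolumeSize) aborts
immediately with the error above, which matches the pattern in your log.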
Which oVirt version are you using?
I see that there are a few old bugs that describe the same behaviour, but with different
reproduction steps, for example [1], [2].
Can you think of any reproduction steps that might be causing this issue?
[1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
[2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
------------------------------
From: "Johan Kooijman" <mail(a)johankooijman.com>
To: "users" <users(a)ovirt.org>
Sent: Tuesday, February 18, 2014 1:32:56 PM
Subject: [Users] Nodes lose storage at random
Hi All,
We're seeing some weird issues in our oVirt setup. We have 4 nodes
connected and an NFS (v3) filestore (FreeBSD/ZFS).
Once in a while, seemingly at random, a node loses its connection to
storage and recovers it a minute later. The other nodes usually don't lose
their storage at that moment; just one, or sometimes two, at a time.
We've set up extra tooling to verify the storage performance at those
moments and its availability to other systems. It's always online; the
nodes just don't think so.
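To give an idea, the check boils down to something like this simplified
sketch (not our exact tooling; the mount path is a placeholder): time a
small synced write plus a stat on the mount every few seconds and log the
outliers, so the numbers can be lined up with the engine's warnings:

import os
import time

# Placeholder path -- wherever the data domain is mounted on the node.
MOUNT = "/rhev/data-center/mnt/<nfs-server>:_export_data"
PROBE = os.path.join(MOUNT, "__storage_probe__")

while True:
    start = time.time()
    try:
        with open(PROBE, "w") as f:
            f.write("ping\n")
            f.flush()
            os.fsync(f.fileno())  # push the write all the way to the server
        os.stat(PROBE)
        elapsed = time.time() - start
        if elapsed > 1.0:  # anything over a second is worth logging
            print("slow storage probe: %.2fs" % elapsed)
    except (IOError, OSError) as err:
        print("storage probe failed: %s" % err)
    time.sleep(5)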
The engine tells me this:
2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
The export and data domains live over NFS. There's another domain, ISO,
that lives on the engine machine, also shared over NFS. That domain doesn't
have any issues at all.
Attached are the log files for the relevant time period, for both the engine
server and the node. The node, by the way, is a deployment of the node ISO,
not a full-blown installation.
Any clues on where to begin searching? The NFS server shows no issues, nor
anything in its logs. I did notice that the statd and lockd daemons were
not running, but I wonder whether that could have anything to do with the issue.
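One more data point we can collect on the client side (a rough sketch; it
assumes the nfsstat tool from nfs-utils is present on the node): if the
retrans counter climbs while a node thinks storage is gone, the RPC layer is
timing out even though the server itself looks healthy:

import subprocess

def client_rpc_stats():
    # 'nfsstat -rc' prints the client-side RPC totals:
    #   calls      retrans    authrefrsh
    out = subprocess.check_output(["nfsstat", "-rc"],
                                  universal_newlines=True)
    lines = [l for l in out.splitlines() if l.strip()]
    calls, retrans = lines[-1].split()[:2]  # counters are on the last line
    return int(calls), int(retrans)

if __name__ == "__main__":
    calls, retrans = client_rpc_stats()
    print("rpc calls=%d retrans=%d" % (calls, retrans))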
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
mail(a)johankooijman.com
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail(a)johankooijman.com