Hi Johan,

Can you please run something like this on the SPM node?

while true; do echo "$(date) $(ps ax | grep -i '[r]emotefilehandler' | wc -l)" >> /tmp/handler_num.txt; sleep 1; done

(The bracketed pattern keeps grep from counting its own process in the total.)
When it happens again, please stop the script and write here the maximum number and the time it happened.
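To save reading the file by hand, something like this will pull out the maximum afterwards - a sketch, assuming each line ends with the count as its last field, which is what the loop above produces:

```shell
#!/bin/sh
# Find the highest handler count and the line (timestamp) where it occurred.
# Sample data stands in for /tmp/handler_num.txt here, purely for illustration.
log=/tmp/handler_num.txt
printf '%s\n' \
  'Tue Feb 18 14:00:01 UTC 2014 9' \
  'Tue Feb 18 14:00:02 UTC 2014 14' \
  'Tue Feb 18 14:00:03 UTC 2014 11' > "$log"

# $NF is the last field (the count); track the running maximum and its line.
awk '{ if ($NF+0 > max) { max = $NF; when = $0 } }
     END { print "max:", max; print "at: ", when }' "$log"
```

On the node you would of course skip the sample-data step and run the awk line against the real file.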
Also, please check whether "process_pool_max_slots_per_domain" is defined in /etc/vdsm/vdsm.conf, and if so, what's the value? (If it's not defined there, the default is 10.)
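A quick way to check, sketched against a throwaway sample file - on the node you would point it at /etc/vdsm/vdsm.conf instead (the [irs] section shown is only an assumption for the sample):

```shell
#!/bin/sh
# Report the configured process_pool_max_slots_per_domain, falling back to
# the documented default of 10 when the key is absent.
conf=$(mktemp)
printf '[irs]\nprocess_pool_max_slots_per_domain = 10\n' > "$conf"

# Split on '=', strip whitespace from the value side.
value=$(awk -F= '/^[[:space:]]*process_pool_max_slots_per_domain/ \
  { gsub(/[[:space:]]/, "", $2); print $2 }' "$conf")
echo "process_pool_max_slots_per_domain: ${value:-not set (default 10)}"

rm -f "$conf"
```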
Thanks!
----- Original Message -----
From: "Johan Kooijman" <mail(a)johankooijman.com>
To: "Meital Bourvine" <mbourvin(a)redhat.com>
Cc: "users" <users(a)ovirt.org>
Sent: Tuesday, February 18, 2014 2:55:11 PM
Subject: Re: [Users] Nodes lose storage at random
To follow up on this: the setup has only ~80 VMs active right now. The two bug reports are not in scope for this setup; the issues occur at random, even when there's no activity (create/delete VMs), and there are only 4 directories in /rhev/data-center/mnt/.

On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <mail(a)johankooijman.com> wrote:
> Meital,
>
> I'm running the latest stable oVirt, 3.3.3 on CentOS 6.5. For my nodes I use
> the node ISO CentOS 6 "oVirt Node - 3.0.1 - 1.0.2.el6".
>
> I have no way of reproducing just yet. I can confirm that it's happening on
> all nodes in the cluster. And every time a node goes offline, this error
> pops up.
>
> Could the fact that lockd & statd were not running on the NFS host cause
> this error? Is there a workaround available that we know of?
>
> On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbourvin(a)redhat.com> wrote:
> > Hi Johan,
> >
> > Please take a look at this error (from vdsm.log):
> >
> > Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
> > Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
> > Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
> > Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
> > Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
> > Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
> > Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
> > Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
> >
> > And then you can see after a few seconds:
> >
> > MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
> >
> > Meaning that vdsm was restarted.
> >
> > Which oVirt version are you using?
> >
> > I see that there are a few old bugs that describe the same behaviour, but
> > with different reproduction steps, for example [1], [2].
> >
> > Can you think of any reproduction steps that might be causing this issue?
> >
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
> > [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
> > > From: "Johan Kooijman" <mail(a)johankooijman.com>
> > > To: "users" <users(a)ovirt.org>
> > > Sent: Tuesday, February 18, 2014 1:32:56 PM
> > > Subject: [Users] Nodes lose storage at random
> > > Hi All,
> > >
> > > We're seeing some weird issues in our oVirt setup. We have 4 nodes
> > > connected and an NFS (v3) filestore (FreeBSD/ZFS).
> > >
> > > Once in a while, seemingly at random, a node loses its connection to
> > > storage and recovers it a minute later. The other nodes usually don't
> > > lose their storage at that moment. Just one, or two at a time.
> > >
> > > We've set up extra tooling to verify the storage performance at those
> > > moments and the availability for other systems. It's always online;
> > > the nodes just don't think so.
> > > The engine tells me this:
> > >
> > > 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
> > > 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
> > > 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
> > > 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
> > > The export and data domain live over NFS. There's another domain, ISO,
> > > that lives on the engine machine, also shared over NFS. That domain
> > > doesn't have any issue at all.
> > >
> > > Attached are the logfiles for the relevant time period for both the
> > > engine server and the node. The node, by the way, is a deployment of
> > > the node ISO, not a full-blown installation.
> > >
> > > Any clues on where to begin searching? The NFS server shows no issues
> > > nor anything in the logs. I did notice that the statd and lockd daemons
> > > were not running, but I wonder if that can have anything to do with the
> > > issue.
> > > --
> > > Met vriendelijke groeten / With kind regards,
> > > Johan Kooijman
> > > mail(a)johankooijman.com
> > >
> > > _______________________________________________
> > > Users mailing list
> > > Users(a)ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/users
> --
> Met vriendelijke groeten / With kind regards,
> Johan Kooijman
> T +31(0) 6 43 44 45 27
> F +31(0) 162 82 00 01
> E mail(a)johankooijman.com
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E mail(a)johankooijman.com