[Users] Data Center stuck between "Non Responsive" and "Contending"

Mon Jan 27 21:46:10 UTC 2014

----- Original Message -----
> From: "Ted Miller" <tmiller at hcjb.org>
> To: "Federico Simoncelli" <fsimonce at redhat.com>, "Itamar Heim" <iheim at redhat.com>
> Cc: users at ovirt.org
> Sent: Monday, January 27, 2014 7:16:14 PM
> Subject: Re: [Users] Data Center stuck between "Non Responsive" and "Contending"
> 
> 
> On 1/27/2014 3:47 AM, Federico Simoncelli wrote:
> > Maybe someone from gluster can identify easily what happened. Meanwhile if
> > you just want to repair your data-center you could try with:
> >
> >   $ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
> >   $ touch ids
> >   $ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576
> Federico,
> 
> I won't be able to do anything to the ovirt setup for another 5 hours or so
> (it is a trial system I am working on  at home, I am at work), but I will try
> your repair script and report back.
> 
> In bugzilla 862975 they suggested turning off write-behind caching and "eager
> locking" on the gluster volume to avoid/reduce the problems that come from
> many different computers all writing to the same file(s) on a very frequent
> basis.  If I interpret the comment in the bug correctly, it did seem to help
> in that situation.  My situation is a little different.  My gluster setup is
> replicate only, replica 3 (though there are only two hosts).  I was not
> stress-testing it, I was just using it, trying to figure out how I can import
> some old VMWare VMs without an ESXi server to run them on.
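
The argument passed to `sanlock direct init -s` in the quoted repair steps
appears to follow the lockspace format given in the sanlock man page,
`lockspace_name:host_id:path:offset`. As a minimal sketch (the variable names
are illustrative and not from this thread), the fields can be split in the
shell to double-check what would be initialized before running the command:

```shell
#!/bin/sh
# Lockspace string copied from the quoted repair steps.
lockspace='0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576'

# Split on ':' into four fields, assuming the man page layout
# lockspace_name:host_id:path:offset (variable names are illustrative).
IFS=: read -r ls_name ls_host_id ls_path ls_offset <<EOF
$lockspace
EOF

echo "lockspace name: $ls_name"
echo "host id:        $ls_host_id"
echo "path:           $ls_path"
echo "offset field:   $ls_offset"
```

The path field here is relative (`ids`), which is why the quoted steps change
into the `dom_md/` directory first.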

Have you done anything similar to what is described here in comment 21?

https://bugzilla.redhat.com/show_bug.cgi?id=859589#c21

When did you realize that you weren't able to use the data-center anymore?
Can you describe exactly what you did and what happened, for example:

1. I created the data center (up and running)
2. I tried to import some VMs from VMWare
3. During the import (or after it) the data-center went into the contending state
...

Did anything unusual happen, such as a power loss or a split-brain?
Excessive load on one of the servers could also have triggered a timeout
somewhere, forcing the data-center back into the contending state.

Could you check whether any host was fenced (forcibly rebooted)?
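
One rough way to look for a forced reboot on each host is to list the recent
boot and shutdown records and compare them with the time the data-center
stopped responding. This is only a sketch under assumptions: the function name
is hypothetical, `last` needs a readable wtmp file, and `/proc/uptime` exists
only on Linux. A reboot entry with no shutdown entry before it may point to an
unclean restart, but it does not prove that the host was fenced.

```shell
#!/bin/sh
# Sketch: print recent boot/shutdown records and the current uptime.
# show_boot_history is a hypothetical helper name, not an oVirt or VDSM tool.
show_boot_history() {
    if command -v last >/dev/null 2>&1; then
        # -x also lists shutdown and runlevel entries recorded in wtmp.
        last -x reboot shutdown 2>/dev/null | head -n 20
    else
        echo "last(1) is not available on this host"
    fi

    # First field of /proc/uptime is the number of seconds since boot.
    if [ -r /proc/uptime ]; then
        read -r up _ < /proc/uptime
        echo "seconds since boot: ${up%%.*}"
    fi
    return 0
}

show_boot_history
```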

> I am guessing that what makes cluster storage have the (Master) designation
> is that this is the one that actually contains the sanlocks?  If so, would it
> make sense to set up a gluster volume to be (Master), but not use it for VM
> storage, just for storing the sanlock info?  Separate gluster volume(s) could
> then have the VMs on it(them), and would not need the optimizations turned
> off.

Any domain must be able to become the master at any time. Without a master
the data center is unusable (at the present time), which is why we migrate
(or reconstruct) it on another domain when necessary.

-- 
Federico