
----- Original Message -----
From: "Ted Miller" <tmiller@hcjb.org> To: "Federico Simoncelli" <fsimonce@redhat.com>, "Itamar Heim" <iheim@redhat.com> Cc: users@ovirt.org Sent: Monday, January 27, 2014 7:16:14 PM Subject: Re: [Users] Data Center stuck between "Non Responsive" and "Contending"
On 1/27/2014 3:47 AM, Federico Simoncelli wrote:
Maybe someone from gluster can easily identify what happened. Meanwhile, if you just want to repair your data-center, you could try:
$ cd /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/
$ touch ids
$ sanlock direct init -s 0322a407-2b16-40dc-ac67-13d387c6eb4c:0:ids:1048576
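To confirm the lockspace was actually written, the new ids file can be inspected afterwards (a minimal check, assuming the same working directory as above and a sanlock build that provides the direct dump subcommand; it should print the delta-lease records for the lockspace):

$ sanlock direct dump ids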
Federico,

I won't be able to do anything to the oVirt setup for another 5 hours or so (it is a trial system I am working on at home, and I am at work now), but I will try your repair script and report back.
In bugzilla 862975 they suggested turning off write-behind caching and "eager locking" on the gluster volume to avoid or reduce the problems that come from many different hosts writing to the same file(s) very frequently. If I interpret the comment in the bug correctly, it did seem to help in that situation. My situation is a little different: my gluster setup is replicate only, replica 3 (though there are only two hosts). I was not stress-testing it; I was just using it, trying to figure out how I can import some old VMware VMs without an ESXi server to run them on.
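For reference, the bug's suggestion would translate to something like the following on this setup (a sketch, assuming the volume name is VM2, taken from the mount path above, and that performance.write-behind and cluster.eager-lock are the options the bug comment refers to):

$ gluster volume set VM2 performance.write-behind off
$ gluster volume set VM2 cluster.eager-lock off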
Have you done anything similar to what is described here in comment 21? https://bugzilla.redhat.com/show_bug.cgi?id=859589#c21

When did you realize that you weren't able to use the data-center anymore? Can you describe exactly what you did and what happened, for example:

1. I created the data center (up and running)
2. I tried to import some VMs from VMware
3. During the import (or after it) the data-center went into the contending state
...

Did something special happen? I don't know, power loss, split-brain? For example, an excessive load on one of the servers could also have triggered a timeout somewhere (forcing the data-center back into the contending state). Could you check if any host was fenced (forcibly rebooted)?
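A quick way to rule out split-brain on the gluster side (a sketch, again assuming the VM2 volume; the first command lists files pending heal per brick, the second only those in split-brain):

$ gluster volume heal VM2 info
$ gluster volume heal VM2 info split-brain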
I am guessing that what gives a storage domain the (Master) designation is that it is the one that actually contains the sanlock data? If so, would it make sense to set up a gluster volume to be (Master), but not use it for VM storage, just for storing the sanlock info? Separate gluster volume(s) could then hold the VMs and would not need the optimizations turned off.
Any domain must be able to become the master at any time. Without a master the data center is unusable (at the present time); that's why we migrate (or reconstruct) it on another domain when necessary.

-- 
Federico
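For what it's worth, which domain currently holds the master role can be read from the domain metadata (a sketch, assuming the same dom_md layout as in the repair commands above; the master domain reports ROLE=Master, the others ROLE=Regular):

$ grep -E 'ROLE|MASTER_VERSION' /rhev/data-center/mnt/glusterSD/10.41.65.2\:VM2/0322a407-2b16-40dc-ac67-13d387c6eb4c/dom_md/metadata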