[ovirt-users] ovirt+gluster+NFS : storage hicups

Nicolas Ecarnot nicolas at ecarnot.net
Wed Aug 5 14:32:38 UTC 2015


Hi,

I used the two links below to set up a test DC:

http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-two/

The only thing I did differently is that I did not use a hosted engine; 
I dedicated a solid server to it instead.
So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).

As in the docs above, my 3 hosts are publishing 300 GB of replicated 
gluster storage, on top of which ctdb manages a floating virtual IP that 
is used by NFS for the master storage domain.

The last point is that the manager is also presenting an NFS storage 
that I'm using as an export domain.

It took me some time to put this whole setup together, as it is a bit 
more complicated than my other DC with a real SAN and no gluster, but it 
is finally working (I can run VMs, migrate them...)

I have run many harsh tests (from a very dumb user point of view: 
unplug/replug the power cable of a server - does ctdb float the vIP? 
does gluster self-heal? does the VM restart?...)
When looking precisely at each layer one by one, all seems to be correct: 
ctdb is fast at moving the IP, NFS is OK, gluster seems to rebuild, 
fencing eventually worked with the lanplus workaround, and so on...

But from time to time, a severe hiccup seems to appear, which I have 
great difficulty diagnosing.
The messages in the web GUI are not very precise, and not consistent:
- some report that a host has network issues, but I can ping it from 
every place it needs to be reached from (especially from the SPM and the 
manager)
"On host serv-vm-al01, Error: Network error during communication with 
the Host"

- some report that a volume is degraded when it's not (gluster commands 
show no issue, and even the oVirt tab about the volumes is all green)

- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN> 
attached to the Data Center"
Just waiting a couple of seconds leads to a self-heal with no action on 
my part.

- Repeated "Detected change in status of brick 
serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP."
but absolutely no action is being performed on this filesystem.
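
To put numbers on these repeated brick messages, something like the 
following could count the status changes per brick. It is only a sketch: 
count_brick_flaps is a name made up for illustration, and it assumes the 
message text shown in the web GUI also appears verbatim, on one line, in 
the log file fed to it.

```shell
#!/bin/sh
# Count brick status transitions per brick and per direction.
# Reads log lines on stdin. Assumption: the log contains the same
# "Detected change in status of brick ..." text as the web GUI.
count_brick_flaps() {
    # Keep only the interesting part of each matching line ...
    grep -o 'status of brick [^ ]* of volume [^ ]* from [A-Z]* to [A-Z]*' |
        # ... then print "<brick> <OLD>-><NEW>" (fields 4, 9 and 11) ...
        awk '{ print $4, $9 "->" $11 }' |
        # ... and count identical transitions.
        sort | uniq -c
}

# Example call; the path is an assumption (the usual engine log location):
# count_brick_flaps < /var/log/ovirt-engine/engine.log
```

A brick that really flaps should show both DOWN->UP and UP->DOWN lines; 
seeing only one direction would rather point at the monitoring side.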

At this time, no VM is running in this test datacenter, and no action 
is being performed on the hosts. Yet I see some errors looping, coming 
and going, and I find no way to diagnose them.

Among the *actions* I thought of trying to solve some of these issues:
- I've found that trying to force the self-healing and playing with 
gluster commands had no effect
- I've found that the actions advised by gluster ("find /gluster 
-exec stat {} \; ...") seem to have no effect either
- I've found that forcing ctdb to move the vIP ("ctdb stop, ctdb 
continue") DID SOLVE most of these issues.
I believe that it's not what ctdb itself is doing that helps, but maybe 
one of its shell hooks is cleaning up some trouble?
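
To see what actually changes when the vIP moves, a per-layer snapshot 
taken before and after the "ctdb stop, ctdb continue" could be compared. 
The script below is a rough sketch, not something I claim is complete: 
run_check and snapshot are names made up for illustration, and the 
volume name "data" is taken from the brick message above.

```shell
#!/bin/sh
# Snapshot the state of each storage layer on one host.
# Assumption: the gluster volume is called "data" (override with VOLUME=...).
VOLUME="${VOLUME:-data}"

run_check() {
    # Print the command, then run it only if the tool exists on this host.
    echo "== $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "skipped: $1 not found"
    fi
}

snapshot() {
    date -u
    run_check gluster peer status                  # are all peers connected?
    run_check gluster volume status "$VOLUME"      # are all bricks online?
    run_check gluster volume heal "$VOLUME" info   # entries pending heal
    run_check ctdb status                          # node health as seen by ctdb
    run_check ctdb ip                              # which node holds the vIP
}

# Example: snapshot > before.txt; ctdb stop; ctdb continue;
#          snapshot > after.txt; diff before.txt after.txt
```

If the diff shows only the vIP owner changing, that would suggest the 
fix comes from the NFS side being restarted or re-exported rather than 
from gluster itself.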

As this setup is complex, I'm not asking anyone for a silver bullet, but 
maybe you know which layer is the most fragile, and which one I should 
look at more closely?

-- 
Nicolas ECARNOT


