[ovirt-users] ovirt+gluster+NFS : storage hicups
Nicolas Ecarnot
nicolas at ecarnot.net
Wed Aug 5 14:32:38 UTC 2015
Hi,
I used the two links below to set up a test DC:
http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-two/
The only thing I did differently is that I did not use a hosted engine,
but dedicated a solid server for that.
So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).
As in the docs above, my 3 hosts are publishing 300 GB of replicated
gluster storage, above which ctdb is managing a floating virtual IP that
is used by NFS as the master storage domain.
The last point is that the manager is also presenting an NFS storage I'm
using as an export domain.
It took me some time to wire all this up, as it is a bit more
complicated than my other DC with a real SAN and no gluster, but it is
eventually working (I can run VMs, migrate them...).
I have run many severe tests (from a very dumb user's point of view:
unplug/replug the power cable of a server - does ctdb float the vIP?
does gluster self-heal? do the VMs restart?...).
When looking closely at each layer one by one, all seems correct:
ctdb is fast at moving the IP, NFS is OK, gluster seems to
reconstruct, fencing eventually worked with the lanplus workaround, and
so on...
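For the record, my per-layer checks look roughly like the sketch below (the volume name "data" and the placeholder virtual IP are from my setup; adapt them to yours):

```shell
# Check that ctdb considers all nodes healthy and see which node
# currently holds the floating public IP
ctdb status
ctdb ip

# Check gluster brick status and any files still pending self-heal
# ("data" is my replicated volume)
gluster volume status data
gluster volume heal data info

# Check that the NFS export behind the virtual IP still answers
showmount -e <virtual-ip>
```

When each of these comes back clean, I consider the layer healthy, which is what makes the oVirt-side errors so confusing.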
But from time to time, a severe hiccup appears which I have great
difficulty diagnosing.
The messages in the web GUI are not very precise, and not consistent:
- Some tell about a host having network issues, though I can ping it
from every place it needs to be reached (especially from the SPM and the
manager):
"On host serv-vm-al01, Error: Network error during communication with
the Host"
- Some tell that a volume is degraded when it is not (the gluster
commands show no issue, and even the oVirt tab about the volumes is
all green).
- "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN>
attached to the Data Center"
Just waiting a couple of seconds leads to a self-heal with no action.
- Repeated "Detected change in status of brick
serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP",
though absolutely no action was made on this filesystem.
At this time, zero VMs are running in this test datacenter, and no
action is being made on the hosts. Still, I see some looping errors
coming and going, and I find no way to diagnose them.
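In case it helps others reproduce my diagnosis attempts: so far the only places I have found to correlate these looping errors are the default log locations (standard paths for oVirt/vdsm/gluster, assuming an unmodified install):

```shell
# On the engine: watch for the storage-domain / network error events
tail -f /var/log/ovirt-engine/engine.log | grep -iE 'storage|network'

# On each host: vdsm is what actually reports the domain as
# inaccessible to the engine
tail -f /var/log/vdsm/vdsm.log | grep -iE 'storage|monitor'

# Gluster client and brick logs, in case the bricks really do flap
ls /var/log/glusterfs/
```

Nothing in there has pointed me to a clear root cause yet.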
Amongst the *actions* that I had the idea to use to solve some issues:
- I've found that trying to force the self-healing and playing with
gluster commands had no effect.
- I've found that playing with the gluster-advised actions ("find /gluster
-exec stat {} \; ...") seems to have no effect either.
- I've found that forcing ctdb to move the vIP ("ctdb stop", "ctdb
continue") DID SOLVE most of these issues.
I believe that it's not what ctdb is doing that helps, but maybe one of
its shell hooks that is cleaning up some trouble?
As this setup is complex, I'm not asking anyone for a silver bullet, but
maybe you know which layer is the most fragile, and which one I should
look at more closely?
--
Nicolas ECARNOT