[ovirt-users] ovirt+gluster+NFS : storage hiccups
Vered Volansky
vered at redhat.com
Thu Aug 6 09:08:06 UTC 2015
----- Original Message -----
> From: "Nicolas Ecarnot" <nicolas at ecarnot.net>
> To: "users at ovirt.org" <Users at ovirt.org>
> Sent: Wednesday, August 5, 2015 5:32:38 PM
> Subject: [ovirt-users] ovirt+gluster+NFS : storage hiccups
>
> Hi,
>
> I used the two links below to set up a test DC:
>
> http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
> http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-two/
>
> The only thing I did differently is that I did not use a hosted engine;
> instead, I dedicated a standalone server to it.
> So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).
>
> As in the docs above, my 3 hosts publish 300 GB of replicated
> gluster storage, on top of which ctdb manages a floating virtual IP;
> that IP serves the NFS share used as the master storage domain.
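> For reference, a quick sanity check of this layer can look like this
> (the volume name "data" is the one from this setup; adjust as needed):
>
>   # on any of the three hosts
>   gluster volume info data     # expect Type: Replicate, 1 x 3 = 3 bricks
>   gluster volume status data   # all bricks should be shown online
>   ctdb status                  # all ctdb nodes should be OK
>   ctdb ip                      # shows which node holds the floating vIP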
>
> The last point is that the manager also exports an NFS share that I'm
> using as an export domain.
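> (A note for anyone reproducing this: oVirt expects NFS domains to be
> owned by vdsm:kvm, i.e. uid/gid 36:36. A minimal /etc/exports entry in
> the spirit of the oVirt docs - the path here is only illustrative:
>
>   /exports/ovirt-export *(rw,anonuid=36,anongid=36,all_squash)
> )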
>
> It took me some time to wire all this up, as it is a bit more
> complicated than my other DC with a real SAN and no gluster, but it is
> eventually working (I can run VMs, migrate them...).
>
> I have run many harsh tests (from a very dumb user's point of view:
> unplug/replug the power cable of a server - does ctdb float the vIP?
> does gluster self-heal? does the VM restart?...).
> When looking closely at each layer one by one, everything seems correct:
> ctdb moves the IP quickly, NFS is OK, gluster seems to reconstruct,
> fencing eventually worked with the lanplus workaround, and
> so on...
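> For reference, the per-layer checks after such a test can be along
> these lines ("data" is the volume name from this setup, <floating-vIP>
> stands for the virtual address):
>
>   gluster volume heal data info   # pending heal entries should drain to 0
>   ctdb ip                         # the vIP should sit on a surviving node
>   showmount -e <floating-vIP>     # NFS should still answer on the vIP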
>
> But from time to time, a severe hiccup appears which I have great
> difficulty diagnosing.
> The messages in the web GUI are not very precise, and not consistent:
> - some report a host having network issues, but I can ping it
> from every place it needs to be reached from (especially from the SPM
> and the manager)
Ping doesn't tell us much, as ssh is the protocol actually being used.
Please try ssh instead and report back.
Please attach logs (engine + vdsm). Log snippets would be helpful, but full logs are more important.
In general it smells like an ssh/firewall issue.
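For example, from the engine machine (host name taken from your message below; checking vdsm's port 54321 is my addition, adjust to your setup):

  ssh root@serv-vm-al01 true && echo "ssh OK"
  nc -zv serv-vm-al01 54321   # vdsm's management port should also be reachable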
> "On host serv-vm-al01, Error: Network error during communication with
> the Host"
>
> - some say that a volume is degraded when it is not (gluster
> commands show no issue, and even the oVirt tab about the volumes is
> all green)
>
> - "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN>
> attached to the Data Center"
> Just waiting a couple of seconds leads to a self-heal, with no action on my part.
>
> - Repeated "Detected change in status of brick
> serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP."
> messages, while absolutely nothing is being done on this filesystem.
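> (For reference, one way to correlate these messages with gluster's own
> view - the log path is the usual CentOS 7 location, it may differ
> elsewhere:
>
>   gluster volume status data serv-vm-al03:/gluster/data/brick
>   tail -f /var/log/glusterfs/glustershd.log   # the self-heal daemon's view
> )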
>
> At this time, zero VMs are running in this test datacenter, and no
> action is being made on the hosts. Yet I see the same errors looping,
> coming and going, and I find no way to diagnose them.
>
> Amongst the *actions* I have tried to solve some of these issues:
> - I've found that trying to force the self-healing, and playing with
> gluster commands, had no effect
> - I've found that the commonly advised gluster action ("find /gluster
> -exec stat {} \; ...") seemed to have no effect either
> - I've found that forcing ctdb to move the vIP ("ctdb stop", "ctdb
> continue") DID SOLVE most of these issues.
> I believe it's not what ctdb is doing that helps, but maybe one of
> its shell hooks is cleaning up some trouble?
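> For reference, the sequence in question, run on the node currently
> holding the vIP; listing events.d is my way of pointing at those shell
> hooks (which script actually matters, e.g. 60.nfs, is an assumption):
>
>   ctdb stop                  # disable this node; the vIP fails over
>   ctdb continue              # re-enable the node
>   ctdb ip                    # check where the vIP landed
>   ls /etc/ctdb/events.d/     # the hook scripts ctdb runs on such events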
>
> As this setup is complex, I'm not asking anyone for a silver bullet,
> but maybe you know which layer is the most fragile, and which one I
> should look at more closely?
>
> --
> Nicolas ECARNOT
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>