[Users] Opinions needed: 3 node gluster replica 3 | NFS async | snapshots for consistency
Ayal Baron
abaron at redhat.com
Tue Feb 25 07:42:20 UTC 2014
----- Original Message -----
> On 2/23/2014 3:20 PM, Ayal Baron wrote:
> > ----- Original Message -----
> >> On Sun, Feb 23, 2014 at 4:27 AM, Ayal Baron <abaron at redhat.com> wrote:
> >>> ----- Original Message -----
> >>>> I'm looking for some opinions on this configuration in an effort to
> >>>> increase
> >>>> write performance:
> >>>>
> >>>> 3 storage nodes using glusterfs in replica 3, quorum.
> >>> gluster doesn't support replica 3 yet, so I'm not sure how heavily I'd
> >>> rely on this.
> >>>
> >> Glusterfs or RHSS doesn't support rep 3? How could I create a quorum
> >> without 3+ hosts?
> > glusterfs has the capability but it hasn't been widely tested with oVirt
> > yet and we've already found a couple of issues there.
> > afaiu gluster has the ability to define a tie breaker (a third node which
> > is part of the quorum but does not provide a third replica of the data).
>
> I've been researching glusterfs for quite a while, and have had a 3 node
> replica up and running, but I have never heard of a "tie breaker" node. Is
> this documented anywhere? I could use something like that.
Deferring to Vijay who can give much more qualified answers on these matters.
>
> I will be testing a 3-node ovirt + gluster setup, hopefully yet this week.
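For anyone setting up such a test, the bare replica 3 volume would look
roughly like this (the hostnames, volume name and brick paths below are made
up):

    # a plain 3-way replicated volume across three storage nodes
    gluster volume create vmstore replica 3 \
        node1:/export/brick1 node2:/export/brick1 node3:/export/brick1
    gluster volume start vmstore
    # sanity check: all three bricks should show up under the volume
    gluster volume info vmstore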
> >>
> >>>> Ovirt storage domain via NFS
> >>> why NFS and not gluster?
> >>>
> >> Gluster via posix SD doesn't have any performance gains over NFS, maybe
> >> the opposite.
> > gluster via posix is mounting it using the gluster fuse client which should
> > provide better performance + availability than NFS.
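To make the distinction concrete: a posix/gluster SD ends up doing a fuse
mount on the hypervisor, while an NFS SD talks to a single gluster NFS
server. Roughly (hostname, volume name and mount points below are made up):

    # native (fuse) client mount -- the client connects to every brick and
    # performs the replication itself
    mount -t glusterfs node1:/vmstore /mnt/vmstore

    # NFS v3 mount of the same volume -- all I/O funnels through the one
    # gluster NFS server behind this address (or a CTDB floating IP)
    mount -t nfs -o vers=3,tcp node1:/vmstore /mnt/vmstore-nfs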
> >
> >> Gluster 'native' SDs are broken on EL6.5, so I have been unable to test
> >> performance. I have heard performance can be upwards of 3x NFS for raw
> >> write.
> > Broken how?
> >
> >> Gluster doesn't have an async write option, so it's doubtful it will ever
> >> be close to NFS async speeds.
> >>
> >>
> >>>> Volume set nfs.trusted-sync on
> >>>> On Ovirt, taking snapshots often enough to recover from a storage crash
> >>> Note that this would have negative write performance impact
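For anyone reproducing this: nfs.trusted-sync is a per-volume gluster option
(the volume name below is made up); it makes the gluster NFS server ack
writes before they are actually committed to disk:

    # enable async-style NFS acks for the volume
    gluster volume set vmstore nfs.trusted-sync on
    # revert to the (safer) default behaviour
    gluster volume reset vmstore nfs.trusted-sync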
> >>>
> >> The difference between NFS sync (<50MB/s) and async (>300MB/s on 10g) write
> >> speeds should more than compensate for the performance hit of taking
> >> snapshots more often. And that's just raw speed. If we take into
> >> consideration IOPS (guest small writes), async is leaps and bounds ahead.
> > I would test this, since qemu is already doing async I/O (using threads
> > when native AIO is not supported) and oVirt runs it with cache=none
> > (direct I/O) so sync ops should not happen that often (depends on guest).
> > You may still be enjoying a performance boost, but I've seen UPS systems
> > fail before, bringing down multiple nodes at once.
> > In addition, if you do not guarantee your data is safe when you create a
> > snapshot (and it doesn't seem like you do), then I see no reason to think
> > your snapshots are any better off than the latest state on disk.
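To illustrate the caching point: oVirt drives qemu through libvirt, but the
disk ends up attached roughly like this (the path and memory size below are
made up, and the exact flags are illustrative only):

    # cache=none => O_DIRECT, so the host page cache is bypassed; the guest
    # only sees a write as complete once the storage stack acks it
    qemu-kvm -m 2048 \
        -drive file=/var/lib/images/vm1.img,if=virtio,cache=none,aio=threads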
> >
> >>
> >> If we assume the site has backup UPS and generator power and we can build a
> >> highly available storage system with 3 nodes in quorum, are there any
> >> potential issues other than a write performance hit?
> >>
> >> The issue I thought might be most likely is that if an ovirt host goes down
> >> and the VMs are automatically brought back up on another host, they could
> >> incur disk corruption and need to be brought back down and restored to the
> >> last snapshot state. This basically means the HA feature should be
> >> disabled.
> > I'm not sure I understand what your concern is here; what would cause the
> > data corruption? If your node crashed then there is no I/O in flight, so
> > starting up the VM should be perfectly safe.
> It seems to me that either the VM can start and clean up its own disk, or it
> can't, the same as a bare-metal computer after a crash. I have not experienced
> any "additional" corruption opportunities. I see no reason not to use the
> HA. The worst that happens is that the boot hangs and you have to revert to a
> snapshot. My experience (in bare metal and other virtualization
> environments) is that about 98% of the time the computer will reboot after an
> unclean shutdown (power failure or virtual equivalent). I have done this to
> virtual machines more times than I want to admit.
>
> Maybe you need to tell us more about what you have in mind as far as
> corruption, so that we can either confirm or debunk your concerns.
> >
> >> Even worse, if the gluster node with the CTDB NFS IP goes down, it may not
> >> have written out and replicated its data to its peers. <-- I think I may
> >> have just answered my own question.
> > If 'trusted-sync' means that the CTDB NFS node acks the I/O before it has
> > reached quorum, then I'd say that's a gluster bug. It should ack the I/O
> > before the data hits the disk, but it should not ack it before it has quorum.
> > However, the configuration we feel comfortable running gluster with is one
> > with both server-side and client-side quorum enabled (gluster has 2 different
> > configs and you need to configure both to work safely).
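The two configs are separate volume options (the volume name below is made
up):

    # client-side quorum: the gluster client refuses writes unless a
    # majority of the replicas is reachable
    gluster volume set vmstore cluster.quorum-type auto
    # server-side quorum: bricks on a node stop serving if that node loses
    # quorum with the rest of the trusted pool
    gluster volume set vmstore cluster.server-quorum-type server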
> Gluster does not "replicate to its peers". Gluster writes to all peers at
> the same time, as part of the original write process. If quorum is on, the
> process either works or it doesn't.
This is only true when using the native gluster client (fuse or libgfapi). If
you are mounting gluster over NFS then you are working against a single gluster
NFS server, and your client is oblivious to the replicas, the hashing and the
need to send the data to multiple servers. In that case the server you mounted
against serves as a gateway: *it* is the one acting as the gluster client and
replicating the I/O to the gluster servers (in fact, that server might not even
hold a copy of the data itself).
>
> I assume you are keeping in mind that the kernel NFS server does not get
> along with gluster. You need to run gluster's own NFS server and turn off
> the kernel NFS server. Gluster's own NFS server is gluster-aware, so I think
> some of the problems you envision may already be handled by that server.
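On EL6 that boils down to something like the following (the volume name is
made up, and exact service names may differ per distribution):

    # get the kernel NFS server out of the way
    service nfs stop
    chkconfig nfs off
    # gluster's built-in NFS server still needs rpcbind
    service rpcbind start
    # make sure gluster NFS is enabled for the volume and actually running
    gluster volume set vmstore nfs.disable off
    gluster volume status vmstore nfs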
>
> Ted Miller
> Elkhart, IN, USA
>
>