[Users] Opinions needed: 3 node gluster replica 3 | NFS async | snapshots for consistency

Ayal Baron abaron at redhat.com
Mon Feb 24 08:20:54 UTC 2014



----- Original Message -----
> On Sun, Feb 23, 2014 at 3:20 PM, Ayal Baron <abaron at redhat.com> wrote:
> 
> >
> >
> > ----- Original Message -----
> > > On Sun, Feb 23, 2014 at 4:27 AM, Ayal Baron <abaron at redhat.com> wrote:
> > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > I'm looking for some opinions on this configuration in an effort to
> > > > > increase write performance:
> > > > >
> > > > > 3 storage nodes using glusterfs in replica 3, quorum.
> > > >
> > > > gluster doesn't support replica 3 yet, so I'm not sure how heavily I'd
> > > > rely on this.
> > > >
> > >
> > > Glusterfs or RHSS doesn't support rep 3? How could I create a quorum
> > > without 3+ hosts?
> >
> > glusterfs has the capability but it hasn't been widely tested with oVirt
> > yet and we've already found a couple of issues there.
> > afaiu gluster has the ability to define a tie breaker (a third node which
> > is part of the quorum but does not provide a third replica of the data).
> >
> 
> Good to know, I'll dig into this.
> 
> 
> >
> > >
> > >
> > > >
> > > > > oVirt storage domain via NFS
> > > >
> > > > why NFS and not gluster?
> > > >
> > >
> > > Gluster via posix SD doesn't have any performance gains over NFS, maybe
> > > the opposite.
> >
> > gluster via posix is mounting it using the gluster fuse client which
> > should provide better performance + availability than NFS.
> >
> 
> Availability for sure, but performance is seriously questionable. I've run
> in both scenarios and haven't seen a performance improvement; the general
> consensus seems to be that fuse adds overhead and therefore decreases
> performance vs. NFS.

The fuse client has two performance aspects:
1. latency overhead (this can only be fixed using libgfapi, see below for more on that)
2. throughput - using the fuse mount we've been able to max out 10G ethernet. Per VM you don't need that kind of throughput, but across many VMs the way to get better aggregate throughput is simply to have multiple mounts.
This can be done by either:
1. creating multiple storage domains (on the same gluster) so you get better overall throughput (a rough sketch follows below)
2. adding code to vdsm to support multiple mounts per gluster volume and round robin the VMs between them (it is run-time round robin, simply which path we pass to libvirt).
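As a rough sketch of the first option (volume names, hostnames and brick paths below are placeholders, not a recommendation):

    # Two replica-3 volumes on the same 3-node pool; each one is added to
    # oVirt as a separate storage domain and therefore gets its own fuse mount.
    gluster volume create vmstore1 replica 3 \
        node1:/bricks/b1/vmstore1 node2:/bricks/b1/vmstore1 node3:/bricks/b1/vmstore1
    gluster volume create vmstore2 replica 3 \
        node1:/bricks/b2/vmstore2 node2:/bricks/b2/vmstore2 node3:/bricks/b2/vmstore2
    gluster volume start vmstore1
    gluster volume start vmstore2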

> 
> 
> >
> > >
> > > Gluster 'native' SDs are broken on EL6.5, so I have been unable to test
> > > performance. I have heard performance can be upwards of 3x NFS for raw
> > > write.
> >
> > Broken how?
> >
> 
> Ongoing issues: libgfapi support wasn't available, and then it was disabled
> because snapshot support, which it depends on, wasn't built into the kvm
> packages. There are a few threads in reference to this, and some effort
> to get CentOS builds to enable snapshot support in kvm.
> 
> I have installed rebuilt qemu packages with the RHEV snapshot flag enabled,
> and was just able to create a native gluster SD, so maybe I missed something
> during a previous attempt. I'll test performance and see if it's close to
> what I'm looking for.

Just wanted to make sure that we're talking about the same thing.  Creating a gluster storage domain will still use fuse in the current version of oVirt, but the moment libgfapi works well with snapshots (libvirt has patches which we're testing) we'll update vdsm to use libgfapi behind the scenes, so using a gluster storage domain will automatically lead to usage of libgfapi once it is stable.
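To illustrate the difference (paths, hostnames and volume names below are only illustrative): today qemu reaches the image through the fuse mount point, while with libgfapi it would use its built-in gluster client and a gluster:// URL instead, roughly:

    # fuse (current gluster storage domain): image is opened via the mount point
    qemu-img info /rhev/data-center/mnt/glusterSD/node1:_vmstore1/<sd-uuid>/images/<img-uuid>/<vol-uuid>

    # libgfapi: qemu talks to gluster directly, no fuse in the data path
    # (requires a qemu build with gluster support)
    qemu-img info gluster://node1/vmstore1/<sd-uuid>/images/<img-uuid>/<vol-uuid>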

> 
> 
> >
> > >
> > > Gluster doesn't have an async write option, so it's doubtful it will ever
> > > be close to NFS async speeds.
> > >
> > >
> > > >
> > > > > Volume set nfs.trusted-sync on
> > > > > On oVirt, taking snapshots often enough to recover from a storage
> > > > > crash
> > > >
> > > > Note that this would have negative write performance impact
> > > >
> > >
> > > The difference between NFS sync (<50MB/s) and async (>300MB/s on 10g)
> > > write speeds should more than compensate for the performance hit of taking
> > > snapshots more often. And that's just raw speed. If we take into
> > > consideration IOPS (guest small writes), async is leaps and bounds ahead.
> >
> > I would test this, since qemu is already doing async I/O (using threads
> > when native AIO is not supported) and oVirt runs it with cache=none (direct
> > I/O), so sync ops should not happen that often (depends on the guest).  You
> > may still be enjoying a performance boost, but I've seen UPS systems fail
> > before, bringing down multiple nodes at once.
> > In addition, if you do not guarantee your data is safe when you create a
> > snapshot (and it doesn't seem like you are), then I see no reason to think
> > your snapshots are any better off than the latest state on disk.
> >
> 
> My logic here was that if a snapshot is run, then the disk and system state
> should be consistent at the time of the snapshot, once it's been written to
> storage.

Correct, but you have no guarantee that it is actually written to storage.  That is what's giving you the better performance.
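For what it's worth, both points are easy to double-check on a running setup (the volume name below is a placeholder):

    # is the volume set to ack NFS writes before they are on disk?
    gluster volume info vmstore1 | grep nfs.trusted-sync

    # is oVirt really running the guests with cache=none (direct I/O)?
    ps -ef | grep qemu-kvm | grep -o 'cache=[a-z]*' | sort | uniq -c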

> If the host failed during snapshot then the snapshot would be incomplete,
> and the last complete snapshot would need to be used for recovery.
> 
> 
> >
> > >
> > >
> > > If we assume the site has backup UPS and generator power and we can
> > > build a highly available storage system with 3 nodes in quorum, are there
> > > any potential issues other than a write performance hit?
> > >
> > > The issue I thought might be most prevalent is that if an oVirt host goes
> > > down and the VMs are automatically brought back up on another host, they
> > > could incur disk corruption and need to be brought back down and restored
> > > to the last snapshot state. This basically means the HA feature should be
> > > disabled.
> >
> > I'm not sure I understand what your concern is here; what would cause the
> > data corruption? If your node crashed then there is no I/O in flight, so
> > starting up the VM should be perfectly safe.
> >
> 
> Good point, that makes sense.
> 
> 
> >
> > >
> > > Even worse, if the gluster node with the CTDB NFS IP goes down, it may not
> > > have written out and replicated to its peers.  <-- I think I may have just
> > > answered my own question.
> >
> > If 'trusted-sync' means that the CTDB NFS node acks the I/O before it has
> > reached quorum, then I'd say that's a gluster bug.
> 
> 
> http://gluster.org/community/documentation/index.php/Gluster_3.2:_Setting_Volume_Options#nfs.trusted-sync
> specifically mentions that data won't be guaranteed to be on disk, but doesn't
> mention whether data would be replicated in memory between gluster nodes.
> Technically async breaks the NFS protocol standard anyway, but this seems
> like a question for the gluster guys; I'll reach out on freenode.

Please reply back here with the info; I'm sure it would interest people on this list.
Once you reach a working configuration, it would also be great if you could do a quick writeup of the config in the oVirt wiki as a reference.

> 
> 
> >  It should ack the I/O before data hits the disk, but it should not ack it
> > before it has quorum.
> > However, the configuration we feel comfortable running gluster with is with
> > both server and client quorum enabled (gluster has 2 different configs and
> > you need to configure both to work safely).
> >
> 
> Is this specifically in relation to posix/fuse mounts? With libgfapi does
> the host pick up the config from the server side?

Afaiu libgfapi is simply a gluster client embedded in qemu, so it should have the same options available; but again, this is something that needs to be checked with the gluster folks.
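For reference, the two quorum knobs mentioned above are set roughly like this (the volume name is a placeholder; please verify the exact semantics against the gluster docs):

    # client-side quorum: writes are only allowed while a majority of each
    # replica set is reachable from the client
    gluster volume set vmstore1 cluster.quorum-type auto

    # server-side quorum: glusterd kills the bricks on a node that has lost
    # quorum with the rest of the trusted pool; a brickless peer (the tie
    # breaker mentioned earlier) still counts toward this quorum
    gluster volume set vmstore1 cluster.server-quorum-type server
    gluster volume set all cluster.server-quorum-ratio 51%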

> 
> 
> >
> >
> > >
> > >
> > > Thanks,
> > > Steve
> > >
> >
> 


