[Users] oVirt/RHEV fencing; a single point of failure

Andrew Cathrow acathrow at redhat.com
Sat Mar 3 20:18:08 UTC 2012



----- Original Message -----
> From: "Perry Myers" <pmyers at redhat.com>
> To: "xrx" <xrx-ovirt at xrx.me>, "Ryan O'Hara" <rohara at redhat.com>, "Andrew Beekhof" <abeekhof at redhat.com>
> Cc: users at ovirt.org
> Sent: Saturday, March 3, 2012 3:16:02 PM
> Subject: Re: [Users] oVirt/RHEV fencing; a single point of failure
> 
> On 03/03/2012 11:52 AM, xrx wrote:
> > Hello,
> > 
> > I was worried about the high availability approach taken by
> > RHEV/oVirt. I had read the thread titled "Some thoughts on enhancing
> > High Availability in oVirt" but couldn't help but feel that oVirt is
> > missing basic HA while its developers are considering adding (in my
> > opinion unneeded) complexity with service monitoring.
> 
> Service monitoring is a highly desirable feature, but for the most
> part (today) people achieve it by running service monitoring in a
> layered fashion.
> 
> For example, running the RHEL HA cluster stack on top of VMs on RHEV
> (or Fedora Clustering on top of oVirt VMs).
> 
> So we could certainly skip providing service HA as an integral feature
> of oVirt and continue to leverage Pacemaker-style service HA as a
> layered option instead.
> 
> In the past I've gotten the impression that tighter integration and a
> single UI/API for managing both VM and service HA were desirable.
> 
> > It all comes down to fencing. Picture this: 3 HP hypervisors running
> > RHEV/oVirt with iLO fencing. Say hypervisor A runs 10 VMs, all of
> > which are set to be highly available. Now suppose that hypervisor A
> > has a power failure or an iLO failure (I've seen it happen more than
> > once with a batch of HP DL380 G6s). Because RHEV would not be able to
> > fence the hypervisor while its iLO is unresponsive, those 10 HA VMs
> > that were halted are NOT moved to other hypervisors automatically.
> > 
> > I suggest that oVirt concentrate on supporting multiple fencing
> > devices as a development priority. SCSI persistent reservation based
> > fencing would be an ideal secondary, if not primary, fencing method;
> > it would be easy for users to set up since SANs generally support it,
> > and it has proven to work well, as seen in Red Hat clusters.
> 
> Completely agree here.  The Pacemaker/rgmanager cluster stacks already
> support an arbitrary number of fence devices per host, to handle both
> redundant power supplies and redundant fencing devices.  In order to
> provide resilient service HA, fixing this would be a prerequisite
> anyhow.  I've cc'd Andrew Beekhof from Pacemaker/stonith_ng, since I
> think it might be useful to model the fencing for oVirt on how
> Pacemaker/stonith_ng does it.  Perhaps there's even some code that
> could be reused for this as well.
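
For illustration, here is a minimal, hypothetical Python sketch of that
fall-through behaviour (not vdsm or Pacemaker code): each host has an
ordered list of fence devices, and the next device is tried only when
the previous one fails. The agent names and options are placeholders.

    import subprocess

    # Ordered fence devices for one host; commands are illustrative only.
    FENCE_DEVICES = [
        ("iLO",       ["fence_ilo", "-a", "host-a-ilo.example.com",
                       "-l", "admin", "-p", "secret", "-o", "off"]),
        ("SCSI-3 PR", ["fence_scsi", "-n", "host-a", "-o", "off"]),
    ]

    def fence_host(devices):
        """Return True as soon as any fence device confirms the host is down."""
        for name, cmd in devices:
            try:
                subprocess.check_call(cmd, timeout=30)
                print("fenced via %s" % name)
                return True
            except (subprocess.CalledProcessError,
                    subprocess.TimeoutExpired, OSError):
                print("%s fencing failed, trying next device" % name)
        return False

    if not fence_host(FENCE_DEVICES):
        print("all fence devices failed; HA VMs stay blocked")
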
> 
> As for SCSI-3 PR based fencing... the trouble here has been that the
> fence_scsi script provided in fence-agents is Perl based, and we were
> hesitant to drag Perl into the list of required things on oVirt Node
> (and in general).
> 
> On the other hand, fence_scsi might not be the right level of
> granularity for oVirt-based SCSI-3 PR fencing anyhow.  Perhaps better
> would be to have vdsm call sg_persist commands directly.
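
As a rough sketch of what "have vdsm call sg_persist directly" could
look like, in Python (the language vdsm is written in): this is not
vdsm code, the device path and keys below are placeholders, and a real
implementation would derive keys from host IDs and cover every LUN in
the storage domain.

    import subprocess

    DEVICE = "/dev/sdb"   # shared LUN backing the storage domain (placeholder)
    MY_KEY = "0x1"        # this host's registration key (placeholder)
    VICTIM_KEY = "0x2"    # registration key of the host being fenced

    def sg_persist(args):
        subprocess.check_call(["sg_persist"] + args + [DEVICE])

    # Register our key with the device (done once when the host activates).
    sg_persist(["--out", "--register", "--param-sark=" + MY_KEY])

    # Fence the failed host: preempt-and-abort its registration so the
    # device rejects further writes from it (type 5 = write exclusive,
    # registrants only).
    sg_persist(["--out", "--preempt-abort", "--prout-type=5",
                "--param-rk=" + MY_KEY, "--param-sark=" + VICTIM_KEY])

    # Verify: list the keys still registered with the device.
    subprocess.check_call(["sg_persist", "--in", "--read-keys", DEVICE])
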
> 
> I've cc'd Ryan O'Hara, who wrote fence_scsi and knows a fair bit about
> SCSI-3 PR.  If oVirt is interested in pursuing this, perhaps he can be
> of assistance.

There's also sanlock, which plays a role here. In the past we required some form of fencing action, but once sanlock is integrated it provides another path.
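
The idea, illustrated below with a toy Python sketch (not the actual
sanlock API), is that each host must keep renewing a lease held on
shared storage; if a host stops renewing, its lease expires after a
known timeout and another host can safely take over the VM without any
fencing action being needed.

    import time

    LEASE_TIMEOUT = 80  # seconds (placeholder value)

    class Lease:
        def __init__(self):
            self.owner = None
            self.last_renewal = 0.0

        def renew(self, host):
            if self.owner in (None, host):
                self.owner = host
                self.last_renewal = time.time()
                return True
            return False

        def acquire(self, host):
            # A new owner may take the lease only once the old one has
            # demonstrably stopped renewing it for the full timeout.
            if self.owner is None or time.time() - self.last_renewal > LEASE_TIMEOUT:
                self.owner = host
                self.last_renewal = time.time()
                return True
            return False

    lease = Lease()
    lease.renew("host-a")           # host-a runs the VM and keeps renewing
    # ... host-a loses power and stops renewing ...
    print(lease.acquire("host-b"))  # False until LEASE_TIMEOUT has elapsed
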

> 
> > I have brought up this point about fencing being a single point of
> > failure in RHEV with a Red Hat employee (Mark Wagner) during the RHEV
> > virtual event, but he said that it is not. I don't see how it isn't:
> > a single loose iLO cable and the VMs are stuck until there is manual
> > intervention.
> 
> Agreed.  This is something that should be easily fixed in order to
> provide greater HA.
> 
> That being said, I still think more tightly integrated service HA is a
> good idea as well.
> 
> Perry
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
> 


