[Users] oVirt/RHEV fencing; a single point of failure

Sat Mar 3 16:52:06 UTC 2012

Hello,

I am worried about the high availability approach taken by RHEV/oVirt.
I have read the thread titled "Some thoughts on enhancing High
Availability in oVirt" and can't help but feel that oVirt is missing
basic HA while its developers are considering adding (in my opinion
unneeded) complexity with service monitoring.

It all comes down to fencing. Picture this: 3 HP hypervisors running
RHEV/oVirt with iLO fencing. Say hypervisor A runs 10 VMs, all of which
are set to be highly available. Now suppose that hypervisor A has a
power failure or an iLO failure (I've seen it happen more than once with
a batch of HP DL380 G6s). Because RHEV cannot fence the hypervisor while
its iLO is unresponsive, those 10 HA VMs that were halted are NOT
restarted on other hypervisors automatically.

I suggest that oVirt concentrate on supporting multiple fencing devices
as a development priority. SCSI persistent reservation based fencing
would be an ideal secondary, if not primary, fencing device: it would be
easy for users to set up, as SANs generally support it, and it is proven
to work well, as seen on Red Hat clusters.
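
To make the idea concrete, here is a rough sketch in Python of what
"multiple fencing devices" could look like: try each configured device
in order and only give up when all of them fail. This is not oVirt or
RHEV code. Every name in it (fence_host, ilo_fence, scsi_pr_fence) is
hypothetical, and the two agents are stubs that only simulate a dead
iLO and a working SCSI reservation.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical model: a fence agent takes a host name and returns True
# once the host is confirmed fenced. Real agents would talk to hardware.
FenceAgent = Callable[[str], bool]


def fence_host(host: str,
               agents: Sequence[Tuple[str, FenceAgent]]) -> Optional[str]:
    """Try each fencing device in order; return the name of the first
    one that succeeds, or None if every device failed."""
    for name, agent in agents:
        try:
            if agent(host):
                return name
        except Exception:
            # An unreachable device is treated as a failed attempt,
            # so we fall through to the next configured device.
            continue
    return None


def ilo_fence(host: str) -> bool:
    # Stub simulating the scenario above: the iLO does not respond.
    raise ConnectionError("iLO unreachable on " + host)


def scsi_pr_fence(host: str) -> bool:
    # Stub standing in for SCSI persistent reservation fencing.
    return True


def ha_decision(host: str,
                agents: Sequence[Tuple[str, FenceAgent]]) -> str:
    """HA VMs may only be restarted elsewhere after a confirmed fence."""
    used = fence_host(host, agents)
    if used is None:
        return "not fenced: manual intervention required"
    return "fenced via " + used + ": HA VMs can be restarted elsewhere"
```

With only ("ilo", ilo_fence) configured, ha_decision() reports that
manual intervention is required, which is the situation I describe
above. Adding ("scsi_pr", scsi_pr_fence) as a second device lets the
same call succeed even though the iLO is still dead.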

I raised this point about fencing being a single point of failure in
RHEV with a Red Hat employee (Mark Wagner) during the RHEV virtual
event, but he said that it is not. I don't see how it isn't: a single
loose iLO cable and the VMs are stuck until there is manual
intervention.

Any thoughts?

-xrx