Hello,
I was worried about the high availability approach taken by RHEV/oVirt.
I had read the thread titled "Some thoughts on enhancing High
Availability in oVirt" but couldn't help but feel that oVirt is missing
basic HA while its developers are considering adding (in my opinion
unneeded) complexity with service monitoring.
It all comes down to fencing. Picture this: 3 HP hypervisors running
RHEV/oVirt with iLO fencing. Say hypervisor A runs 10 VMs, all of which
are set to be highly available. Now suppose that hypervisor A has a
power failure or an iLO failure (I've seen it happen more than once with
a batch of HP DL380 G6s). Because RHEV cannot fence the hypervisor
while its iLO is unresponsive, those 10 HA VMs that were halted are NOT
moved to other hypervisors automatically.
I suggest that oVirt make support for multiple fencing devices a
development priority. SCSI persistent reservation based fencing would be
an ideal secondary, if not primary, fencing device. It would be easy for
users to set up, as SANs generally support it, and it is proven to work
well, as seen on Red Hat clusters.
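
To make the idea concrete, here is a rough sketch of the fallback logic
I have in mind. It is not oVirt code: fence_host, ilo_fence and
scsi_pr_fence are hypothetical names, and the two agents are stubs that
only simulate an unreachable iLO and a successful SCSI reservation
fence.

```python
from typing import Callable, List, Sequence, Tuple

# A fence agent is a (name, callable) pair. The callable returns True only
# when the host is confirmed cut off (powered off or blocked from storage).
FenceAgent = Tuple[str, Callable[[str], bool]]


def fence_host(host: str, agents: Sequence[FenceAgent]) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Try each fence agent in order and stop at the first one that succeeds."""
    attempts = []
    for name, agent in agents:
        try:
            ok = agent(host)
        except Exception:
            # An unreachable fence device counts as a failed attempt,
            # not as a reason to give up on fencing altogether.
            ok = False
        attempts.append((name, ok))
        if ok:
            return True, attempts
    return False, attempts


def ilo_fence(host: str) -> bool:
    # Hypothetical stub: simulates a dead or unplugged iLO.
    raise ConnectionError("iLO unreachable on " + host)


def scsi_pr_fence(host: str) -> bool:
    # Hypothetical stub: simulates a successful SCSI reservation fence.
    return True


fenced, attempts = fence_host(
    "hypervisor-a", [("ilo", ilo_fence), ("scsi_pr", scsi_pr_fence)]
)
```

Only when fenced is True would it be safe to restart the HA VMs on
another hypervisor. With a single agent in the list, a dead iLO leaves
fenced False and the VMs stay down, which is exactly the situation
described above.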
I brought up this point about fencing being a single point of failure
in RHEV with a Red Hat employee (Mark Wagner) during the RHEV virtual
event, but he said that it is not. I don't see how it isn't: a single
loose iLO cable and the VMs are stuck until there is manual
intervention.
Any thoughts?
-xrx