Strange fencing behaviour 3.5.3

11 Sep 2015

      Hello,

I manage 2 oVirt clusters that are not associated in any way, they each 
have their own management engine running ovirt-engine-3.5.3.1-1.  The 
servers are Dell 6xx series and the power-management is configured using 
idrac5 settings and each cluster is a pair of hypervisors.

The engines are both in a datacenter that had an electrical issue, each 
cluster is at a different unrelated location.  The problem I had was 
caused by a downed switch causing the individual engines to continue to 
function, however no longer have connectivity to their respective 
clusters.  Once the switch was replaced (about 30 minutes of downtime) , 
when connectivity was resumed, both engines chose to fence one of the 
two "unresponsive hypervisors" by sending an iDrac command to power down.

The downed hypervisor Cluster1 for some reason, 8 minutes later, got a 
iDrac command to power-up again.  When I logged into the engine, the 
guests that were running on the powered-down host were in "off" state.  
I simply powered them back on.

The downed hypervisor on Cluster2 stayed off, and was unresponsive 
according to the engine, however the VMs that were running on it were in 
an unknown state.  I had to power on the host and click the "host has 
been rebooted" dialog for the cluster to free these guests to be booted 
again.

My question is, is it normal for the engine to fence one or more hosts 
when it loses connectivity to all thehypervisors in the cluster?  Is 
there a minimum of 3 hosts in a cluster for it to not fall into this 
mode?    I'd like to know what I can troubleshoot or how I can avoid an 
issue like this should the engine be disconnected from the hypervisors 
temporarily and then resume connectivity only to kill the well-running 
guests.

Thanks in advance,

Marty

Martin Breault

Martin Perina

tags

participants (2)