
Hi,

Sorry for the late response, I somehow missed your email :-( I cannot completely understand your exact issue from the description, but the situation when the engine loses connection to all hypervisors is always bad. Fortunately we made a few improvements in 3.5 which should help in those scenarios. Please take a look at the "Fencing policy" tab in the "Edit cluster" dialog:

1. Skip fencing if host has live lease on storage - when a host is connected to storage, it has to renew its storage lease at least every 60 seconds. So if this option is enabled and the engine tries to fence the host using a fence proxy (another host in the cluster/DC which has a good connection), the fence proxy checks whether the non-responsive host renewed its storage lease within the last 90 seconds. If the lease was renewed, fencing is aborted.

2. Skip fencing on cluster connectivity issues - if this option is enabled, then prior to fencing the engine checks how many of the hosts in the cluster have connectivity issues. If the percentage of hosts with connectivity issues is higher than the specified threshold, fencing is aborted. Of course, this option is of little use in clusters with fewer than 3 hosts.

3. Enable fencing - by disabling this option you can completely disable fencing for hosts in the cluster. This is useful in situations where you expect connectivity issues between the engine and the hosts (for example during a switch replacement): you can disable fencing, replace the switch and, when connectivity is restored, enable fencing again. However, if you disable fencing completely, your HA VMs won't be restarted on different hosts, so please use this option with caution.
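For what it's worth, here is a rough Python sketch of how those three checks combine before a host actually gets fenced. It is not the real engine code: all class and field names are made up for illustration, the 90 second lease age comes from item 1 above, and the 50% threshold is just an arbitrary example value.

```python
from dataclasses import dataclass
from typing import List

# All names below are made up for illustration; they do not mirror the
# real ovirt-engine classes or configuration keys.

STORAGE_LEASE_MAX_AGE = 90  # seconds within which the lease must have been renewed


@dataclass
class FencingPolicy:
    fencing_enabled: bool = True               # "Enable fencing"
    skip_if_sd_active: bool = True             # "Skip fencing if host has live lease on storage"
    skip_if_connectivity_broken: bool = True   # "Skip fencing on cluster connectivity issues"
    broken_connectivity_threshold: int = 50    # percentage of hosts in the cluster


@dataclass
class Host:
    name: str
    responsive: bool
    storage_lease_age: float  # seconds since the host last renewed its storage lease


def should_fence(target: Host, cluster_hosts: List[Host], policy: FencingPolicy) -> bool:
    """Return True only if none of the fencing-policy checks abort fencing."""
    if not policy.fencing_enabled:
        return False  # fencing disabled for the whole cluster

    if policy.skip_if_sd_active and target.storage_lease_age <= STORAGE_LEASE_MAX_AGE:
        # The host is still renewing its storage lease, so it is alive even
        # though the engine cannot reach it over the network.
        return False

    if policy.skip_if_connectivity_broken:
        # As noted in item 2, this check is of little use with fewer than 3 hosts.
        broken = sum(1 for h in cluster_hosts if not h.responsive)
        broken_pct = 100.0 * broken / len(cluster_hosts)
        if broken_pct > policy.broken_connectivity_threshold:
            # Most of the cluster looks unreachable, so the problem is more
            # likely the network or the engine itself than this one host.
            return False

    return True


# A downed switch makes both hosts of a 2-host cluster look dead, but they
# still renew their storage leases, so fencing is skipped.
hosts = [Host("host1", responsive=False, storage_lease_age=30),
         Host("host2", responsive=False, storage_lease_age=30)]
print(should_fence(hosts[0], hosts, FencingPolicy()))  # -> False
```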
From: "Martin Breault" <martyb@creenet.com> To: users@ovirt.org Sent: Friday, September 11, 2015 9:14:23 PM Subject: [ovirt-users] Strange fencing behaviour 3.5.3
Hello,
I manage 2 oVirt clusters that are not associated in any way; they each have their own management engine running ovirt-engine-3.5.3.1-1. The servers are Dell 6xx series, power management is configured using idrac5 settings, and each cluster is a pair of hypervisors.
The engines are both in a datacenter that had an electrical issue; each cluster is at a different, unrelated location. The problem I had was caused by a downed switch: the individual engines continued to function but no longer had connectivity to their respective clusters. Once the switch was replaced (about 30 minutes of downtime) and connectivity was resumed, both engines chose to fence one of the two "unresponsive hypervisors" by sending an iDRAC command to power down.
The downed hypervisor on Cluster1, for some reason, got an iDRAC command to power up again 8 minutes later. When I logged into the engine, the guests that had been running on the powered-down host were in the "off" state. I simply powered them back on.
The downed hypervisor on Cluster2 stayed off and was unresponsive according to the engine; however, the VMs that had been running on it were in an unknown state. I had to power on the host and confirm the "host has been rebooted" dialog before the cluster would free these guests to be booted again.
My question is, is it normal for the engine to fence one or more hosts when it loses connectivity to all the hypervisors in the cluster? Is a minimum of 3 hosts per cluster required to avoid this behaviour? I'd like to know what I can troubleshoot, or how I can avoid an issue like this should the engine be disconnected from the hypervisors temporarily and then resume connectivity, only to kill the well-running guests.
Thanks in advance,
Marty