Need advice using pacemaker on VMs for application HA

Hi,
Excuse the English :)
Need some advice regarding the use of high availability software on VMs on oVirt.
We are migrating a physical highly available application to oVirt. The HA platform uses pacemaker; it contains two nodes and shared storage, and fencing (stonith) is configured to use iLO.
I know that oVirt offers HA for VMs, but this HA is not application-aware: if a service crashes on the VM, it will not be detected. Fencing could be achieved with the rhev fence agent.
My questions are about the best way to migrate this platform to oVirt.
- Is it a good idea to make both VMs (formerly nodes) Highly-Available VMs? Or maybe pin each one of them to a particular hypervisor and/or use VM-Affinity?
- I am thinking about the situation where the hypervisor containing one of the VMs crashes: what will be the behavior of the fence agent on the application?
- if the crashed VM is not HA, it will not start on another hypervisor, so the fence agent will try to fence a VM that does not exist anymore, and it will get stuck.
- if the crashed VM is HA, it will be started on another hypervisor, but what will happen with the fence agent? I think that one VM will fence the other one, and the application will still be unreachable for a longer period.
- What about the shared storage? We will use a shared disk on oVirt, which does not support snapshots.
- What are the things to avoid?
Any suggestions will be appreciated.
Regards.

On Thu, May 24, 2018 at 2:20 PM wodel youchi <wodel.youchi@gmail.com> wrote:
Hi,
Excuse the English :)
Need some advice regarding the use of high availability software on VMs on oVirt.
We are migrating a physical highly available application to oVirt. The HA platform uses pacemaker; it contains two nodes and shared storage, and fencing (stonith) is configured to use iLO.
I know that oVirt offers HA for VMs, but this HA is not application-aware: if a service crashes on the VM, it will not be detected. Fencing could be achieved with the rhev fence agent.
My questions are about the best way to migrate this platform to oVirt.
- Is it a good idea to make both VMs (formerly nodes) Highly-Available VMs? Or maybe pin each one of them to a particular hypervisor and/or use VM-Affinity?
If the VMs should be highly available, you should not pin them to any host. Pinning them guarantees that a VM will *not* be available when its host is down :-)
So you probably want to use an HA VM with a VM lease.
Warning: do not use an HA VM without a VM lease; without the lease you will eventually end up with split-brain.
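For illustration only (not part of the original reply): a minimal Python sketch of what "HA VM with a VM lease" could look like as a request body for the oVirt REST API. The element names (`high_availability`, `lease`, `storage_domain`) follow my reading of the v4 API's VM type and should be checked against your engine's API documentation; the function name and the storage domain id are hypothetical.

```python
import xml.etree.ElementTree as ET


def ha_with_lease_payload(storage_domain_id: str) -> str:
    """Build an XML body that enables HA and places a VM lease.

    Assumption: the body would be sent as an update of an existing VM;
    the storage domain id says where the lease is stored.
    """
    vm = ET.Element("vm")
    ha = ET.SubElement(vm, "high_availability")
    ET.SubElement(ha, "enabled").text = "true"
    lease = ET.SubElement(vm, "lease")
    ET.SubElement(lease, "storage_domain", id=storage_domain_id)
    return ET.tostring(vm, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical storage domain id, for demonstration only.
    print(ha_with_lease_payload("00000000-0000-0000-0000-000000000000"))
```

The sketch only builds the payload; it does not talk to an engine, so it says nothing about authentication or about how the lease behaves at run time.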
- I am thinking about the situation where the hypervisor containing one of the VMs crashes: what will be the behavior of the fence agent on the application?
Not sure what you mean by "crashes". If the hypervisor lost power, an HA VM with a VM lease will be started on another hypervisor. The guest agent on the VM will not be able to do anything since it is not running :-)
- if the crashed VM is not HA, it will not start on another hypervisor, so the fence agent will try to fence a VM that does not exist anymore, and it will get stuck.
- if the crashed VM is HA, it will be started on another hypervisor, but what will happen with the fence agent? I think that one VM will fence the other one, and the application will still be unreachable for a longer period.
Not clear which fence agent you are talking about.
- What about the shared storage? We will use a shared disk on oVirt, which does not support snapshots.
What is the question?
- What are the things to avoid?
I think in general, don't try to have two mechanisms that try to do the same thing. Either use your HA solution or the oVirt HA solution, but not both at the same time.
Nir

On Thu, May 24, 2018 at 4:28 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, May 24, 2018 at 2:20 PM wodel youchi <wodel.youchi@gmail.com> wrote:
[snip]
We are migrating a physical highly available application to oVirt. The HA platform uses pacemaker; it contains two nodes and shared storage, and fencing (stonith) is configured to use iLO.
I know that oVirt offers HA for VMs, but this HA is not application-aware: if a service crashes on the VM, it will not be detected. Fencing could be achieved with the rhev fence agent.
My questions are about the best way to migrate this platform to oVirt.
- Is it a good idea to make both VMs (formerly nodes) Highly-Available VMs? Or maybe pin each one of them to a particular hypervisor and/or use VM-Affinity?
If the VM should be highly available, you should not pin them to any host. Pinning them will make sure the vm will *not* be available when the host is down :-)
I think the OP means an alternate way to provide HA, more for a service inside a VM than for the VM itself. E.g. to target the scenario in which you have one or many services on a VM and only one of them (or only the service but not the OS itself) fails for some reason.
So the question is about building a virtual cluster (e.g. with the RHCS cluster software in RHEL/CentOS 6, or with Pacemaker in RHEL/CentOS 7) that provides HA for the services you configure on it.
The 2 (or more) VMs composing the virtual cluster could have one vNIC on the production LAN providing services and another vNIC for the usual intra-cluster virtual LAN, if needed by the cluster software.
BTW: vSphere targeted a similar kind of need with the App HA feature in 5.5:
http://www.virtualizationsoftware.com/vsphere-55-application-ha-advanced-application-monitoring/
I think it was not very successful and I don't know if it is still present in 6.x.
So you probably want to use HA VM - with a VM lease.
This is an alternate / different approach.
- I am thinking about the situation where the hypervisor containing one of the VMs crashes: what will be the behavior of the fence agent on the application?
Not sure what you mean by "crashes".
Panic or loss of power, as in the example you gave for the HA VM.
The guest agent on the VM will not be able to do anything since it is not running :-)
It means that a cluster fence agent on the surviving VM will take care of monitoring the state of the other virtual node of the cluster and take action accordingly.
In the case of RHEL/CentOS we are talking about the fence-agents-rhevm rpm package:
DESCRIPTION
fence_rhevm is an I/O Fencing agent which can be used with RHEV-M REST API to fence virtual machines.
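To illustrate what "fencing a VM through the REST API" might amount to, here is a minimal Python sketch. It is not the fence_rhevm source: the engine address is hypothetical, the `/ovirt-engine/api` root and the `start`/`stop` VM actions are my assumptions about the engine's REST API, and the real agent also checks the VM status between steps.

```python
from typing import List
from urllib.parse import quote, urljoin

# Hypothetical engine address; /ovirt-engine/api is assumed as the API root.
API_BASE = "https://engine.example.com/ovirt-engine/api/"

# Assumed mapping from a fence request to VM actions in the REST API.
FENCE_TO_VM_ACTIONS = {
    "on": ["start"],
    "off": ["stop"],
    "reboot": ["stop", "start"],
}


def vm_search_url(vm_name: str) -> str:
    """URL that looks up a VM by name so its id can be read from the reply."""
    return urljoin(API_BASE, "vms?search=" + quote(f"name={vm_name}"))


def fence_requests(vm_id: str, fence_action: str) -> List[str]:
    """Ordered list of URLs to POST to for the given fence action."""
    verbs = FENCE_TO_VM_ACTIONS[fence_action]
    return [urljoin(API_BASE, f"vms/{vm_id}/{verb}") for verb in verbs]


if __name__ == "__main__":
    print("GET ", vm_search_url("vm1"))
    for url in fence_requests("123", "reboot"):
        print("POST", url)
```

The point of the sketch is only that the surviving node asks the engine, not the hypervisor, to power the peer VM off or on.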
- if the crashed VM is not HA, it will not start on another hypervisor, so the fence agent will try to fence a VM that does not exist anymore, and it will get stuck.
If a non-HA VM1 crashes and no affinity is defined for it, it can start anywhere, on any host, so problems on the host where it was running before will not matter.
So the rhevm fence agent on VM2 is able to get the VM1 status from the oVirt Engine. After a certain timeout that status should be "down" (using the "mandatory" power management features of oVirt defined on the crashed host).
As soon as it gets the down status it can fail over the service to itself and power VM1 on again.
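The sequence described above can be sketched in Python as follows. This is a simulation under stated assumptions, not Pacemaker's or fence_rhevm's actual logic: `get_status`, `take_over` and `power_on_peer` are hypothetical callables standing in for the engine query, the service failover and the VM start.

```python
import time
from typing import Callable


def wait_until_down(get_status: Callable[[], str], timeout_s: float,
                    poll_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the peer's status until it is "down" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "down":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


def handle_peer_failure(get_status: Callable[[], str],
                        take_over: Callable[[], None],
                        power_on_peer: Callable[[], None],
                        timeout_s: float, poll_s: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Fail over only after the engine confirms that the peer is down."""
    if not wait_until_down(get_status, timeout_s, poll_s, sleep):
        # Peer state unknown: taking over now could mean two active nodes.
        return "fencing-failed"
    take_over()        # move the service to the surviving node
    power_on_peer()    # then bring the fenced VM back
    return "failed-over"


if __name__ == "__main__":
    statuses = iter(["unknown", "unknown", "down"])
    print(handle_peer_failure(lambda: next(statuses),
                              lambda: print("service moved to VM2"),
                              lambda: print("VM1 powered on"),
                              timeout_s=60, poll_s=0))
```

The ordering is the important part: the service is moved only once the "down" status is seen, which is why the timeout on the crashed host's power management directly affects how long the application is unreachable.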
- if the crashed VM is HA, it will be started on another hypervisor,
but what will happen with the fence agent? I think that one VM will fence the other one, and the application will still be unreachable for a longer period.
I would not configure a virtual cluster on HA VMs, because actions from the virtual cluster could interfere with actions from oVirt.
Not clear which fence agent you are talking about.
See above: fence_rhevm
- What about the shared storage? We will use a shared disk on oVirt, which does not support snapshots.
What is the question?
In the past I had to configure a virtual CentOS 6 cluster because I needed to replicate a problem I had in a physical production cluster and to verify whether some actions/updates would have solved it.
I had no spare hardware left to configure, so using the poor-man's method (dd + reconfigure) I got the cluster up and running with two twin nodes identical to the physical ones.
I also opened this bugzilla to backport the el7 package to el6:
https://bugzilla.redhat.com/show_bug.cgi?id=1446474
The intra-cluster network was put on OVN, BTW.
But honestly I doubt I will use a virtual cluster software stack to provide high availability to production services inside a VM. Too many inter-relations.
- What are the things to avoid?
I think in general, don't try to have two mechanisms that try to do the same thing.
Either use your HA solution or the oVirt HA solution, but not both at the same time.
Nir
Agreed.
HIH,
Gianluca