Fenced hosts VM's never migrate

I have a 3-host cluster set up with HA and fencing enabled, and it appears to be working properly. The power management stop, start, and restart actions all work, as does host shutdown/restart following a simulated crash. When the network is pulled, a proxy is chosen; it powers off the downed host and then restarts it. Since the network is still down, the following repeats in events: "Host kvm01 is not responding. It will stay in Connecting state for a grace period of 162 seconds and after that an attempt to fence the host will be issued."
The real problem here is that the VMs on the failed host never migrate to a new host and remain down until the network is reconnected.
We have tested this with back-end storage on Gluster and NFS with the same result. This is on oVirt Engine Version: 3.5.1.1-1.el6. Hosts are on CentOS 7 and the Engine is standalone on CentOS 6.6.

----- Original Message -----
From: "Tim Macy" <macytd@gmail.com> To: users@ovirt.org Sent: Tuesday, February 10, 2015 6:55:31 PM Subject: [ovirt-users] Fenced hosts VM's never migrate
I have a 3 host cluster setup with HA enabled and fencing enabled and it appears to be working properly. Executing power management stop, start, and restart work along with host shutdown/restart following a simulated crash. When network is pulled a proxy is chosen and it powers off the downed host, and then restarts it. Since the network is still down it repeats the following in events: "Host kvm01 is not responding. It will stay in Connecting state for a grace period of 162 seconds and after that an attempt to fence the host will be issued."
The real problem here is that the VM's on the host that has failed never migrate to a new host and remain down until the network is reconnected.
Once the host is powered off by the proxy, HA VMs will be started (not migrated) on another host, if there are resources for them. If you have HA VMs that are not started although another host is available for them, it might be a bug. Can you please attach engine.log from the time of the failure?
We have tested this with back-end storage on gluster and NFS with the same result. This is on oVirt Engine Version: 3.5.1.1-1.el6. Hosts are on CentOS 7 and the Engine is standalone on CentOS 6.6.
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

On 11/02/15, 4:34 PM, Omer Frenkel wrote:
once the host is powered off by the proxy, HA vms will be started (not migrated) on other host, if there are resources for it.. if you have HA vms that are not started although there is another host available for it, it might be a bug, can you please attach engine.log from the time of the failure?
I've had the exact same problem during testing yesterday. The HA VMs never restarted on the other available hosts. The only difference is that we're using iSCSI storage backend.
oVirt Engine Version: 3.5.1.1-1.el6 (hosted engine) Host: CentOS 6.6
Engine logs are attached.
Thanks, Siddharth

Hi,
I looked at the logs, and the reason why host vmh-02 wasn't restarted is that PM restart failed using both other hosts (vmh-01 and vmh-03) with this error:
Test Failed, [Powering off machine @ IPMI:10.9.1.11...Failed
So we couldn't restart HA VMs on another host, because we were not sure that host vmh-02 was really down.
I also noticed that even getting the PM status of host vmh-02 is problematic. The fence agent returned this message:
Power Management test failed for Host vmh-02.Done
but it also reported a successful operation. This looks very suspicious!
Could you please execute the following command from vmh-01 or vmh-03 to test the PM agent on vmh-02:
fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o status -v -P
where <IP>, <USER> and <PASSWORD> contain values valid for vmh-02?
Could you please also send us vdsm.log from machines vmh-01 and vmh-03 so we can investigate the details of the fence agent execution failures?
Thanks a lot
Martin Perina
----- Original Message -----
From: "Siddharth Patil" <siddharth@patil.co.uk> To: users@ovirt.org Sent: Wednesday, February 11, 2015 5:17:28 PM Subject: Re: [ovirt-users] Fenced hosts VM's never migrate
I've had the exact same problem during testing yesterday. The HA VMs never restarted on the other available hosts. The only difference is that we're using iSCSI storage backend.
oVirt Engine Version: 3.5.1.1-1.el6 (hosted engine) Host: CentOS 6.6
Engine logs are attached.
Thanks, Siddharth

On 11/02/15, 6:46 PM, Martin Perina wrote:
Hi,
I looked at the logs and the reason why host vmh-02 wasn't restarted is that PM restart failed using both other hosts (vmh-01 and vmh-03) with error:
Test Failed, [Powering off machine @ IPMI:10.9.1.11...Failed
So we couldn't restart HA VMs on another hosts, because we were not sure that host vmh-02 is really down.
I also noticed that even getting PM status of host vmh-02 is problematic, fence agent returned this message:
Power Management test failed for Host vmh-02.Done
but it also returned successful operation. This looks very very suspicious!
Could this be because we turned off power to vmh-02 completely? We are testing to make sure that the HA VMs will be restarted on another host even if the host suffers complete hardware failure.
Could you please execute following command from vmh-01 or vmh-03 to test PM agent on vmh-02
fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o status -v -P
where <IP>, <USER> and <PASSWORD> contains values valid for vmh-02?
Here's the result (from both):
Getting status of IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'... Chassis power = On Done
Of course, the server is now up and running so this is expected.
Could you please send us also vdsm.log from machines vmh-01 and vmh-03 so we could investigate details of fence agents execution failures?
See attached.
Regards, Siddharth
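
As a rough illustration of reading the status output quoted in this message, here is a minimal Python sketch that pulls the power state out of the agent's verbose text. `parse_chassis_power` is a hypothetical helper written for this note; it is not part of fence-agents, vdsm, or oVirt, and it only assumes the "Chassis power = On" / "Chassis power = Off" wording that appears in this thread.

```python
import re
from typing import Optional

# Matches the line printed by ipmitool as quoted in this thread,
# for example "Chassis power = On" or "Chassis power = Off".
_POWER_RE = re.compile(r"Chassis power\s*=\s*(On|Off)\b", re.IGNORECASE)


def parse_chassis_power(agent_output: str) -> Optional[str]:
    """Return "on" or "off", or None when no power state is present."""
    match = _POWER_RE.search(agent_output)
    return match.group(1).lower() if match else None


# Output copied from the message above.
sample = (
    "Getting status of IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus "
    "-H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'... "
    "Chassis power = On Done"
)
print(parse_chassis_power(sample))
```

A result of None (for instance on the "Powering off machine ... Failed" message) means the text alone says nothing about whether the host is up or down.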

----- Original Message -----
From: "Siddharth Patil" <siddharth@patil.co.uk> To: users@ovirt.org Sent: Wednesday, February 11, 2015 6:25:02 PM Subject: Re: [ovirt-users] Fenced hosts VM's never migrate
On 11/02/15, 6:46 PM, Martin Perina wrote:
Hi,
I looked at the logs and the reason why host vmh-02 wasn't restarted is that PM restart failed using both other hosts (vmh-01 and vmh-03) with error:
Test Failed, [Powering off machine @ IPMI:10.9.1.11...Failed
So we couldn't restart HA VMs on another hosts, because we were not sure that host vmh-02 is really down.
I also noticed that even getting PM status of host vmh-02 is problematic, fence agent returned this message:
Power Management test failed for Host vmh-02.Done
but it also returned successful operation. This looks very very suspicious!
Could this be because we turned off power to vmh-02 completely? We are testing to make sure that the HA VMs will be restarted on another host even if the host suffers complete hardware failure.
The IPMI interface should work even if the server is turned off. But of course it needs power, so if you lose power to it, it cannot work. For this case you would need another (secondary) fencing agent (for example APC) which controls the power for the server and its IPMI interface.
Could you please execute following command from vmh-01 or vmh-03 to test PM agent on vmh-02
fence_ipmilan -a <IP> -l <USER> -p <PASSWORD> -o status -v -P
where <IP>, <USER> and <PASSWORD> contains values valid for vmh-02?
Here's the result (from both):
Getting status of IPMI:10.9.1.11...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.9.1.11' -U 'ADMIN' -P '[set]' -v chassis power status'... Chassis power = On Done
Of course, the server is now up and running so this is expected.
Yes, this is the correct result. And you should get the result Chassis power = Off when the server is turned off.
Could you please send us also vdsm.log from machines vmh-01 and vmh-03 so we could investigate details of fence agents execution failures?
See attached.
I looked at the vdsm logs and I wasn't able to find any additional error details. But it looks like a bug that the fence agent status command returned error code 1 and in the engine we reported this as success. So could you please file a new bug with the above logs attached, the reproducing steps, and also the following versions:
Host where engine is running: ovirt-engine
Host vmh-01: vdsm fence-agents
Thanks a lot
Martin Perina
Regards, Siddharth
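
The two points made in this message (an error from the fence agent must not be reported as success, and HA VMs can only be restarted elsewhere once the host is known to be down) can be sketched in Python as below. This is not the oVirt engine's code: `HostPower`, `classify_status` and `may_restart_ha_vms` are hypothetical names, and the only assumption about the agent is the one stated in the thread, that exit code 1 signals an error.

```python
from enum import Enum
from typing import Optional


class HostPower(Enum):
    ON = "on"
    OFF = "off"
    UNKNOWN = "unknown"


def classify_status(exit_code: int, power_state: Optional[str]) -> HostPower:
    """Combine the agent's exit code with the parsed power state.

    power_state is "on", "off", or None when nothing could be parsed.
    """
    # Exit code 1 is the error case described in the thread: never success.
    if exit_code == 1:
        return HostPower.UNKNOWN
    if power_state == "on":
        return HostPower.ON
    if power_state == "off":
        return HostPower.OFF
    return HostPower.UNKNOWN


def may_restart_ha_vms(host_power: HostPower) -> bool:
    """Allow a restart on another host only when the failed host is confirmed off."""
    return host_power is HostPower.OFF


# A host whose IPMI interface has lost power cannot be confirmed off,
# so its HA VMs stay where they are.
print(may_restart_ha_vms(classify_status(1, None)))
```

Treating every unconfirmed state as "do not restart" is the conservative choice: starting a VM on a second host while the first may still be running it risks corrupting its disks.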

On 11/02/15, 8:03 PM, Martin Perina wrote:
So could you please file a new bug with above logs attached, reproducing steps and also following versions:
Done. https://bugzilla.redhat.com/show_bug.cgi?id=1191709 Thanks, Siddharth
participants (4): Martin Perina, Omer Frenkel, Siddharth Patil, Tim Macy