[ovirt-users] oVirt 3.5 test day: retrying failed fencing

Martin Mucha mmucha at redhat.com
Wed Jul 30 16:57:17 UTC 2014


Hi, 

I've tested: "Bug 1090511 [RFE] Improve fencing robustness by retrying failed attempts".
Spoiler alert: the tested feature worked, but the fencing itself was not successful due to bug https://bugzilla.redhat.com/1124141

---

How to set up the environment for testing:
- 3 hosts are required, at least two of them with PM (power management) enabled.
- 2 hosts (A, B), both with PM enabled, should be in one cluster, and the remaining one (C) in another cluster. The reason is that the search for a fencing proxy is done in the same cluster first; hosts outside of this cluster are considered only if no host is available there. This separation is needed to make sure that the right (non-working) fencing proxy is selected first.

notation: 
host A ~ defective host to be fenced
host B ~ first selected fencing proxy, which will fail to fence host A.
host C ~ second selected fencing proxy, which should succeed in fencing host A.
A and B are in same cluster.
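
The proxy search order described above can be sketched roughly as follows. This is only an illustration of the preference order used in this test (same cluster first, then other clusters), not the engine's code; the function name order_proxies, the "host cluster status" input format and the "up" status are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the proxy preference order; not the engine's implementation.
# Input on stdin: one line per host, "<host> <cluster> <status>" (format is an assumption).

# order_proxies TARGET_HOST TARGET_CLUSTER
# Prints candidate proxies: hosts from the target's cluster first, then all others.
order_proxies() {
    local target="$1" cluster="$2"
    awk -v t="$target" -v c="$cluster" '
        $1 == t || $3 != "up" { next }            # skip the fenced host and hosts that are not up
        $2 == c { same = same $1 "\n"; next }     # same-cluster candidates
        { other = other $1 "\n" }                 # candidates from other clusters
        END { printf "%s%s", same, other }
    '
}

# The layout from this mail: A and B share a cluster, C is elsewhere.
hosts='A cluster1 down
B cluster1 up
C cluster2 up'

printf '%s\n' "$hosts" | order_proxies A cluster1
```

With this layout the candidates come out as B first and C second, which is the order the test relies on.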

process:
1. On host B we alter iptables so that it cannot contact host A and fence it. SSH was blocked to prevent soft fencing and IPMI was blocked to prevent 'hard' fencing.

iptables -A OUTPUT -p udp -d 10.34.63.198 --dport 623 -j DROP
iptables -A OUTPUT -p tcp -d 10.34.63.178 --dport 22 -j DROP

2. On host A, the rule allowing connections to vdsm was removed and vdsmd was restarted, so that all connections need to be reopened. That makes the engine think that the host is down/overloaded.
Removed rule:
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:54321

followed by
systemctl restart vdsmd
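
For repeating the test, both steps and their cleanup can be wrapped in a small script. This is a sketch under assumptions: the function names and the DRY_RUN switch are made up for the example, and it assumes the vdsm ACCEPT rule lives in the INPUT chain (the mail only shows the rule as listed, not its chain). By default it only prints the commands; set DRY_RUN=0 and run as root on the respective host to apply them.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are illustrative, not part of oVirt.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Host B: ACTION is -A (add), -C (check) or -D (delete).
proxy_rules() {
    local action="$1" ipmi_addr="$2" ssh_addr="$3"
    run iptables "$action" OUTPUT -p udp -d "$ipmi_addr" --dport 623 -j DROP
    run iptables "$action" OUTPUT -p tcp -d "$ssh_addr" --dport 22 -j DROP
}

# Host A: drop the vdsm ACCEPT rule (INPUT chain is an assumption) and restart vdsmd.
isolate_vdsm() {
    run iptables -D INPUT -p tcp --dport 54321 -j ACCEPT
    run systemctl restart vdsmd
}

# Host A: put the ACCEPT rule back after the test.
restore_vdsm() {
    run iptables -I INPUT -p tcp --dport 54321 -j ACCEPT
}

proxy_rules -A 10.34.63.198 10.34.63.178   # step 1, on host B
isolate_vdsm                               # step 2, on host A
proxy_rules -D 10.34.63.198 10.34.63.178   # cleanup, on host B
restore_vdsm                               # cleanup, on host A
```

The addresses are the ones from the commands above; proxy_rules -C with the same arguments can be used to check that the rules are in place before triggering the test.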


Result: After the restart of vdsmd, the engine recognised host A as unresponsive and tried to fence it. The first attempt to fence host A was performed by host B and failed as expected; the second attempt was performed by host C and, from the code's perspective, succeeded. The error message [1] was displayed correctly. However, the fence itself was not successful due to bug https://bugzilla.redhat.com/1124141, which causes a java.lang.StackOverflowError. The code related to the tested RFE should be OK, but it will work only after the mentioned bug is fixed.
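
The behaviour under test can be illustrated with a small sketch: try each proxy in turn and report when one fails. This is not the engine's implementation; try_fence is a stand-in that simply makes proxy B fail, and fence_with_retry is a hypothetical name. Only the "trying another proxy" message text is taken from [1].

```shell
#!/usr/bin/env bash
# Hypothetical sketch of retrying a failed fence attempt with another proxy.

# try_fence PROXY TARGET: stand-in for a real fence attempt; proxy B always fails here.
try_fence() {
    [ "$1" != "B" ]
}

# fence_with_retry TARGET PROXY...
# Returns 0 as soon as one proxy succeeds, 1 if all of them fail.
fence_with_retry() {
    local target="$1" proxy
    shift
    while [ "$#" -gt 0 ]; do
        proxy="$1"
        shift
        if try_fence "$proxy" "$target"; then
            echo "Host $target fenced via proxy $proxy"
            return 0
        fi
        # Announce a retry only if there is another proxy left to try.
        if [ "$#" -gt 0 ]; then
            echo "Fencing operation failed with proxy host $proxy, trying another proxy..."
        fi
    done
    echo "No proxy could fence host $target"
    return 1
}

fence_with_retry A B C
```

Called with proxies B and C, the sketch first prints the failure message for B and then reports success via C, mirroring the sequence seen in the test.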

M.

[1]. Fencing operation failed with proxy host <ID>, trying another proxy...


