[ovirt-users] Standard operating procedure for a node failure on HCI required

2 Apr 2021

oVirt may have started as a vSphere 'look-alike', but it has since graduated to a Nutanix 'clone', at least in terms of marketing.

IMHO that means the 3-node hyperconverged default oVirt setup (2 replicas and 1 arbiter) deserves special love in terms of documenting failure scenarios. 

3-node HCI is supposed to defend you against long-term effects of any single point of failure. There is no protection against the loss of dynamic state/session data, but state-free services should recover or resume: that's what it's all about.

Sadly, what I find missing in the oVirt and Gluster documentation is an SOP (standard operating procedure) to follow on a late-night/early-morning on-call wakeup when one of those three HCI nodes has failed, either dramatically or via a 'brown out', e.g. where only the storage part was actually lost.

My impression is that the oVirt and Gluster teams are barely talking to each other, but in HCI that's fatal.

And I sure can't find those recovery procedures, not even in the commercial RH documents.

So please, either add them or show me where I missed them.

Thomas Hoberg