Standard operating procedure for a node failure on HCI required

oVirt may have started as a vSphere 'look-alike', but it graduated to a Nutanix 'clone', at least in terms of marketing. IMHO that means the default 3-node hyperconverged oVirt setup (2 replicas and 1 arbiter) deserves special love in terms of documenting failure scenarios. 3-node HCI is supposed to defend you against the long-term effects of any single point of failure. There is no protection against the loss of dynamic state/session data, but stateless services should recover or resume: that's what it's all about. Sadly, what I find missing in the oVirt and Gluster documentation is an SOP (standard operating procedure) to follow on a late-night/early-morning on-call wakeup when one of those three HCI nodes has failed, either dramatically or via a 'brown out', e.g. where only the storage part was actually lost. My impression is that the oVirt and Gluster teams are barely talking, but in HCI that's fatal. And I sure can't find those recovery procedures, not even in the commercial RH documents. So please, either add them or show me where I missed them.

Hi Thomas, the difference between paid solutions and oVirt is that the latter is free and "supported" by volunteers and people who believe in open source. If you need documentation that is missing, you are always welcome to write it and share it with the rest of us. Of course, both Red Hat and Oracle provide a paid, subscription-based solution that might suit your needs better. Best Regards, Strahil Nikolov
On Fri, Apr 2, 2021 at 1:15, Thomas Hoberg <thomas@hoberg.net> wrote: oVirt may have started as a vSphere 'look-alike', but it graduated to a Nutanix 'clone', at least in terms of marketing. [...] So please, either add them or show me where I missed them.

Hi Strahil, I have now actually found the matching RHV documentation. The reason I didn't find it before seems to be that it was only added for RHHI 1.8 / oVirt 4.4 and did not exist for RHHI 1.7 / oVirt 4.3: https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrast...
Participants (2): Strahil Nikolov, Thomas Hoberg