We had an issue with resources.ovirt.org becoming unresponsive to all network access on Tuesday, July 17th.

The issue was tracked in:
https://ovirt-jira.atlassian.net/browse/OVIRT-2338

While attempting to analyse and fix this issue we came across several other issues that have to do with our ability to gain access to the production VMs. Those issues have been documented and collected in the following epic:
https://ovirt-jira.atlassian.net/browse/OVIRT-2337

We need to resolve these issues ASAP to ensure that the next time such a production outage occurs, we don't end up struggling for many minutes just to get basic access.

In the end we reached the conclusion that the server itself was stuck and needed a hard reset. This raises the question of why we didn't have a watchdog device configured on the VM to automatically detect and recover from such issues.
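
For the record, attaching a watchdog should be fairly easy to automate. Below is a minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials and VM search string are placeholders, and the exact watchdog service calls should be double-checked against the SDK docs before we rely on this:

    # Minimal sketch: attach an i6300esb watchdog that hard-resets the VM
    # when the guest stops feeding it. Engine URL, credentials and VM name
    # are placeholders -- adjust for the real environment.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        ca_file='ca.pem',
    )
    try:
        vms_service = connection.system_service().vms_service()
        # Find the VM we want to protect (placeholder search string)
        vm = vms_service.list(search='name=resources')[0]
        watchdogs_service = vms_service.vm_service(vm.id).watchdogs_service()
        # Add a watchdog card that resets the VM if it stops responding
        watchdogs_service.add(
            types.Watchdog(
                model=types.WatchdogModel.I6300ESB,
                action=types.WatchdogAction.RESET,
            ),
        )
    finally:
        connection.close()

Note that the watchdog card alone is not enough: the guest also needs a watchdog daemon running (e.g. the 'watchdog' package on EL hosts) to feed the device, otherwise a hung OS won't actually trigger the reset.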

This experience led us to the understanding that resources.ovirt.org currently fulfills far too many critical roles for us to responsibly keep it as a single, simple VM. I've created the following Epic to track and discuss work on improving the infrastructure behind resources.ovirt.org to make it less fragile and more reliable:
https://ovirt-jira.atlassian.net/browse/OVIRT-2344


--
Barak Korren
RHV DevOps team, RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted