Yesterday we had a catastrophic hardware failure with one of our nodes
using ceph and the experimental cinderlib integration.
Unfortunately the ovirt cluster did not recover from the situation well,
and it took some manual intervention to resolve. I thought I'd share what
happened and how we resolved it, in case there are any best practices to
share or bugs worth filing to help others in a similar situation. We are
early in our use of ovirt, so it's quite possible we have things
incorrectly configured.
Our setup: we have two nodes, the hosted engine on iSCSI, and about 40 vms,
all using managed block storage and mounting their rbd volumes directly. I
hadn't configured power management (perhaps this is the fundamental problem).
Yesterday a hardware fault caused one of the nodes to crash and stay
down awaiting user input at the POST screen, taking 20 vms with it.
The hosted engine was fortunately on the 'good' node and detected that
the node had become unresponsive, but noted 'Host cannot be fenced
automatically because power management for the host is disabled'.
At this point, knowing that one node was dead, I wanted to bring up the
failed vms on the good node. However, the vms were stuck in an unknown
state and I couldn't perform any operations on them. It wasn't clear to
me what the best course of action was at that point. Is there a way to
mark the node as failed?
In my urgency to try to resolve the situation I managed to get the
failed node started back up. Shortly after it came up, the engine
detected that all the vms were down; I put the failed host into
maintenance mode and tried to start the failed vms.
Unfortunately the failed vms did not start up cleanly - it turned out
that they still had stale rbd locks held by the failed node, which
prevented writes. To finally get the vms to start, I manually went
through every vm's managed block volume, found its id, found the lock
and removed it:
rbd lock list rbd/volume-{id}
rbd lock remove rbd/volume-{id} 'auto {lockid}' {lockername}
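In case it's useful to anyone else, this is roughly how I'd script that
next time instead of doing it by hand (an untested sketch: it assumes the
volume ids are collected in a file called volume-ids.txt, that the images
live in the 'rbd' pool and are named volume-{id} as above, and that
'rbd lock list' keeps its usual plain-text columns of locker, lock id
and address):

for id in $(cat volume-ids.txt); do
  img="rbd/volume-${id}"
  # skip the two header lines; emit "<locker> auto <cookie>" per lock
  rbd lock list "${img}" | awk 'NR>2 {print $1, $2, $3}' | \
  while read -r locker lockid; do
    # rbd lock remove takes the image, the lock id, then the locker
    rbd lock remove "${img}" "${lockid}" "${locker}"
  done
done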
Some overall thoughts I had:
* I'm not sure what the best course of action is to notify the engine
about a catastrophic hardware failure. If power management had been
configured, I suppose it would have powered the host off and marked all
of its vms as down?
* Would ovirt have been able to deal with clearing the rbd locks, or
did I miss a trick somewhere to resolve this situation without manually
going through each device and clearing the lock?
* Might it be possible for ovirt to detect when the rbd images are
still locked for writing and refuse to launch the vm?
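In the meantime, the manual pre-flight check I'm doing before starting a
recovered vm is simply to inspect the image first (a sketch; volume-{id}
is the naming our managed block storage setup produces, adjust the pool
name to suit):

rbd status rbd/volume-{id}      # lists any clients still watching the image
rbd lock list rbd/volume-{id}   # lists any exclusive locks still held (as above)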
regards,
Dan