>* Would ovirt have been able to deal with clearing the rbd locks, or
> did I miss a trick somewhere to resolve this situation without
> manually going through each device and clearing the lock?

Unfortunately there is no trick on ovirt's side

>* Might it be possible for ovirt to detect when the rbd images are
> locked for writing and prevent launching?

Since the rbd paths are provided via cinderlib, a higher-level
interface, ovirt has no knowledge of implementation details like this.

On Thu, Sep 12, 2019 at 11:27 PM Dan Poltawski <dan.poltawski@tnp.net.uk> wrote:
Yesterday we had a catastrophic hardware failure with one of our nodes
using ceph and the experimental cinderlib integration.

Unfortunately the ovirt cluster did not recover from the situation well
and it took some manual intervention to resolve. I thought I'd share
what happened and how we resolved it in case there is any best practice
to share, or bugs worth creating, to help others in a similar
situation. We are early in our use of ovirt, so it's quite possible we
have things incorrectly configured.

Our setup: We have two nodes, hosted engine on iSCSI, and about 40 vms
all using managed block storage mounting the rbd volumes directly. I
hadn't configured power management (perhaps this is the fundamental
problem).

Yesterday a hardware fault caused one of the nodes to crash and stay
down awaiting user input in POST screens, taking 20 vms with it.

The hosted engine was fortunately on the 'good' node and detected that
the node had become unresponsive, but noted 'Host cannot be fenced
automatically because power management for the host is disabled.'

At this point, knowing that one node was dead, I wanted to bring up the
failed vms on the good node. However, the vms were appearing in an
unknown state and I couldn't do any operations on them. It wasn't clear
to me what the best course of action would be. I am not sure if there
is a way to mark the node as failed?

In my urgency to try and resolve the situation I managed to get the
failed node started back up. Shortly after it came up, the engine
detected that all the vms were down, so I put the failed host into
maintenance mode and tried to start the failed vms.

Unfortunately the failed vms did not start up cleanly - it turned out
that they still had rbd locks held by the failed node, preventing
writes.

To finally get the vms to start, I then manually went through every
vm's managed block volume, found its id, then found the lock and
removed it:
rbd lock list rbd/volume-{id}
rbd lock remove rbd/volume-{id} 'auto {lockid}' {lockername}
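
In case it is useful to anyone hitting the same problem, here is a
rough sketch of how that could be scripted (assuming the volume-{id}
naming above and that the affected volume ids are already known); it
only lists the locks and leaves the removal itself as a manual,
per-lock step:

# List any remaining lockers on each affected volume. The removal line
# is left commented out: lockid/lockername must be copied from the
# "rbd lock list" output for each lock before running it.
for id in "${failed_volume_ids[@]}"; do
    echo "== rbd/volume-${id} =="
    rbd lock list "rbd/volume-${id}"
    # rbd lock remove "rbd/volume-${id}" "auto {lockid}" {lockername}
done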

Some overall thoughts I had:
* I'm not sure what the best course of action is to notify the engine
about a catastrophic hardware failure? If power management was
configured, I suppose it would've removed the power and marked them all
down?

* Would ovirt have been able to deal with clearing the rbd locks, or
did I miss a trick somewhere to resolve this situation without manually
going through each device and clearing the lock?

* Might it be possible for ovirt to detect when the rbd images are
locked for writing and prevent launching? (a rough manual check is
sketched below)
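
As a stopgap on our side, a manual pre-flight check along these lines
might help (just a sketch, assuming the same volume-{id} naming and
that jq is available to test rbd's --format json output for lockers):

# Refuse to start the vm if the backing image still has any locks held.
if rbd lock list "rbd/volume-${id}" --format json \
        | jq -e 'length > 0' >/dev/null; then
    echo "rbd/volume-${id} is still locked, not starting the vm" >&2
    exit 1
fi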

regards,

Dan
