Just an update for documentation purposes.
I tried physically rebooting the faulty node after placing the cluster in global
maintenance mode, since I couldn't put the node into local maintenance. It booted up OK,
but after a few minutes the following messages started appearing on the console:
"blk_update_request: I/O error, dev dm-1, sector 0
blk_update_request: I/O error, dev dm-1, sector 2048
blk_update_request: I/O error, dev dm-1, sector 2099200
EXT4-fs error (device dm-7): ext4_find_entry:1318:inode #6294136: comm python: reading
directory lblock 0
EXT4-fs (dm-7): previous I/O error to superblock detected
Buffer I/O error on dev dm-7, logical block 0, lost sync page write
device-mapper: thin: process_cell: dm_thin_find_block() failed: error= -5
blk_update_request: I/O error, dev dm-1, sector 1051168
Aborting journal on device dm-2-0
blk_update_request: I/O error, dev dm-1, sector 1050624
JBD2: Error -5 detected when updating journal superblock for dm-2-0
"
From what I can tell the filesystem is corrupted, so I'm now in the process of either
repairing it with fsck or replacing the node with a new one. (FYI, the node never changed
status; it stayed NonResponsive.)
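If I go the fsck route, the plan is to boot the node into a rescue environment and run the
check against the unmounted filesystem, roughly like this (just a sketch; /dev/dm-7 is the
device from the logs above, and the real LV path should be confirmed first):

  # make sure the filesystem is not mounted before repairing it
  umount /dev/dm-7 2>/dev/null
  # force a full check and answer yes to the repair prompts
  fsck.ext4 -f -y /dev/dm-7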
For the VM that was stuck on the node, the solution I found is described here:
https://serverfault.com/questions/996649/how-to-confirm-reboot-unresponsi...
It was to put the cluster into global maintenance mode, shut down the engine VM, and then
start it again. That worked perfectly, and I was able to start the VM on another node. A
rough sketch of the commands is below.
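For reference, the sequence boils down to something like this, run from one of the hosts
with the hosted-engine tooling (just a sketch of what I did; check --vm-status between
steps rather than taking the timing for granted):

  # put the cluster into global maintenance so the HA agents don't interfere
  hosted-engine --set-maintenance --mode=global
  # shut down the engine VM, wait for --vm-status to report it down, then start it again
  hosted-engine --vm-shutdown
  hosted-engine --vm-status
  hosted-engine --vm-start
  # once everything looks healthy again, leave global maintenance
  hosted-engine --set-maintenance --mode=none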