[ovirt-users] NonResponsive Host and unknown VMs

6 May 2019

Hello,

Context:
Probably after a failed snapshot deletion during a backup (it had been
some time since I last hit this problem), one of my hosts went
NonResponsive, and with it all the VMs that were stored or running on it.
Instead of removing the problematic snapshot (which usually does the
trick), I rebooted the host (without putting it into maintenance first,
because that option wasn't available).
The host log seems to show nothing wrong. The engine log mostly shows
the following error, without further information:
"ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-10) [] EVENT_ID:
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM xxx command Get Host
Capabilities failed: Message timeout which can be caused by
communication issues"
One last thing: gluster seems to be working well, but the command
"gluster volume status" times out (nothing in the logs looks wrong).
Of course, the host answers pings from the engine, and the engine
answers pings from the host.
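
Since ping only proves ICMP reachability, a TCP check from the engine
against the VDSM port might help narrow down the "Message timeout". Below
is a minimal sketch, assuming VDSM listens on its default port 54321 (not
verified on this host); the function name tcp_port_open is purely
illustrative and not part of oVirt.

```python
import socket

# Assumption: VDSM listens on its default TCP port; adjust if the setup differs.
VDSM_PORT = 54321


def tcp_port_open(host: str, port: int = VDSM_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on the engine, tcp_port_open("<host address>") returning False while
ping succeeds would point to a firewall or a VDSM service problem rather
than general network loss. A True result only shows that the port accepts
connections, not that VDSM answers requests in time.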

You'll find more information about the setup at the bottom of this mail.

Here are a few questions, if someone can help:
- Should I open a bug for this? It looks a lot like
https://bugzilla.redhat.com/show_bug.cgi?id=1404082, but I can't run all
the tests described in that bug.
- Should I be worried about the gluster timeout (the engine is stored on
gluster)? Should I move the unresponsive host's gluster brick to another
host and remove the unresponsive host from the pool to fix that?
- I guess that fixing the unresponsiveness of the host will also fix the
VMs (I hope that one is an easy one!).
- Once I have found why the host is unresponsive, if I can't fix it (for
example, if I have to reinstall it), how can I remove the host and the
affected VMs from the cluster? Nearly every action is unavailable on
them, including maintenance for both storage and host.

Any help will be greatly appreciated, thank you.

Information about the setup:
oVirt 4.2.8.2
Gluster was set up on the systems before oVirt was installed (but it
works well this way).
Most VM storage is on NFS mounts on the hosts, not on gluster; the
engine is the exception.
The unresponsive host holds the arbiter brick of the gluster volumes.
I can wipe the unresponsive host, but this is a production cluster, so
I can't shut everything down :)

----
Best regards,
Alexis Grillon

Pôle Humanités Numériques, Outils, Méthodes et Analyse de Données.
Maison Européenne des Sciences de l'Homme et de la Société
MESHS - Lille Nord de France / CNRS
tel. +33 (0)3 20 12 58 57 | alexis.grillon@meshs.fr
www.meshs.fr | 2, rue des Canonniers 59000 Lille
GPG fingerprint AC37 4C4B 6308 975B 77D4 772F 214F 1E97 6C08 CD11