[Users] Host Non-Operational from sanlock and VM fails to migrate

I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup for production use on our HPC system for managing our VM infrastructure. Currently I'm trying to utilize our DDR InfiniBand fabric for the storage domains in oVirt using NFS over RDMA. I've noticed some unstable behavior and it seems in every case to begin with sanlock.
The ovirt web admin interface shows the following message as first sign of trouble on 2014-Feb-03 07:45.
"Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error during communication with the Host.).".
The single VM I had running is stuck in the "Migrating From" state. virsh shows the VM paused on the crashed host and the one it attempted to migrate to.
Right now I have a few concerns.
1) The cause of the sanlock (or other instability) and if it's related to a bug or an issue using NFSoRDMA.
2) Why the VM failed to migrate if the second host had no issues. If the first host is down should the VM be considered offline and booted on the second host after first is fenced?
Attached are logs from the failed host (vm01) and the healthy host (vm02) as well as engine. The failed host's /var/log/message is also attached (vm01_message.log).
Thanks - Trey

On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup for production use on our HPC system for managing our VM infrastructure. Currently I'm trying to utilize our DDR InfiniBand fabric for the storage domains in oVirt using NFS over RDMA. I've noticed some unstable behavior and it seems in every case to begin with sanlock.
The ovirt web admin interface shows the following message as first sign of trouble on 2014-Feb-03 07:45.
"Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error during communication with the Host.).".
The single VM I had running is stuck in the "Migrating From" state. virsh shows the VM paused on the crashed host and the one it attempted to migrate to.
Right now I have a few concerns.
1) The cause of the sanlock (or other instability) and if it's related to a bug or an issue using NFSoRDMA. 2) Why the VM failed to migrate if the second host had no issues. If the first host is down should the VM be considered offline and booted on the second host after first is fenced?
Attached are logs from the failed host (vm01) and the healthy host (vm02) as well as engine. The failed host's /var/log/message is also attached (vm01_message.log).
Thanks - Trey
was this resolved?

No, in fact I just had the issue arise again after trying to figure out what about my setup causes this crash. So far it only seems to occur if both nodes are running NFS over RDMA, but I'm unsure if it's VM traffic or the host being SPM that causes it to misbehave.

vm02 was running a single VM and was SPM. The crash was on vm02: "Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm02, Error: Network error during communication with the Host).". SPM successfully switched to vm01 but the VM is stuck in migration and unresponsive. Both engine and nodes are using oVirt 3.3.3.

vm01 and vm02 both have the following in vdsm.conf:

[addresses]
management_port = 54321

[vars]
ssl = true

[irs]
nfs_mount_options = rdma,port=20049

These are the oVirt NFS mount lines in /proc/mounts for each:

vm01:
192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0

vm02:
192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0

The NFS server had these 2 log entries in /var/log/messages around the time vm02 went Non-operational:

Feb 9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882014683400
Feb 9 17:28:21 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882025bf1400

Attached is tar of the logs from vm01, vm02 and the engine server. vm01 & vm02 folders contain files from '/var/log/messages /var/log/sanlock.log /var/log/vdsm/*.log'; engine from '/var/log/messages /var/log/ovirt-engine/*.log'.

Thanks - Trey

On Sun, Feb 9, 2014 at 4:15 PM, Itamar Heim <iheim@redhat.com> wrote:
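A quick way to check that the nfs_mount_options above actually take effect on each host is to inspect /proc/mounts. The following is a minimal Python sketch (illustrative only, not part of oVirt or VDSM; it only assumes the /rhev/data-center/mnt prefix shown in the /proc/mounts output above) that lists the storage-domain NFS mounts and whether each one negotiated proto=rdma on port 20049:

#!/usr/bin/env python
# Illustrative check: list oVirt storage-domain NFS mounts from /proc/mounts
# and flag whether each one is using NFS over RDMA on port 20049.
MOUNT_PREFIX = "/rhev/data-center/mnt/"  # where VDSM mounts storage domains

def ovirt_nfs_mounts(path="/proc/mounts"):
    with open(path) as mounts:
        for line in mounts:
            source, target, fstype, options = line.split()[:4]
            if fstype.startswith("nfs") and target.startswith(MOUNT_PREFIX):
                yield source, target, options.split(",")

if __name__ == "__main__":
    for source, target, opts in ovirt_nfs_mounts():
        rdma = "proto=rdma" in opts and "port=20049" in opts
        print("%-45s rdma=%s" % (source, rdma))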
On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup for production use on our HPC system for managing our VM infrastructure. Currently I'm trying to utilize our DDR InfiniBand fabric for the storage domains in oVirt using NFS over RDMA. I've noticed some unstable behavior and it seems in every case to begin with sanlock.
The ovirt web admin interface shows the following message as first sign of trouble on 2014-Feb-03 07:45.
"Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error during communication with the Host.).".
The single VM I had running is stuck in the "Migrating From" state. virsh shows the VM paused on the crashed host and the one it attempted to migrate to.
Right now I have a few concerns.
1) The cause of the sanlock (or other instability) and if it's related to a bug or an issue using NFSoRDMA. 2) Why the VM failed to migrate if the second host had no issues. If the first host is down should the VM be considered offline and booted on the second host after first is fenced?
Attached are logs from the failed host (vm01) and the healthy host (vm02) as well as engine. The failed host's /var/log/message is also attached (vm01_message.log).
Thanks - Trey
was this resolved?

----- Original Message -----
From: "Trey Dockendorf" <treydock@gmail.com> To: "Itamar Heim" <iheim@redhat.com> Cc: "users" <users@ovirt.org> Sent: Monday, February 10, 2014 3:03:05 AM Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
No, in fact I just had the issue arise again after trying to figure out what about my setup causes this crash. So far it only seems to occur if both nodes are running NFS over RDMA, but I'm unsure if it's VM traffic or the host being SPM that causes it to misbehave.
vm02 was running a single VM and was SPM. The crash was on vm02: "Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm02, Error: Network error during communication with the Host).". SPM successfully switched to vm01 but the VM is stuck in migration and unresponsive. Both engine and nodes are using oVirt 3.3.3.
vm01 and vm02 both have the following in vdsm.conf
[addresses]
management_port = 54321

[vars]
ssl = true

[irs]
nfs_mount_options = rdma,port=20049
These are the oVirt NFS mount lines in /proc/mounts for each:
vm01:
192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
vm02:
192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
The NFS server had these 2 log entries in /var/log/messages around the time vm02 went Non-operational.
Feb 9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882014683400
Feb 9 17:28:21 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882025bf1400
This looks like the root cause - failure on the storage server.

This leads to failure in the hosts connected to this storage:

Feb 2 13:37:11 vm01 kernel: rpcrdma: connection to 192.168.211.245:20049 closed (-103)
...
Feb 3 07:44:31 vm01 kernel: ------------[ cut here ]------------
Feb 3 07:44:31 vm01 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
Feb 3 07:44:31 vm01 kernel: Hardware name: H8DMT-IBX
Feb 3 07:44:31 vm01 kernel: Modules linked in: ebt_arp xprtrdma nfs fscache auth_rpcgss nfs_acl bonding ebtable_nat ebtables softdog lockd sunrpc powernow_k8 freq_table mperf 8021q garp bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_amd kvm microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd igb dca i2c_algo_bit ptp pps_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core sg i2c_nforce2 i2c_core ext4 jbd2 mbcache raid1 sd_mod crc_t10dif sata_nv ata_generic pata_acpi pata_amd dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Feb 3 07:44:31 vm01 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.3.1.el6.x86_64 #1
Feb 3 07:44:31 vm01 kernel: Call Trace:
Feb 3 07:44:31 vm01 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
Feb 3 07:44:31 vm01 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8152a4fb>] ? _spin_unlock_bh+0x1b/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffffa044c4f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa044579c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa02782fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
Feb 3 07:44:31 vm01 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffffa027b450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa019a1cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa0138c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa01bd396>] ? igb_poll+0xb66/0x1020 [igb]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa0126057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a9a0>] ? posix_timer_fn+0x0/0xe0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a982>] ? posix_timer_event+0x42/0x60
Feb 3 07:44:31 vm01 kernel: [<ffffffff810a7159>] ? ktime_get+0x69/0xf0
Feb 3 07:44:31 vm01 kernel: [<ffffffffa01267c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffff810e6ed0>] ? handle_IRQ_event+0x60/0x170
Feb 3 07:44:31 vm01 kernel: [<ffffffff810e982e>] ? handle_edge_irq+0xde/0x180
Feb 3 07:44:31 vm01 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
Feb 3 07:44:31 vm01 kernel: [<ffffffff81530fec>] ? do_IRQ+0x6c/0xf0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Feb 3 07:44:31 vm01 kernel: <EOI> [<ffffffff8103eacb>] ? native_safe_halt+0xb/0x10
Feb 3 07:44:31 vm01 kernel: [<ffffffff810167bd>] ? default_idle+0x4d/0xb0
Feb 3 07:44:31 vm01 kernel: [<ffffffff810168bd>] ? c1e_idle+0x9d/0x120
Feb 3 07:44:31 vm01 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Feb 3 07:44:31 vm01 kernel: [<ffffffff81520e2c>] ? start_secondary+0x2ac/0x2ef
Feb 3 07:44:31 vm01 kernel: ---[ end trace 9d97672873a67a1f ]---

This looks like an error in the kernel. You should consult kernel/rpcrdma folks with this error.

Which causes sanlock to fail to update the lease on the storage (expected):

Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 check_our_lease failed 80

Then sanlock tries to kill vdsm, the owner of the lease:

Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 kill 3030 sig 15 count 1
...
Feb 3 07:45:06 vm01 sanlock[2536]: 2014-02-03 07:45:06-0600 240698 [2536]: s1 kill 3030 sig 15 count 10
Feb 3 07:45:07 vm01 sanlock[2536]: 2014-02-03 07:45:07-0600 240698 [2536]: dead 3030 ci 3 count 10

This makes the host Non-Responsive (expected).

Now vdsm is restarted, which will make it responsive again:

Feb 3 07:45:07 vm01 respawn: slave '/usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid' died, respawning slave

But since there is no access to storage, the host is Non Operational (expected).

The vm was starting a migration to the other host:

Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper) client [192.168.202.99]::call vmMigrate with ({'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {} flowID [7829ae2a]
Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate) {'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'}
Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}

The migration was almost complete after 20 seconds:

Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds elapsed, 99% of data processed, 99% of mem processed

But it never completed:

Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't progressed in 300.054134846 seconds. Aborting.

CCing Michal to inspect why the migration has failed.
Attached is tar of the logs from vm01, vm02 and the engine server.
vm01 & vm02 folders contain files from '/var/log/messages /var/log/sanlock.log /var/log/vdsm/*.log' engine from '/var/log/messages /var/log/ovirt-engine/*.log'
Thanks - Trey
On Sun, Feb 9, 2014 at 4:15 PM, Itamar Heim <iheim@redhat.com> wrote:
On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup for production use on our HPC system for managing our VM infrastructure. Currently I'm trying to utilize our DDR InfiniBand fabric for the storage domains in oVirt using NFS over RDMA. I've noticed some unstable behavior and it seems in every case to begin with sanlock.
The ovirt web admin interface shows the following message as first sign of trouble on 2014-Feb-03 07:45.
"Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error during communication with the Host.).".
The single VM I had running is stuck in the "Migrating From" state. virsh shows the VM paused on the crashed host and the one it attempted to migrate to.
Right now I have a few concerns.
1) The cause of the sanlock (or other instability) and if it's related to a bug or an issue using NFSoRDMA.
vdsm and sanlock seem to behave as they should when storage is not accessible.
2) Why the VM failed to migrate if the second host had no issues. If
Virt team will have to answer this.
the first host is down should the VM be considered offline and booted on the second host after first is fenced?
The host was not fenced, and was not down. It was up and the vm was still running, possibly accessing the storage.
Attached are logs from the failed host (vm01) and the healthy host (vm02) as well as engine. The failed host's /var/log/message is also attached (vm01_message.log).
Thanks - Trey
was this resolved?
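For readers hitting the same failure mode, the sequence described above (lease renewal fails, sanlock escalates SIGTERM against the lease owner, then declares it dead) can be pulled out of the logs with a small illustrative Python sketch; it assumes the sanlock message format quoted in this thread and the default /var/log/sanlock.log location, and is not part of sanlock or oVirt:

#!/usr/bin/env python
# Illustrative sketch: print a timeline of the sanlock events discussed above
# ("check_our_lease failed", "kill <pid> sig <n> count <n>", "dead <pid> ...").
# The message format is assumed to match the lines quoted in this thread.
import re
import sys

EVENT = re.compile(
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\S*).*?"
    r"(?P<event>check_our_lease failed \d+"
    r"|kill \d+ sig \d+ count \d+"
    r"|dead \d+ ci \d+ count \d+)"
)

def timeline(path):
    with open(path) as log:
        for line in log:
            match = EVENT.search(line)
            if match:
                yield match.group("stamp"), match.group("event")

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/sanlock.log"
    for stamp, event in timeline(path):
        print("%s  %s" % (stamp, event))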

----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Trey Dockendorf" <treydock@gmail.com> Cc: "users" <users@ovirt.org>, "Michal Skrivanek" <mskrivan@redhat.com> Sent: Wednesday, February 12, 2014 10:04:04 AM Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
[...]
The vm was starting a migration to the other host:
Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper) client [192.168.202.99]::call vmMigrate with ({'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {} flowID [7829ae2a]
Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate) {'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'}
Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
The migration was almost complete after 20 seconds:
Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds elapsed, 99% of data processed, 99% of mem processed
But it never completed:
Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't progressed in 300.054134846 seconds. Aborting.
CCing Michal to inspect why the migration has failed.
Hi,

I had a look at the logs, and this looks like another libvirt/QEMU I/O-related issue.

If QEMU on the source host cannot reliably access storage, migration may get stuck; this seems to be the case given the information provided. VDSM detected that the migration was not progressing and aborted it.

libvirt has an option (which we already use) to detect those scenarios, the VIR_MIGRATE_ABORT_ON_ERROR flag, but unfortunately this is not 100% reliable yet, for reasons outlined below in this mail. We are aware of this issue and are actively working to improve the handling of such scenarios, but most of this work is actually on the QEMU side.

The core issue here is that when we use NFS (or iSCSI) and there is an I/O error, QEMU can get blocked inside the kernel, waiting for the faulty I/O operation to complete, and thus fail to report an I/O error. It really depends on which specific operation fails, and there are many possible cases and error scenarios. Of course, if QEMU is blocked and fails to report the I/O error, libvirt can do nothing to report or recover from the error, so VDSM can do even less. This is known and acknowledged by both the libvirt and QEMU developers.

But there is some good news, because newer versions of QEMU have improvements in this area: QEMU recently gained native block device drivers[1], which, among other things, will make it more robust in the presence of I/O errors and should improve error reporting as well. RHEL7 should have a version of QEMU with native iSCSI; hopefully NFS will follow soon enough.

HTH,

[1] for example, iSCSI, recently merged: http://comments.gmane.org/gmane.comp.emulators.qemu/92599 - work on NFS is ongoing.

--
Francesco Romani
Red Hat Engineering Virtualization R&D
IRC: fromani
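To make the "no progress for 300 seconds, aborting" behaviour above concrete, here is a simplified Python sketch of a migration progress watchdog. It illustrates the idea only and is not VDSM's actual monitor; get_progress and abort are hypothetical callables the caller would supply (for example, thin wrappers around libvirt's job-info and abort-job calls):

import time

def watch_migration(get_progress, abort, stall_timeout=300, poll_interval=10):
    """Abort a migration whose progress value stops changing for too long.

    get_progress() -> number (e.g. percent of data processed); abort() cancels
    the migration. Both are hypothetical hooks supplied by the caller.
    """
    last_progress = None
    last_change = time.time()
    while True:
        progress = get_progress()
        if progress >= 100:
            return True                       # migration completed
        if progress != last_progress:
            last_progress = progress          # progress moved; reset the clock
            last_change = time.time()
        elif time.time() - last_change > stall_timeout:
            abort()                           # stuck: no progress within the timeout
            return False
        time.sleep(poll_interval)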

Thanks for the thorough response!
This looks like an error in the kernel. You should consult kernel/rpcrdma folks with this error.
Mind pointing me in the direction of how to get in contact with the appropriate mailing list to begin a dialog with the kernel/rpcrdma folks?

Thanks - Trey

On Wed, Feb 12, 2014 at 3:04 AM, Nir Soffer <nsoffer@redhat.com> wrote:
----- Original Message -----
From: "Trey Dockendorf" <treydock@gmail.com> To: "Itamar Heim" <iheim@redhat.com> Cc: "users" <users@ovirt.org> Sent: Monday, February 10, 2014 3:03:05 AM Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
No, in fact I just had the issue arise again after trying to figure out what about my setup causes this crash. So far it only seems to occur if both nodes are running NFS over RDMA, but I'm unsure if it's VM traffic or the host being SPM that causes it to misbehave.
vm02 was running a single VM and was SPM. The crash was on vm02: "Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm02, Error: Network error during communication with the Host).". SPM successfully switched to vm01 but the VM is stuck in migration and unresponsive. Both engine and nodes are using oVirt 3.3.3.
vm01 and vm02 both have the following in vdsm.conf
[addresses]
management_port = 54321

[vars]
ssl = true

[irs]
nfs_mount_options = rdma,port=20049
These are the oVirt NFS mount lines in /proc/mounts for each:
vm01:
192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
vm02:
192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
The NFS server had these 2 log entries in /var/log/messages around the time vm02 went Non-operational.
Feb 9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882014683400
Feb 9 17:28:21 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882025bf1400
This looks like the root cause - failure on the storage server
This leads to failure in the hosts connected to this storage:
Feb 2 13:37:11 vm01 kernel: rpcrdma: connection to 192.168.211.245:20049 closed (-103)
...
Feb 3 07:44:31 vm01 kernel: ------------[ cut here ]------------
Feb 3 07:44:31 vm01 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
Feb 3 07:44:31 vm01 kernel: Hardware name: H8DMT-IBX
Feb 3 07:44:31 vm01 kernel: Modules linked in: ebt_arp xprtrdma nfs fscache auth_rpcgss nfs_acl bonding ebtable_nat ebtables softdog lockd sunrpc powernow_k8 freq_table mperf 8021q garp bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_amd kvm microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd igb dca i2c_algo_bit ptp pps_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core sg i2c_nforce2 i2c_core ext4 jbd2 mbcache raid1 sd_mod crc_t10dif sata_nv ata_generic pata_acpi pata_amd dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Feb 3 07:44:31 vm01 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.3.1.el6.x86_64 #1
Feb 3 07:44:31 vm01 kernel: Call Trace:
Feb 3 07:44:31 vm01 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
Feb 3 07:44:31 vm01 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8152a4fb>] ? _spin_unlock_bh+0x1b/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffffa044c4f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa044579c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa02782fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
Feb 3 07:44:31 vm01 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffffa027b450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa019a1cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa0138c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa01bd396>] ? igb_poll+0xb66/0x1020 [igb]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa0126057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a9a0>] ? posix_timer_fn+0x0/0xe0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a982>] ? posix_timer_event+0x42/0x60
Feb 3 07:44:31 vm01 kernel: [<ffffffff810a7159>] ? ktime_get+0x69/0xf0
Feb 3 07:44:31 vm01 kernel: [<ffffffffa01267c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffff810e6ed0>] ? handle_IRQ_event+0x60/0x170
Feb 3 07:44:31 vm01 kernel: [<ffffffff810e982e>] ? handle_edge_irq+0xde/0x180
Feb 3 07:44:31 vm01 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
Feb 3 07:44:31 vm01 kernel: [<ffffffff81530fec>] ? do_IRQ+0x6c/0xf0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Feb 3 07:44:31 vm01 kernel: <EOI> [<ffffffff8103eacb>] ? native_safe_halt+0xb/0x10
Feb 3 07:44:31 vm01 kernel: [<ffffffff810167bd>] ? default_idle+0x4d/0xb0
Feb 3 07:44:31 vm01 kernel: [<ffffffff810168bd>] ? c1e_idle+0x9d/0x120
Feb 3 07:44:31 vm01 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Feb 3 07:44:31 vm01 kernel: [<ffffffff81520e2c>] ? start_secondary+0x2ac/0x2ef
Feb 3 07:44:31 vm01 kernel: ---[ end trace 9d97672873a67a1f ]---
This looks like an error in the kernel. You should consult kernel/rpcrdma folks with this error.
Which causes sanlock to fail to update the lease on the storage (expected):
Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 check_our_lease failed 80
Then sanlock tries to kill vdsm, the owner of the lease:
Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 kill 3030 sig 15 count 1
...
Feb 3 07:45:06 vm01 sanlock[2536]: 2014-02-03 07:45:06-0600 240698 [2536]: s1 kill 3030 sig 15 count 10
Feb 3 07:45:07 vm01 sanlock[2536]: 2014-02-03 07:45:07-0600 240698 [2536]: dead 3030 ci 3 count 10
This makes the host Non-Responsive (expected).
Now vdsm is restarted, which will make it responsive again:
Feb 3 07:45:07 vm01 respawn: slave '/usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid' died, respawning slave
But since there is no access to storage, the host is Non Operational (expected).
The vm was starting a migration to the other host:
Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper) client [192.168.202.99]::call vmMigrate with ({'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {} flowID [7829ae2a]
Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate) {'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'}
Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
The migration was almost complete after 20 seconds:
Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds elapsed, 99% of data processed, 99% of mem processed
But it never completed:
Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't progressed in 300.054134846 seconds. Aborting.
CCing Michal to inspect why the migration has failed.
Attached is tar of the logs from vm01, vm02 and the engine server.
vm01 & vm02 folders contain files from '/var/log/messages /var/log/sanlock.log /var/log/vdsm/*.log' engine from '/var/log/messages /var/log/ovirt-engine/*.log'
Thanks - Trey
On Sun, Feb 9, 2014 at 4:15 PM, Itamar Heim <iheim@redhat.com> wrote:
On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup for production use on our HPC system for managing our VM infrastructure. Currently I'm trying to utilize our DDR InfiniBand fabric for the storage domains in oVirt using NFS over RDMA. I've noticed some unstable behavior and it seems in every case to begin with sanlock.
The ovirt web admin interface shows the following message as first sign of trouble on 2014-Feb-03 07:45.
"Invalid status on Data Center Default. Setting Data Center status to Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error during communication with the Host.).".
The single VM I had running is stuck in the "Migrating From" state. virsh shows the VM paused on the crashed host and the one it attempted to migrate to.
Right now I have a few concerns.
1) The cause of the sanlock (or other instability) and if it's related to a bug or an issue using NFSoRDMA.
vdsm and sanlock seem to behave as they should when storage is not accessible.
2) Why the VM failed to migrate if the second host had no issues. If
Virt team will have to answer this.
the first host is down should the VM be considered offline and booted on the second host after first is fenced?
The host was not fenced, and was not down. It was up and the vm was still running, possibly accessing the storage.
Attached are logs from the failed host (vm01) and the healthy host (vm02) as well as engine. The failed host's /var/log/message is also attached (vm01_message.log).
Thanks - Trey
was this resolved?

----- Original Message -----
From: "Trey Dockendorf" <treydock@gmail.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "users" <users@ovirt.org>, "Michal Skrivanek" <mskrivan@redhat.com>, "Ayal Baron" <abaron@redhat.com>, "Itamar Heim" <iheim@redhat.com> Sent: Wednesday, February 19, 2014 2:20:52 AM Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
Thanks for the thorough response!
This looks like an error in the kernel. You should consult kernel/rpcrdma folks with this error.
Mind pointing me in the direction of how to get in contact with the appropriate mailing list to begin a dialog with the kernel/rpcrdma folks?
Probably via the kernel mailing list, IRC channel, or bug-tracking system of your distribution.

Nir
participants (4)
- Francesco Romani
- Itamar Heim
- Nir Soffer
- Trey Dockendorf