[Users] Host Non-Operational from sanlock and VM fails to migrate

Trey Dockendorf treydock at gmail.com
Mon Feb 10 01:03:05 UTC 2014


No, in fact I just had the issue arise again while trying to figure
out what about my setup causes this crash.  So far it only seems to
occur when both nodes are running NFS over RDMA, but I'm unsure
whether it's VM traffic or the host being SPM that causes it to
misbehave.
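In case it's useful, the quick way I've been checking which host
currently holds SPM is roughly this (run on each host; <pool-uuid> is
a placeholder for the storage pool UUID):

  vdsClient -s 0 getSpmStatus <pool-uuid>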

vm02 was running a single VM and was SPM.  The error reported on vm02
was "Invalid status on Data Center Default.  Setting Data Center
status to Non Responsive (On host vm02, Error: Network error during
communication with the Host)."  SPM successfully switched to vm01,
but the VM is stuck in migration and unresponsive.  Both the engine
and the nodes are running oVirt 3.3.3.

vm01 and vm02 both have the following in /etc/vdsm/vdsm.conf:

[addresses]
management_port = 54321

[vars]
ssl = true


[irs]
nfs_mount_options = rdma,port=20049
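
For reference, a manual mount equivalent to what vdsm does with those
options would be roughly this (a sketch; /tank/ovirt/data is one of my
exports, /mnt/nfstest is just a scratch mount point):

  mkdir -p /mnt/nfstest
  mount -t nfs -o rdma,port=20049 192.168.211.245:/tank/ovirt/data /mnt/nfstest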

These are the oVirt NFS mount lines from /proc/mounts on each host:

vm01:

192.168.211.245:/tank/ovirt/import_export
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/iso
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/data
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0

vm02:

192.168.211.245:/tank/ovirt/import_export
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/iso
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/data
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0

The NFS server had these two log entries in /var/log/messages around
the time vm02 went Non-Operational:

Feb  9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882014683400
Feb  9 17:28:21 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882025bf1400
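
These were easy to pull out with a plain grep on each box (the log
paths are just where things live on my systems):

  # on the NFS server (vmstore1)
  grep -E 'svcrdma|rpcrdma' /var/log/messages

  # on the hypervisors (vm01 and vm02)
  grep -E 'sanlock|wdmd' /var/log/messages /var/log/sanlock.log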

Attached is a tar of the logs from vm01, vm02 and the engine server.

The vm01 and vm02 folders contain /var/log/messages,
/var/log/sanlock.log and /var/log/vdsm/*.log; the engine folder
contains /var/log/messages and /var/log/ovirt-engine/*.log.
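
For reference, the bundle was put together roughly like this:

  mkdir vm01 vm02 engine
  # copy the files listed above into the matching folders, then:
  tar czf engine-vdsm-logs-20140209.tar.gz vm01 vm02 engine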

Thanks
- Trey

On Sun, Feb 9, 2014 at 4:15 PM, Itamar Heim <iheim at redhat.com> wrote:
> On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
>>
>> I have a two-node oVirt 3.3.2 cluster and am evaluating it for
>> production use on our HPC system for managing our VM
>> infrastructure.  Currently I'm trying to utilize our DDR InfiniBand
>> fabric for the storage domains in oVirt using NFS over RDMA.  I've
>> noticed some unstable behavior, and in every case it seems to begin
>> with sanlock.
>>
>> The oVirt web admin interface shows the following message as the
>> first sign of trouble, on 2014-Feb-03 07:45:
>>
>> "Invalid status on Data Center Default. Setting Data Center status to
>> Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error
>> during communication with the Host.)."
>>
>> The single VM I had running is stuck in the "Migrating From" state.
>> virsh shows the VM paused on both the crashed host and the one it
>> attempted to migrate to.
>>
>> Right now I have a few concerns.
>>
>> 1) The cause of the sanlock failures (or other instability), and
>> whether it's related to a bug or to an issue with NFSoRDMA.
>> 2) Why the VM failed to migrate when the second host had no issues.
>> If the first host is down, should the VM be considered offline and
>> booted on the second host after the first is fenced?
>>
>> Attached are logs from the failed host (vm01) and the healthy host
>> (vm02) as well as the engine.  The failed host's /var/log/messages
>> is also attached (vm01_message.log).
>>
>> Thanks
>> - Trey
>>
>
> was this resolved?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: engine-vdsm-logs-20140209.tar.gz
Type: application/x-gzip
Size: 677078 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/users/attachments/20140209/66ff9bf4/attachment-0001.gz>

