----- Original Message -----
From: "Trey Dockendorf" <treydock(a)gmail.com>
To: "Itamar Heim" <iheim(a)redhat.com>
Cc: "users" <users(a)ovirt.org>
Sent: Monday, February 10, 2014 3:03:05 AM
Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
No, in fact I just had the issue arise again after trying to figure
out what about my setup causes this crash. So far it only seems to
occur if both nodes are running NFS over RDMA, but I'm unsure if it's
VM traffic or the host being SPM that causes it to misbehave.
vm02 was running a single VM and was SPM. The crash was on vm02:
"Invalid status on Data Center Default. Setting Data Center status to
Non Responsive (On host vm02, Error: Network error during
communication with the Host)." SPM successfully switched to vm01, but
the VM is stuck in migration and unresponsive. Both the engine and the
nodes are running oVirt 3.3.3.
vm01 and vm02 both have the following in vdsm.conf
[addresses]
management_port = 54321
[vars]
ssl = true
[irs]
nfs_mount_options = rdma,port=20049
These are the oVirt NFS mount lines in /proc/mounts on each host:
vm01:
192.168.211.245:/tank/ovirt/import_export
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/iso
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/data
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
vm02:
192.168.211.245:/tank/ovirt/import_export
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/iso
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
192.168.211.245:/tank/ovirt/data
/rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs
rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
0 0
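For reference, here is a rough stdlib-only sketch of the check I can run on each
host to confirm that every oVirt NFS mount really picked up the RDMA options (the
script name and the hard-coded option strings are just illustrative):

# check_rdma_mounts.py - illustrative only; run on each hypervisor.
# Flags any oVirt NFS mount that is not using RDMA on port 20049.

EXPECTED = ('proto=rdma', 'port=20049')

def check_rdma_mounts(mounts_path='/proc/mounts'):
    bad = []
    with open(mounts_path) as f:
        for line in f:
            fields = line.split()
            mountpoint, fstype, options = fields[1], fields[2], fields[3]
            if fstype != 'nfs' or '/rhev/data-center/' not in mountpoint:
                continue
            opts = options.split(',')
            missing = [o for o in EXPECTED if o not in opts]
            if missing:
                bad.append((mountpoint, missing))
    return bad

if __name__ == '__main__':
    for mountpoint, missing in check_rdma_mounts():
        print('%s is missing %s' % (mountpoint, ','.join(missing)))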
The NFS server had these two log entries in /var/log/messages around the
time vm02 went Non-Operational.
Feb 9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering
memory for xprt ffff882014683400
Feb 9 17:28:21 vmstore1 kernel: svcrdma: Error fast registering
memory for xprt ffff882025bf1400
This looks like the root cause: a failure on the storage server.
It leads to failures on the hosts connected to this storage:
Feb 2 13:37:11 vm01 kernel: rpcrdma: connection to 192.168.211.245:20049 closed (-103)
...
Feb 3 07:44:31 vm01 kernel: ------------[ cut here ]------------
Feb 3 07:44:31 vm01 kernel: WARNING: at kernel/softirq.c:159
local_bh_enable_ip+0x7d/0xb0() (Not tainted)
Feb 3 07:44:31 vm01 kernel: Hardware name: H8DMT-IBX
Feb 3 07:44:31 vm01 kernel: Modules linked in: ebt_arp xprtrdma nfs fscache auth_rpcgss
nfs_acl bonding ebtable_nat ebtables softdog lockd sunrpc powernow_k8 freq_table mperf
8021q garp bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs
ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_round_robin dm_multipath vhost_net macvtap
macvlan tun kvm_amd kvm microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd
igb dca i2c_algo_bit ptp pps_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core sg
i2c_nforce2 i2c_core ext4 jbd2 mbcache raid1 sd_mod crc_t10dif sata_nv ata_generic
pata_acpi pata_amd dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Feb 3 07:44:31 vm01 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.3.1.el6.x86_64
#1
Feb 3 07:44:31 vm01 kernel: Call Trace:
Feb 3 07:44:31 vm01 kernel: <IRQ> [<ffffffff81071e27>] ?
warn_slowpath_common+0x87/0xc0
Feb 3 07:44:31 vm01 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8152a4fb>] ? _spin_unlock_bh+0x1b/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffffa044c4f0>] ? rpc_wake_up_status+0x70/0x80
[sunrpc]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa044579c>] ?
xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa02782fc>] ? rpcrdma_conn_func+0x9c/0xb0
[xprtrdma]
Feb 3 07:44:31 vm01 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb 3 07:44:31 vm01 kernel: [<ffffffffa027b450>] ?
rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa019a1cb>] ? mlx4_ib_qp_event+0x8b/0x100
[mlx4_ib]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa0138c54>] ? mlx4_qp_event+0x74/0xf0
[mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa01bd396>] ? igb_poll+0xb66/0x1020 [igb]
Feb 3 07:44:31 vm01 kernel: [<ffffffffa0126057>] ? mlx4_eq_int+0x557/0xcb0
[mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a9a0>] ? posix_timer_fn+0x0/0xe0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a982>] ? posix_timer_event+0x42/0x60
Feb 3 07:44:31 vm01 kernel: [<ffffffff810a7159>] ? ktime_get+0x69/0xf0
Feb 3 07:44:31 vm01 kernel: [<ffffffffa01267c4>] ? mlx4_msi_x_interrupt+0x14/0x20
[mlx4_core]
Feb 3 07:44:31 vm01 kernel: [<ffffffff810e6ed0>] ? handle_IRQ_event+0x60/0x170
Feb 3 07:44:31 vm01 kernel: [<ffffffff810e982e>] ? handle_edge_irq+0xde/0x180
Feb 3 07:44:31 vm01 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
Feb 3 07:44:31 vm01 kernel: [<ffffffff81530fec>] ? do_IRQ+0x6c/0xf0
Feb 3 07:44:31 vm01 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Feb 3 07:44:31 vm01 kernel: <EOI> [<ffffffff8103eacb>] ?
native_safe_halt+0xb/0x10
Feb 3 07:44:31 vm01 kernel: [<ffffffff810167bd>] ? default_idle+0x4d/0xb0
Feb 3 07:44:31 vm01 kernel: [<ffffffff810168bd>] ? c1e_idle+0x9d/0x120
Feb 3 07:44:31 vm01 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Feb 3 07:44:31 vm01 kernel: [<ffffffff81520e2c>] ? start_secondary+0x2ac/0x2ef
Feb 3 07:44:31 vm01 kernel: ---[ end trace 9d97672873a67a1f ]---
This looks like a kernel error. You should consult the kernel/rpcrdma folks about
this error.
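If it helps to line up the timestamps, a rough stdlib-only sketch like this (the
pattern list and file names are only examples) can pull the matching syslog lines
from the storage server and the hosts for a side-by-side comparison:

# correlate_rdma_errors.py - illustrative sketch, not a supported tool.
# Prints every syslog line mentioning the RDMA or sanlock failures above so the
# timestamps from the NFS server and the hosts can be compared.

import re
import sys

PATTERN = re.compile(r'svcrdma|rpcrdma|check_our_lease|Error fast registering')

def scan(path):
    with open(path) as f:
        return ['%s: %s' % (path, line.rstrip()) for line in f if PATTERN.search(line)]

if __name__ == '__main__':
    # e.g. python correlate_rdma_errors.py vmstore1-messages vm01-messages vm02-messages
    for logfile in sys.argv[1:]:
        for hit in scan(logfile):
            print(hit)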
The loss of storage access then causes sanlock to fail to renew its lease (expected):
Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1
check_our_lease failed 80
sanlock then tries to kill vdsm, the owner of the lease:
Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 kill 3030
sig 15 count 1
...
Feb 3 07:45:06 vm01 sanlock[2536]: 2014-02-03 07:45:06-0600 240698 [2536]: s1 kill 3030
sig 15 count 10
Feb 3 07:45:07 vm01 sanlock[2536]: 2014-02-03 07:45:07-0600 240698 [2536]: dead 3030 ci 3
count 10
This makes the host Non-Responsive (expected).
Now vdsm is restarted, which will make it responsive again:
Feb 3 07:45:07 vm01 respawn: slave '/usr/share/vdsm/vdsm --pidfile
/var/run/vdsm/vdsmd.pid' died, respawning slave
But since the host still has no access to the storage, it goes Non Operational (expected).
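For anyone not familiar with that escalation, it can be pictured roughly like this
(an illustrative sketch only, not sanlock's actual code; the constant and helper
names are made up): once the lease can no longer be renewed, the lease owner is
sent SIGTERM repeatedly and eventually declared dead.

# sanlock_escalation_sketch.py - NOT sanlock source; just a sketch of the
# behaviour visible in sanlock.log above ("kill ... sig 15 count 1..10", "dead").

import os
import signal
import time

MAX_TERM_ATTEMPTS = 10   # illustrative; matches the "count 10" seen in the log

def pid_alive(pid):
    try:
        os.kill(pid, 0)   # signal 0 only probes for existence
        return True
    except OSError:
        return False

def escalate(owner_pid, lease_renewal_ok):
    """Once the storage lease cannot be renewed, repeatedly SIGTERM its owner."""
    if lease_renewal_ok():
        return
    for attempt in range(1, MAX_TERM_ATTEMPTS + 1):
        print('kill %d sig 15 count %d' % (owner_pid, attempt))
        try:
            os.kill(owner_pid, signal.SIGTERM)
        except OSError:
            pass                     # process already gone
        time.sleep(1)
        if not pid_alive(owner_pid):
            print('dead %d count %d' % (owner_pid, attempt))
            return
    # the real sanlock escalates further (SIGKILL, watchdog) if SIGTERM is not enough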
The vm was starting a migration to the other host:
Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper) client
[192.168.202.99]::call vmMigrate with ({'tunneled': 'false',
'dstqemu': '192.168.202.103',
'src': 'vm01.brazos.tamu.edu', 'dst':
'vm02.brazos.tamu.edu:54321', 'vmId':
'741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {}
flowID [7829ae2a]
Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate) {'tunneled':
'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst':
'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57',
'method': 'online'}
Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper) return
vmMigrate with {'status': {'message': 'Migration in progress',
'code': 0}, 'progress': 0}
The migration was almost complete after 20 seconds:
Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run)
vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds elapsed, 99%
of data processed, 99% of mem processed
But it never completed:
Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run)
vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't progressed in
300.054134846 seconds. Aborting.
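That warning is vdsm's stall detection: when the migration stops making progress
for longer than the timeout (300 seconds here), it is aborted. Roughly, as an
illustrative sketch only (not vdsm's actual monitor code; the callables are
placeholders):

# migration_stall_sketch.py - simplified sketch of the kind of check behind
# "Migration is stuck: Hasn't progressed in ... seconds. Aborting."

import time

PROGRESS_TIMEOUT = 300   # seconds without progress before aborting, as in the log

def monitor(get_data_remaining, abort_migration, poll_interval=10):
    """Abort the migration if the amount of data left to copy stops shrinking."""
    smallest_seen = None
    last_progress = time.time()
    while True:
        remaining = get_data_remaining()
        if remaining == 0:
            return 'completed'
        if smallest_seen is None or remaining < smallest_seen:
            smallest_seen = remaining
            last_progress = time.time()
        elif time.time() - last_progress > PROGRESS_TIMEOUT:
            abort_migration()
            return 'aborted: stuck for %.0f seconds' % (time.time() - last_progress)
        time.sleep(poll_interval)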
CCing Michal to inspect why the migration has failed.
Attached is a tar of the logs from vm01, vm02, and the engine server.
The vm01 and vm02 folders contain /var/log/messages,
/var/log/sanlock.log and /var/log/vdsm/*.log; the engine folder
contains /var/log/messages and /var/log/ovirt-engine/*.log.
Thanks
- Trey
On Sun, Feb 9, 2014 at 4:15 PM, Itamar Heim <iheim(a)redhat.com> wrote:
> On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
>>
>> I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup
>> for production use on our HPC system for managing our VM
>> infrastructure. Currently I'm trying to utilize our DDR InfiniBand
>> fabric for the storage domains in oVirt using NFS over RDMA. I've
>> noticed some unstable behavior and it seems in every case to begin
>> with sanlock.
>>
>> The ovirt web admin interface shows the following message as first
>> sign of trouble on 2014-Feb-03 07:45.
>>
>> "Invalid status on Data Center Default. Setting Data Center status to
>> Non Responsive (On host
vm01.brazos.tamu.edu, Error: Network error
>> during communication with the Host.).".
>>
>> The single VM I had running is stuck in the "Migrating From" state.
>> virsh shows the VM paused on the crashed host and the one it attempted
>> to migrate to.
>>
>> Right now I have a few concerns.
>>
>> 1) The cause of the sanlock (or other instability) and if it's related
>> to a bug or an issue using NFSoRDMA.
vdsm and sanlock seem to behave as they should when storage is not accessible.
>> 2) Why the VM failed to migrate if the second host had no issues.
Virt team will have to answer this.
>> If the first host is down should the VM be considered offline and booted
>> on the second host after first is fenced?
The host was not fenced, and was not down. It was up and the vm was still
running, possibly accessing the storage.
>>
>> Attached are logs from the failed host (vm01) and the healthy host
>> (vm02) as well as engine. The failed host's /var/log/message is also
>> attached (vm01_message.log).
>>
>> Thanks
>> - Trey
>>
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users(a)ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
>
> was this resolved?
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users