[Users] Host Non-Operational from sanlock and VM fails to migrate

Nir Soffer nsoffer at redhat.com
Wed Feb 12 09:04:04 UTC 2014


----- Original Message -----
> From: "Trey Dockendorf" <treydock at gmail.com>
> To: "Itamar Heim" <iheim at redhat.com>
> Cc: "users" <users at ovirt.org>
> Sent: Monday, February 10, 2014 3:03:05 AM
> Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
> 
> No, in fact I just had the issue arise again after trying to figure
> out what about my setup causes this crash.  So far it only seems to
> occur if both nodes are running NFS over RDMA, but I'm unsure if it's
> VM traffic or the host being SPM that causes it to misbehave.
> 
> vm02 was running a single VM and was SPM.  The crash was on vm02
> "Invalid status on Data Center Default.  Setting Data Center status to
> Non Responsive (On host vm02, Error: Network error during
> communication with the Host).".  SPM successfully switched to vm01 but
> the VM is stuck in migration and unresponsive.  Both engine and nodes
> using ovirt 3.3.3.
> 
> vm01 and vm02 both have the following in vdsm.conf
> 
> [addresses]
> management_port = 54321
> 
> [vars]
> ssl = true
> 
> 
> [irs]
> nfs_mount_options = rdma,port=20049
> 
> This is the ovirt NFS mount lines in /proc/mounts for each:
> 
> vm01:
> 
> 192.168.211.245:/tank/ovirt/import_export
> /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs
> rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
> 0 0
> 192.168.211.245:/tank/ovirt/iso
> /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs
> rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
> 0 0
> 192.168.211.245:/tank/ovirt/data
> /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs
> rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
> 0 0
> 
> vm02:
> 
> 192.168.211.245:/tank/ovirt/import_export
> /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs
> rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
> 0 0
> 192.168.211.245:/tank/ovirt/iso
> /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs
> rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
> 0 0
> 192.168.211.245:/tank/ovirt/data
> /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs
> rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245
> 0 0
> 
> The NFS server had these 2 log entries in /var/log/messages around the
> time vm02 went Non-operational.
> 
> Feb  9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering
> memory for xprt ffff882014683400
> Feb  9 17:28:21 vmstore1 kernel: svcrdma: Error fast registering
> memory for xprt ffff882025bf1400

This looks like the root cause: a failure on the storage server.

This leads to failures on the hosts connected to this storage:

Feb  2 13:37:11 vm01 kernel: rpcrdma: connection to 192.168.211.245:20049 closed (-103)
...
Feb  3 07:44:31 vm01 kernel: ------------[ cut here ]------------
Feb  3 07:44:31 vm01 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
Feb  3 07:44:31 vm01 kernel: Hardware name: H8DMT-IBX
Feb  3 07:44:31 vm01 kernel: Modules linked in: ebt_arp xprtrdma nfs fscache auth_rpcgss nfs_acl bonding ebtable_nat ebtables softdog lockd sunrpc powernow_k8 freq_table mperf 
8021q garp bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_
conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_a
md kvm microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd igb dca i2c_algo_bit ptp pps_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core sg i2c_nforce2 i2c_cor
e ext4 jbd2 mbcache raid1 sd_mod crc_t10dif sata_nv ata_generic pata_acpi pata_amd dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Feb  3 07:44:31 vm01 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.3.1.el6.x86_64 #1
Feb  3 07:44:31 vm01 kernel: Call Trace:
Feb  3 07:44:31 vm01 kernel: <IRQ>  [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
Feb  3 07:44:31 vm01 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
Feb  3 07:44:31 vm01 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
Feb  3 07:44:31 vm01 kernel: [<ffffffff8152a4fb>] ? _spin_unlock_bh+0x1b/0x20
Feb  3 07:44:31 vm01 kernel: [<ffffffffa044c4f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
Feb  3 07:44:31 vm01 kernel: [<ffffffffa044579c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
Feb  3 07:44:31 vm01 kernel: [<ffffffffa02782fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
Feb  3 07:44:31 vm01 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb  3 07:44:31 vm01 kernel: [<ffffffffa027b450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
Feb  3 07:44:31 vm01 kernel: [<ffffffffa019a1cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
Feb  3 07:44:31 vm01 kernel: [<ffffffffa0138c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
Feb  3 07:44:31 vm01 kernel: [<ffffffffa01bd396>] ? igb_poll+0xb66/0x1020 [igb]
Feb  3 07:44:31 vm01 kernel: [<ffffffffa0126057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
Feb  3 07:44:31 vm01 kernel: [<ffffffff8109a9a0>] ? posix_timer_fn+0x0/0xe0
Feb  3 07:44:31 vm01 kernel: [<ffffffff8109a982>] ? posix_timer_event+0x42/0x60
Feb  3 07:44:31 vm01 kernel: [<ffffffff810a7159>] ? ktime_get+0x69/0xf0
Feb  3 07:44:31 vm01 kernel: [<ffffffffa01267c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
Feb  3 07:44:31 vm01 kernel: [<ffffffff810e6ed0>] ? handle_IRQ_event+0x60/0x170
Feb  3 07:44:31 vm01 kernel: [<ffffffff810e982e>] ? handle_edge_irq+0xde/0x180
Feb  3 07:44:31 vm01 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
Feb  3 07:44:31 vm01 kernel: [<ffffffff81530fec>] ? do_IRQ+0x6c/0xf0
Feb  3 07:44:31 vm01 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Feb  3 07:44:31 vm01 kernel: <EOI>  [<ffffffff8103eacb>] ? native_safe_halt+0xb/0x10
Feb  3 07:44:31 vm01 kernel: [<ffffffff810167bd>] ? default_idle+0x4d/0xb0
Feb  3 07:44:31 vm01 kernel: [<ffffffff810168bd>] ? c1e_idle+0x9d/0x120
Feb  3 07:44:31 vm01 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Feb  3 07:44:31 vm01 kernel: [<ffffffff81520e2c>] ? start_secondary+0x2ac/0x2ef
Feb  3 07:44:31 vm01 kernel: ---[ end trace 9d97672873a67a1f ]---

This looks like an error in the kernel. You should consult the kernel/rpcrdma folks about this error.
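When reporting this, it helps to collect every RDMA transport error from both the server and the clients in one place. A minimal sketch; the excerpt below is a hypothetical stand-in for the real log file, which you would grep directly from /var/log/messages:

```shell
# Collect svcrdma/rpcrdma kernel messages for a bug report.
# In practice: grep -E 'svcrdma|rpcrdma' /var/log/messages
# The here-document is a sample excerpt standing in for the real log.
grep -E 'svcrdma|rpcrdma' <<'EOF'
Feb  9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882014683400
Feb  2 13:37:11 vm01 kernel: rpcrdma: connection to 192.168.211.245:20049 closed (-103)
Feb  9 17:28:00 vmstore1 kernel: igb: unrelated interface message
EOF
```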

This causes sanlock to fail to renew its lease on the storage (expected):

Feb  3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 check_our_lease failed 80

Then sanlock tries to kill vdsm, the owner of the lease:

Feb  3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 kill 3030 sig 15 count 1
...
Feb  3 07:45:06 vm01 sanlock[2536]: 2014-02-03 07:45:06-0600 240698 [2536]: s1 kill 3030 sig 15 count 10
Feb  3 07:45:07 vm01 sanlock[2536]: 2014-02-03 07:45:07-0600 240698 [2536]: dead 3030 ci 3 count 10
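The sequence above (sig 15 sent with an increasing count, then the owner declared dead) can be sketched as a small model. This is an illustration of the observed behavior, not sanlock's actual implementation; the SIGKILL escalation branch is an assumption added for completeness:

```python
import signal

def escalate(pid, alive, send, max_sigterm=10):
    """Illustrative model of sanlock's kill escalation (not its real code).

    After check_our_lease fails, the lease owner gets SIGTERM once per
    attempt; if it survives max_sigterm attempts it gets SIGKILL.
    `alive` and `send` are injected so the model needs no real process.
    """
    for count in range(1, max_sigterm + 1):
        if not alive(pid):
            return ("dead", count - 1)   # owner exited before this attempt
        send(pid, signal.SIGTERM)        # "s1 kill <pid> sig 15 count N"
    if alive(pid):
        send(pid, signal.SIGKILL)        # assumed escalation if SIGTERM is ignored
        return ("killed", max_sigterm)
    return ("dead", max_sigterm)         # matches "dead 3030 ci 3 count 10"
```

In the log above vdsm (pid 3030) exited after the tenth SIGTERM, so the model returns ("dead", 10) for that case.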

This makes the host Non-Responsive (expected).

vdsm is then restarted, which makes the host Responsive again:

Feb  3 07:45:07 vm01 respawn: slave '/usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid' died, respawning slave

But since there is still no access to storage, the host becomes Non Operational (expected).

Meanwhile, the VM was starting a migration to the other host:

Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper) client [192.168.202.99]::call vmMigrate with ({'tunneled': 'false', 'dstqemu': '192.168.202.103', 
'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {} flowID [7829ae2a]
Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate) {'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.
edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'}
Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}

The migration was almost complete after 20 seconds:

Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds elapsed, 99% of data processed, 99% of mem processed

But it never completed:

Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't progressed in 300.054134846 seconds. Aborting.
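The abort decision seen here can be modelled simply: sample migration progress periodically, and abort when the percentage has not advanced for longer than a stall timeout (300 seconds in this log). A minimal sketch of that rule, not vdsm's actual code:

```python
def should_abort(samples, stall_timeout=300):
    """Return True if migration progress has stalled.

    `samples` is a chronological list of (elapsed_seconds, percent_done)
    tuples.  The migration is considered stuck when the newest sample is
    more than stall_timeout seconds past the last percentage increase.
    """
    if not samples:
        return False
    last_change_time, last_pct = samples[0]
    for t, pct in samples[1:]:
        if pct > last_pct:              # progress was made; reset the clock
            last_change_time, last_pct = t, pct
    latest_time = samples[-1][0]
    return latest_time - last_change_time > stall_timeout
```

With the numbers from this log (99% at 20 seconds, then no movement), the rule trips once more than 300 seconds pass without a change, matching the "Hasn't progressed in 300.05 seconds. Aborting." warning.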

CCing Michal to inspect why the migration failed.

> 
> Attached is tar of the logs from vm01, vm02 and the engine server.
> 
> vm01 & vm02 folders contain files from '/var/log/messages
> /var/log/sanlock.log /var/log/vdsm/*.log'
> engine from '/var/log/messages /var/log/ovirt-engine/*.log'
> 
> Thanks
> - Trey
> 
> On Sun, Feb 9, 2014 at 4:15 PM, Itamar Heim <iheim at redhat.com> wrote:
> > On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
> >>
> >> I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup
> >> for production use on our HPC system for managing our VM
> >> infrastructure.  Currently I'm trying to utilize our DDR InfiniBand
> >> fabric for the storage domains in oVirt using NFS over RDMA.  I've
> >> noticed some unstable behavior and it seems in every case to begin
> >> with sanlock.
> >>
> >> The ovirt web admin interface shows the following message as first
> >> sign of trouble on 2014-Feb-03 07:45.
> >>
> >> "Invalid status on Data Center Default. Setting Data Center status to
> >> Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error
> >> during communication with the Host.).".
> >>
> >> The single VM I had running is stuck in the "Migrating From" state.
> >> virsh shows the VM paused on the crashed host and the one it attempted
> >> to migrate to.
> >>
> >> Right now I have a few concerns.
> >>
> >> 1) The cause of the sanlock (or other instability) and if it's related
> >> to a bug or an issue using NFSoRDMA.

vdsm and sanlock seem to behave as they should when storage is not accessible.
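As a side note, when comparing the two hosts' NFS-over-RDMA setup, the mount lines quoted earlier can be checked mechanically rather than by eye. A minimal sketch, assuming the standard /proc/mounts field layout (device, mount point, fstype, options, dump, pass); the sample line in the usage below is taken from this thread:

```python
def rdma_mounts(proc_mounts_text, port="20049"):
    """Return mount points that use NFS over RDMA on the given port."""
    found = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = fields[3].split(",")
        if "proto=rdma" in opts and ("port=%s" % port) in opts:
            found.append(fields[1])   # the mount point
    return found
```

Running it over each host's /proc/mounts and comparing the results would quickly show whether vm01 and vm02 really have identical RDMA mounts.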

> >> 2) Why the VM failed to migrate if the second host had no issues.  If

Virt team will have to answer this.

> >> the first host is down should the VM be considered offline and booted
> >> on the second host after first is fenced?

The host was not fenced, and it was not down. It was up and the VM was still
running, possibly accessing the storage.

> >>
> >> Attached are logs from the failed host (vm01) and the healthy host
> >> (vm02) as well as engine.  The failed host's /var/log/message is also
> >> attached (vm01_message.log).
> >>
> >> Thanks
> >> - Trey
> >>
> >>
> >>
> >> _______________________________________________
> >> Users mailing list
> >> Users at ovirt.org
> >> http://lists.ovirt.org/mailman/listinfo/users
> >>
> >
> > was this resolved?
> 
> 


