[ovirt-users] Ovirt 3.5 host gluster storage connection failure

Sahina Bose sabose at redhat.com
Wed Dec 23 06:14:37 UTC 2015


[+ Ravi, Pranith]

On 12/23/2015 06:00 AM, Steve Dainard wrote:
> I have two hosts; only one of them was running VMs at the time of
> this crash, so I can't tell if this is a node-specific problem.
>
> rpm -qa | egrep -i 'gluster|vdsm|libvirt' |sort
> glusterfs-3.6.7-1.el7.x86_64
> glusterfs-api-3.6.7-1.el7.x86_64
> glusterfs-cli-3.6.7-1.el7.x86_64
> glusterfs-fuse-3.6.7-1.el7.x86_64
> glusterfs-libs-3.6.7-1.el7.x86_64
> glusterfs-rdma-3.6.7-1.el7.x86_64
> libvirt-client-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-driver-interface-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-driver-network-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-driver-qemu-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-driver-secret-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-driver-storage-1.2.8-16.el7_1.5.x86_64
> libvirt-daemon-kvm-1.2.8-16.el7_1.5.x86_64
> libvirt-lock-sanlock-1.2.8-16.el7_1.5.x86_64
> libvirt-python-1.2.8-7.el7_1.1.x86_64
> vdsm-4.16.30-0.el7.centos.x86_64
> vdsm-cli-4.16.30-0.el7.centos.noarch
> vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
> vdsm-python-4.16.30-0.el7.centos.noarch
> vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
> vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
> vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
>
> VMs were in a paused state, with errors in the UI:
>
> 2015-Dec-22, 15:06
> VM pcic-apps has paused due to unknown storage error.
> 2015-Dec-22, 15:06
> Host compute2 is not responding. It will stay in Connecting state for
> a grace period of 82 seconds and after that an attempt to fence the
> host will be issued.
> 2015-Dec-22, 15:03
> Invalid status on Data Center EDC2. Setting Data Center status to Non
> Responsive (On host compute2, Error: General Exception).
> 2015-Dec-22, 15:03
> VM pcic-storage has paused due to unknown storage error.
> 2015-Dec-22, 15:03
> VM docker1 has paused due to unknown storage error.
>
> VDSM logs look normal until:
> Dummy-99::DEBUG::2015-12-22
> 23:03:58,949::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
> dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
> iflag=direct,fullblock count=1 bs=1024000 (cwd None)
> Dummy-99::DEBUG::2015-12-22
> 23:03:58,963::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
> SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0
> MB) copied, 0.00350501 s, 292 MB/s\n'; <rc> = 0
> VM Channels Listener::INFO::2015-12-22
> 23:03:59,527::guestagent::180::vm.Vm::(_handleAPIVersion)
> vmId=`7067679e-43aa-43c0-b263-b0a711ade2e2`::Guest API version changed
> from 2 to 1
> Thread-245428::DEBUG::2015-12-22
> 23:03:59,718::libvirtconnection::151::root::(wrapper) Unknown
> libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found:
> Requested metadata element is not present
> libvirtEventLoop::INFO::2015-12-22
> 23:04:00,447::vm::4982::vm.Vm::(_onIOError)
> vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device
> virtio-disk0 error eother
> libvirtEventLoop::DEBUG::2015-12-22
> 23:04:00,447::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent)
> vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::event Suspended detail 2
> opaque None
> libvirtEventLoop::INFO::2015-12-22
> 23:04:00,447::vm::4982::vm.Vm::(_onIOError)
> vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device
> virtio-disk0 error eother
> ...
> libvirtEventLoop::INFO::2015-12-22
> 23:04:00,843::vm::4982::vm.Vm::(_onIOError)
> vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::abnormal vm stop device
> virtio-disk1 error eother
> libvirtEventLoop::DEBUG::2015-12-22
> 23:04:00,844::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent)
> vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::event Suspended detail 2
> opaque None
> Dummy-99::DEBUG::2015-12-22
> 23:04:00,973::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
> dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
> iflag=direct,fullblock count=1 bs=1024000 (cwd None)
> Dummy-99::DEBUG::2015-12-22
> 23:04:00,983::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
> FAILED: <err> = "dd: failed to open
> '/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox':
> Transport endpoint is not connected\n"; <rc> = 1
> Dummy-99::ERROR::2015-12-22
> 23:04:00,983::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run)
> Error checking for mail
> Traceback (most recent call last):
>    File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
>      self._checkForMail()
>    File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
>      "Could not read mailbox: %s" % self._inbox)
> IOError: [Errno 5] _handleRequests._checkForMail - Could not read
> mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
> Dummy-99::DEBUG::2015-12-22
> 23:04:02,987::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
> dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
> iflag=direct,fullblock count=1 bs=1024000 (cwd None)
> Dummy-99::DEBUG::2015-12-22
> 23:04:02,994::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
> FAILED: <err> = "dd: failed to open
> '/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox':
> Transport endpoint is not connected\n"; <rc> = 1
> Dummy-99::ERROR::2015-12-22
> 23:04:02,994::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run)
> Error checking for mail
> Traceback (most recent call last):
>    File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
>      self._checkForMail()
>    File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
>      "Could not read mailbox: %s" % self._inbox)
> IOError: [Errno 5] _handleRequests._checkForMail - Could not read
> mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
> Thread-563692::DEBUG::2015-12-22
> 23:04:03,539::__init__::481::jsonrpc.JsonRpcServer::(_serveRequest)
> Calling 'StoragePool.getSpmStatus' in bridge with {u'storagepoolID':
> u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff'}
> Thread-563692::DEBUG::2015-12-22
> 23:04:03,540::task::595::Storage.TaskManager.Task::(_updateState)
> Task=`6121360f-af29-48cf-8a8c-957480472a9d`::moving from state init ->
> state preparing
> Thread-563692::INFO::2015-12-22
> 23:04:03,540::logUtils::44::dispatcher::(wrapper) Run and protect:
> getSpmStatus(spUUID=u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff',
> options=None)
> Thread-563692::ERROR::2015-12-22
> 23:04:03,541::task::866::Storage.TaskManager.Task::(_setError)
> Task=`6121360f-af29-48cf-8a8c-957480472a9d`::Unexpected error
> Traceback (most recent call last):
>    File "/usr/share/vdsm/storage/task.py", line 873, in _run
>      return fn(*args, **kargs)
>    File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
>      res = f(*args, **kwargs)
>    File "/usr/share/vdsm/storage/hsm.py", line 633, in getSpmStatus
>      status = self._getSpmStatusInfo(pool)
>    File "/usr/share/vdsm/storage/hsm.py", line 627, in _getSpmStatusInfo
>      (pool.spmRole,) + pool.getSpmStatus()))
>    File "/usr/share/vdsm/storage/sp.py", line 126, in getSpmStatus
>      return self._backend.getSpmStatus()
>    File "/usr/share/vdsm/storage/spbackends.py", line 416, in getSpmStatus
>      lVer, spmId = self.masterDomain.inquireClusterLock()
>    File "/usr/share/vdsm/storage/sd.py", line 515, in inquireClusterLock
>      return self._clusterLock.inquire()
>    File "/usr/share/vdsm/storage/clusterlock.py", line 321, in inquire
>      resource = sanlock.read_resource(self._leasesPath, SDM_LEASE_OFFSET)
> SanlockException: (107, 'Sanlock resource read failure', 'Transport
> endpoint is not connected')
> ...
>
> I noticed the gluster mount point was still listed when running
> 'mount', but when I tried to ls the mount point I got a "transport
> endpoint is not connected" error.
>
> I haven't been able to find anything more interesting in the logs,
> and the journal is blank before the recent host restart.
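
If the journal is only kept in memory it won't survive a reboot;
enabling persistent journald storage (standard systemd behaviour,
nothing oVirt-specific) keeps the logs around for the next crash:

  # journald switches to persistent storage once this directory exists
  mkdir -p /var/log/journal
  systemctl restart systemd-journald
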
>
> In the gluster client logs from the host I see:
> ...
> [2015-12-22 22:52:23.238369] W [fuse-bridge.c:1261:fuse_err_cbk]
> 0-glusterfs-fuse: 31263720: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1
> (No data available)
> [2015-12-22 22:57:23.778018] W [fuse-bridge.c:1261:fuse_err_cbk]
> 0-glusterfs-fuse: 31316031: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1
> (No data available)
> [2015-12-22 23:02:24.376491] W [fuse-bridge.c:1261:fuse_err_cbk]
> 0-glusterfs-fuse: 31473552: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1
> (No data available)
> pending frames:
> frame : type(1) op(FSTAT)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(1) op(READ)
> frame : type(0) op(0)
> frame : type(0) op(0)
> ...
> frame : type(1) op(READ)
> frame : type(0) op(0)
> frame : type(1) op(OPEN)
> frame : type(1) op(OPEN)
> frame : type(1) op(OPEN)
> frame : type(1) op(OPEN)
> frame : type(0) op(0)
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 6
> time of crash:
> 2015-12-22 23:04:00
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.6.7
> /lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f0d091f6392]
> /lib64/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f0d0920d88d]
> /lib64/libc.so.6(+0x35650)[0x7f0d0820f650]
> /lib64/libc.so.6(gsignal+0x37)[0x7f0d0820f5d7]
> /lib64/libc.so.6(abort+0x148)[0x7f0d08210cc8]
> /lib64/libc.so.6(+0x75e07)[0x7f0d0824fe07]
> /lib64/libc.so.6(+0x7d1fd)[0x7f0d082571fd]
> /usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client_local_wipe+0x39)[0x7f0cfe8acdf9]
> /usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client3_3_readv_cbk+0x487)[0x7f0cfe8c1197]
> /lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f0d08fca100]
> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x174)[0x7f0d08fca374]
> /lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f0d08fc62c3]
> /usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0x8790)[0x7f0d047f3790]
> /usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0xaf84)[0x7f0d047f5f84]
> /lib64/libglusterfs.so.0(+0x767c2)[0x7f0d0924b7c2]
> /usr/sbin/glusterfs(main+0x502)[0x7f0d096a0fe2]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d081fbaf5]
> /usr/sbin/glusterfs(+0x6381)[0x7f0d096a1381]


Could you provide the gluster mount logs from the client, and the 
gluster brick logs from the gluster servers?
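
On the client, the fuse mount log is normally
/var/log/glusterfs/rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage.log
(the name mirrors the mount point), and on each gluster server the
brick logs are under /var/log/glusterfs/bricks/. Roughly:

  # on the oVirt host (client)
  tar czf client-gluster-logs.tgz \
      /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log*
  # on each gluster server
  tar czf brick-logs.tgz /var/log/glusterfs/bricks/

A core dump from the crashed client process (pid 14098), if one was
written, would also help with the client3_3_readv_cbk backtrace.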


>
> In /var/log/messages:
> ...
> Dec 22 23:00:49 compute2 journal: metadata not found: Requested
> metadata element is not present
> Dec 22 23:00:59 compute2 journal: metadata not found: Requested
> metadata element is not present
> Dec 22 23:01:01 compute2 systemd: Created slice user-0.slice.
> Dec 22 23:01:01 compute2 systemd: Starting Session 1316 of user root.
> Dec 22 23:01:01 compute2 systemd: Started Session 1316 of user root.
> Dec 22 23:01:01 compute2 systemd: Failed to reset devices.list on
> /machine.slice: Invalid argument
> Dec 22 23:01:03 compute2 journal: metadata not found: Requested
> metadata element is not present
> ...
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: pending
> frames:
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
> type(1) op(FSTAT)
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
> type(1) op(READ)
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
> type(1) op(READ)
> ...
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
> type(1) op(OPEN)
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
> type(0) op(0)
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> patchset: git://git.gluster.com/glusterfs.git
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: signal
> received: 6
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: time of
> crash:
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> 2015-12-22 23:04:00
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> configuration details:
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: argp 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> backtrace 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: dlfcn 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> libpthread 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> llistxattr 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: setfsid
> 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> spinlock 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: epoll.h
> 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: xattr.h
> 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> st_atim.tv_nsec 1
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> package-string: glusterfs 3.6.7
> Dec 22 23:04:00 compute2
> rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
> ---------
> Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944341
> [948]: open error -107
> /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
> Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
> [944]: ci 8 fd 17 pid -1 recv errno 104
> Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
> [949]: open error -107
> /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
> Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
> [944]: ci 9 fd 18 pid -1 recv errno 104
> Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
> [948]: open error -1
> /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
> Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
> [948]: r1 release_token open error -107
> Dec 22 23:04:03 compute2 journal: vdsm root ERROR Panic: Unrecoverable
> errors during SPM stop process.
> Traceback (most recent call last):
>    File "/usr/share/vdsm/storage/sp.py", line 414, in stopSpm
>      self.masterDomain.releaseClusterLock()
>    File "/usr/share/vdsm/storage/sd.py", line 512, in releaseClusterLock
>      self._clusterLock.release()
>    File "/usr/share/vdsm/storage/clusterlock.py", line 342, in release
>      raise se.ReleaseLockFailure(self._sdUUID, e)
> ReleaseLockFailure: Cannot release lock:
> (u'a5a83df1-47e2-4927-9add-079199ca7ef8', SanlockException(107,
> 'Sanlock resource not released', 'Transport endpoint is not
> connected'))
> Dec 22 23:04:03 compute2 journal: End of file while reading data:
> Input/output error
> Dec 22 23:04:03 compute2 journal: End of file while reading data:
> Input/output error
> Dec 22 23:04:03 compute2 systemd: vdsmd.service: main process exited,
> code=killed, status=9/KILL
> Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on
> /machine.slice: Invalid argument
> Dec 22 23:04:03 compute2 vdsmd_init_common.sh: vdsm: Running run_final_hooks
> Dec 22 23:04:03 compute2 systemd: Unit vdsmd.service entered failed state.
> Dec 22 23:04:03 compute2 systemd: vdsmd.service holdoff time over,
> scheduling restart.
> Dec 22 23:04:03 compute2 systemd: Stopping Virtual Desktop Server Manager...
> Dec 22 23:04:03 compute2 systemd: Starting Virtual Desktop Server Manager...
> Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on
> /machine.slice: Invalid argument
>
> At this point I tried restarting vdsmd but ended up having to restart
> the host. There was network connectivity to gluster before I
> restarted, and I was able to manually mount the vm-storage volume on
> the host and ls it without issue.
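
For what it's worth, a quick way to tell whether it is the client
process, the network, or the bricks that went away (volume name and
server address taken from your logs):

  # on the host: is the fuse client for the volume still running?
  pgrep -fa 'glusterfs.*vm-storage'
  # can the host still reach the gluster server?
  ping -c 3 10.0.231.50
  # on a gluster server: are all bricks online?
  gluster volume status vm-storage
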
>
> After restarting the host I'm up and running again, but I'm sure I'll
> get hit by this again if I can't sort out what the issue is.
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users



