Ovirt 3.5 host gluster storage connection failure

23 Dec 2015

      I have two hosts, only one of them was running VM's at the time of
this crash so I can't tell if this is a node specific problem.

rpm -qa | egrep -i 'gluster|vdsm|libvirt' |sort
glusterfs-3.6.7-1.el7.x86_64
glusterfs-api-3.6.7-1.el7.x86_64
glusterfs-cli-3.6.7-1.el7.x86_64
glusterfs-fuse-3.6.7-1.el7.x86_64
glusterfs-libs-3.6.7-1.el7.x86_64
glusterfs-rdma-3.6.7-1.el7.x86_64
libvirt-client-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.5.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.5.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
vdsm-4.16.30-0.el7.centos.x86_64
vdsm-cli-4.16.30-0.el7.centos.noarch
vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
vdsm-python-4.16.30-0.el7.centos.noarch
vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch

VM's were in a paused state, with errors in UI:

2015-Dec-22, 15:06
VM pcic-apps has paused due to unknown storage error.
2015-Dec-22, 15:06
Host compute2 is not responding. It will stay in Connecting state for
a grace period of 82 seconds and after that an attempt to fence the
host will be issued.
2015-Dec-22, 15:03
Invalid status on Data Center EDC2. Setting Data Center status to Non
Responsive (On host compute2, Error: General Exception).
2015-Dec-22, 15:03
VM pcic-storage has paused due to unknown storage error.
2015-Dec-22, 15:03
VM docker1 has paused due to unknown storage error.

VDSM logs look normal until:
Dummy-99::DEBUG::2015-12-22
23:03:58,949::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22
23:03:58,963::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0
MB) copied, 0.00350501 s, 292 MB/s\n'; <rc> = 0
VM Channels Listener::INFO::2015-12-22
23:03:59,527::guestagent::180::vm.Vm::(_handleAPIVersion)
vmId=`7067679e-43aa-43c0-b263-b0a711ade2e2`::Guest API version changed
from 2 to 1
Thread-245428::DEBUG::2015-12-22
23:03:59,718::libvirtconnection::151::root::(wrapper) Unknown
libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found:
Requested metadata element is not present
libvirtEventLoop::INFO::2015-12-22
23:04:00,447::vm::4982::vm.Vm::(_onIOError)
vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device
virtio-disk0 error eother
libvirtEventLoop::DEBUG::2015-12-22
23:04:00,447::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent)
vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::event Suspended detail 2
opaque None
libvirtEventLoop::INFO::2015-12-22
23:04:00,447::vm::4982::vm.Vm::(_onIOError)
vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device
virtio-disk0 error eother
...
libvirtEventLoop::INFO::2015-12-22
23:04:00,843::vm::4982::vm.Vm::(_onIOError)
vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::abnormal vm stop device
virtio-disk1 error eother
libvirtEventLoop::DEBUG::2015-12-22
23:04:00,844::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent)
vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::event Suspended detail 2
opaque None
Dummy-99::DEBUG::2015-12-22
23:04:00,973::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22
23:04:00,983::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
FAILED: <err> = "dd: failed to open
'/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox':
Transport endpoint is not connected\n"; <rc> = 1
Dummy-99::ERROR::2015-12-22
23:04:00,983::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run)
Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read
mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
Dummy-99::DEBUG::2015-12-22
23:04:02,987::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22
23:04:02,994::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail)
FAILED: <err> = "dd: failed to open
'/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox':
Transport endpoint is not connected\n"; <rc> = 1
Dummy-99::ERROR::2015-12-22
23:04:02,994::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run)
Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read
mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
Thread-563692::DEBUG::2015-12-22
23:04:03,539::__init__::481::jsonrpc.JsonRpcServer::(_serveRequest)
Calling 'StoragePool.getSpmStatus' in bridge with {u'storagepoolID':
u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff'}
Thread-563692::DEBUG::2015-12-22
23:04:03,540::task::595::Storage.TaskManager.Task::(_updateState)
Task=`6121360f-af29-48cf-8a8c-957480472a9d`::moving from state init ->
state preparing
Thread-563692::INFO::2015-12-22
23:04:03,540::logUtils::44::dispatcher::(wrapper) Run and protect:
getSpmStatus(spUUID=u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff',
options=None)
Thread-563692::ERROR::2015-12-22
23:04:03,541::task::866::Storage.TaskManager.Task::(_setError)
Task=`6121360f-af29-48cf-8a8c-957480472a9d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 633, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 627, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 126, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 416, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 515, in inquireClusterLock
    return self._clusterLock.inquire()
  File "/usr/share/vdsm/storage/clusterlock.py", line 321, in inquire
    resource = sanlock.read_resource(self._leasesPath, SDM_LEASE_OFFSET)
SanlockException: (107, 'Sanlock resource read failure', 'Transport
endpoint is not connected')
...

I noticed the gluster mount point was still shown running 'mount' but
when I tried to ls the mount point I got a "transport endpoint is not
connected" error.

I haven't been able to find anything more interested in the logs, and
the journal is blank before the recent host restart.

In the gluster client logs from the host I see:
...
[2015-12-22 22:52:23.238369] W [fuse-bridge.c:1261:fuse_err_cbk]
0-glusterfs-fuse: 31263720: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1
(No dat
a available)
[2015-12-22 22:57:23.778018] W [fuse-bridge.c:1261:fuse_err_cbk]
0-glusterfs-fuse: 31316031: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1
(No dat
a available)
[2015-12-22 23:02:24.376491] W [fuse-bridge.c:1261:fuse_err_cbk]
0-glusterfs-fuse: 31473552: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1
(No dat
a available)
pending frames:
frame : type(1) op(FSTAT)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(0) op(0)
frame : type(0) op(0)
...
frame : type(1) op(READ)
frame : type(0) op(0)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash:
2015-12-22 23:04:00
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.7
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f0d091f6392]
/lib64/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f0d0920d88d]
/lib64/libc.so.6(+0x35650)[0x7f0d0820f650]
/lib64/libc.so.6(gsignal+0x37)[0x7f0d0820f5d7]
/lib64/libc.so.6(abort+0x148)[0x7f0d08210cc8]
/lib64/libc.so.6(+0x75e07)[0x7f0d0824fe07]
/lib64/libc.so.6(+0x7d1fd)[0x7f0d082571fd]
/usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client_local_wipe+0x39)[0x7f0cfe8acdf9]
/usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client3_3_readv_cbk+0x487)[0x7f0cfe8c1197]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f0d08fca100]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x174)[0x7f0d08fca374]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f0d08fc62c3]
/usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0x8790)[0x7f0d047f3790]
/usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0xaf84)[0x7f0d047f5f84]
/lib64/libglusterfs.so.0(+0x767c2)[0x7f0d0924b7c2]
/usr/sbin/glusterfs(main+0x502)[0x7f0d096a0fe2]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d081fbaf5]
/usr/sbin/glusterfs(+0x6381)[0x7f0d096a1381]

In /var/log/messages:
...
Dec 22 23:00:49 compute2 journal: metadata not found: Requested
metadata element is not present
Dec 22 23:00:59 compute2 journal: metadata not found: Requested
metadata element is not present
Dec 22 23:01:01 compute2 systemd: Created slice user-0.slice.
Dec 22 23:01:01 compute2 systemd: Starting Session 1316 of user root.
Dec 22 23:01:01 compute2 systemd: Started Session 1316 of user root.
Dec 22 23:01:01 compute2 systemd: Failed to reset devices.list on
/machine.slice: Invalid argument
Dec 22 23:01:03 compute2 journal: metadata not found: Requested
metadata element is not present
...
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: pending
frames:
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
type(1) op(FSTAT)
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
type(1) op(READ)
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
type(1) op(READ)
...
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
type(1) op(OPEN)
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame :
type(0) op(0)
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
patchset: git://git.gluster.com/glusterfs.git
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: signal
received: 6
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: time of
crash:
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
2015-12-22 23:04:00
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
configuration details:
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: argp 1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
backtrace 1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: dlfcn 1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
libpthread 1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
llistxattr 1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: setfsid
1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
spinlock 1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: epoll.h
1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: xattr.h
1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
st_atim.tv_nsec 1
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
package-string: glusterfs 3.6.7
Dec 22 23:04:00 compute2
rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]:
---------
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944341
[948]: open error -107
/rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
[944]: ci 8 fd 17 pid -1 recv errno 104
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
[949]: open error -107
/rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
[944]: ci 9 fd 18 pid -1 recv errno 104
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
[948]: open error -1
/rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342
[948]: r1 release_token open error -107
Dec 22 23:04:03 compute2 journal: vdsm root ERROR Panic: Unrecoverable
errors during SPM stop process.
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 414, in stopSpm
    self.masterDomain.releaseClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 512, in releaseClusterLock
    self._clusterLock.release()
  File "/usr/share/vdsm/storage/clusterlock.py", line 342, in release
    raise se.ReleaseLockFailure(self._sdUUID, e)
ReleaseLockFailure: Cannot release lock:
(u'a5a83df1-47e2-4927-9add-079199ca7ef8', SanlockException(107,
'Sanlock resource not released', 'Transport endpoint is not
connected'))
Dec 22 23:04:03 compute2 journal: End of file while reading data:
Input/output error
Dec 22 23:04:03 compute2 journal: End of file while reading data:
Input/output error
Dec 22 23:04:03 compute2 systemd: vdsmd.service: main process exited,
code=killed, status=9/KILL
Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on
/machine.slice: Invalid argument
Dec 22 23:04:03 compute2 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Dec 22 23:04:03 compute2 systemd: Unit vdsmd.service entered failed state.
Dec 22 23:04:03 compute2 systemd: vdsmd.service holdoff time over,
scheduling restart.
Dec 22 23:04:03 compute2 systemd: Stopping Virtual Desktop Server Manager...
Dec 22 23:04:03 compute2 systemd: Starting Virtual Desktop Server Manager...
Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on
/machine.slice: Invalid argument

At this point I tried restarting vdsmd but ended up having to restart
the host. There was network connectivity to gluster before I
restarted, and I was able to manually mount the vm-storage volume on
the host and ls without issue.

After restarting the host I'm up and running again but I'm sure I'll
get hit by this again if I can't sort out what the issue is.

Steve Dainard

Sahina Bose

Ravishankar N

Steve Dainard

Ravishankar N

tags

participants (3)