oVirt 3.5 host gluster storage connection failure

I have two hosts, only one of them was running VM's at the time of this crash so I can't tell if this is a node specific problem.

rpm -qa | egrep -i 'gluster|vdsm|libvirt' | sort
glusterfs-3.6.7-1.el7.x86_64
glusterfs-api-3.6.7-1.el7.x86_64
glusterfs-cli-3.6.7-1.el7.x86_64
glusterfs-fuse-3.6.7-1.el7.x86_64
glusterfs-libs-3.6.7-1.el7.x86_64
glusterfs-rdma-3.6.7-1.el7.x86_64
libvirt-client-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.5.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.5.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
vdsm-4.16.30-0.el7.centos.x86_64
vdsm-cli-4.16.30-0.el7.centos.noarch
vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
vdsm-python-4.16.30-0.el7.centos.noarch
vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch

VM's were in a paused state, with errors in UI:

2015-Dec-22, 15:06 VM pcic-apps has paused due to unknown storage error.
2015-Dec-22, 15:06 Host compute2 is not responding. It will stay in Connecting state for a grace period of 82 seconds and after that an attempt to fence the host will be issued.
2015-Dec-22, 15:03 Invalid status on Data Center EDC2. Setting Data Center status to Non Responsive (On host compute2, Error: General Exception).
2015-Dec-22, 15:03 VM pcic-storage has paused due to unknown storage error.
2015-Dec-22, 15:03 VM docker1 has paused due to unknown storage error.

VDSM logs look normal until:

Dummy-99::DEBUG::2015-12-22 23:03:58,949::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22 23:03:58,963::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00350501 s, 292 MB/s\n'; <rc> = 0
VM Channels Listener::INFO::2015-12-22 23:03:59,527::guestagent::180::vm.Vm::(_handleAPIVersion) vmId=`7067679e-43aa-43c0-b263-b0a711ade2e2`::Guest API version changed from 2 to 1
Thread-245428::DEBUG::2015-12-22 23:03:59,718::libvirtconnection::151::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
libvirtEventLoop::INFO::2015-12-22 23:04:00,447::vm::4982::vm.Vm::(_onIOError) vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device virtio-disk0 error eother
libvirtEventLoop::DEBUG::2015-12-22 23:04:00,447::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2015-12-22 23:04:00,447::vm::4982::vm.Vm::(_onIOError) vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device virtio-disk0 error eother
...
libvirtEventLoop::INFO::2015-12-22 23:04:00,843::vm::4982::vm.Vm::(_onIOError) vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::DEBUG::2015-12-22 23:04:00,844::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::event Suspended detail 2 opaque None
Dummy-99::DEBUG::2015-12-22 23:04:00,973::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22 23:04:00,983::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) FAILED: <err> = "dd: failed to open '/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox': Transport endpoint is not connected\n"; <rc> = 1
Dummy-99::ERROR::2015-12-22 23:04:00,983::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
Dummy-99::DEBUG::2015-12-22 23:04:02,987::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22 23:04:02,994::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) FAILED: <err> = "dd: failed to open '/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox': Transport endpoint is not connected\n"; <rc> = 1
Dummy-99::ERROR::2015-12-22 23:04:02,994::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
Thread-563692::DEBUG::2015-12-22 23:04:03,539::__init__::481::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.getSpmStatus' in bridge with {u'storagepoolID': u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff'}
Thread-563692::DEBUG::2015-12-22 23:04:03,540::task::595::Storage.TaskManager.Task::(_updateState) Task=`6121360f-af29-48cf-8a8c-957480472a9d`::moving from state init -> state preparing
Thread-563692::INFO::2015-12-22 23:04:03,540::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID=u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff', options=None)
Thread-563692::ERROR::2015-12-22 23:04:03,541::task::866::Storage.TaskManager.Task::(_setError) Task=`6121360f-af29-48cf-8a8c-957480472a9d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 633, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 627, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 126, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 416, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 515, in inquireClusterLock
    return self._clusterLock.inquire()
  File "/usr/share/vdsm/storage/clusterlock.py", line 321, in inquire
    resource = sanlock.read_resource(self._leasesPath, SDM_LEASE_OFFSET)
SanlockException: (107, 'Sanlock resource read failure', 'Transport endpoint is not connected')
...

I noticed the gluster mount point was still shown when running 'mount', but when I tried to ls the mount point I got a "transport endpoint is not connected" error.

I haven't been able to find anything more interesting in the logs, and the journal is blank before the recent host restart.

In the gluster client logs from the host I see:

...
[2015-12-22 22:52:23.238369] W [fuse-bridge.c:1261:fuse_err_cbk] 0-glusterfs-fuse: 31263720: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1 (No data available)
[2015-12-22 22:57:23.778018] W [fuse-bridge.c:1261:fuse_err_cbk] 0-glusterfs-fuse: 31316031: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1 (No data available)
[2015-12-22 23:02:24.376491] W [fuse-bridge.c:1261:fuse_err_cbk] 0-glusterfs-fuse: 31473552: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1 (No data available)
pending frames:
frame : type(1) op(FSTAT)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(0) op(0)
frame : type(0) op(0)
...
frame : type(1) op(READ)
frame : type(0) op(0)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2015-12-22 23:04:00
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.7
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f0d091f6392]
/lib64/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f0d0920d88d]
/lib64/libc.so.6(+0x35650)[0x7f0d0820f650]
/lib64/libc.so.6(gsignal+0x37)[0x7f0d0820f5d7]
/lib64/libc.so.6(abort+0x148)[0x7f0d08210cc8]
/lib64/libc.so.6(+0x75e07)[0x7f0d0824fe07]
/lib64/libc.so.6(+0x7d1fd)[0x7f0d082571fd]
/usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client_local_wipe+0x39)[0x7f0cfe8acdf9]
/usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client3_3_readv_cbk+0x487)[0x7f0cfe8c1197]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f0d08fca100]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x174)[0x7f0d08fca374]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f0d08fc62c3]
/usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0x8790)[0x7f0d047f3790]
/usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0xaf84)[0x7f0d047f5f84]
/lib64/libglusterfs.so.0(+0x767c2)[0x7f0d0924b7c2]
/usr/sbin/glusterfs(main+0x502)[0x7f0d096a0fe2]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d081fbaf5]
/usr/sbin/glusterfs(+0x6381)[0x7f0d096a1381]

In /var/log/messages:

...
Dec 22 23:00:49 compute2 journal: metadata not found: Requested metadata element is not present
Dec 22 23:00:59 compute2 journal: metadata not found: Requested metadata element is not present
Dec 22 23:01:01 compute2 systemd: Created slice user-0.slice.
Dec 22 23:01:01 compute2 systemd: Starting Session 1316 of user root.
Dec 22 23:01:01 compute2 systemd: Started Session 1316 of user root.
Dec 22 23:01:01 compute2 systemd: Failed to reset devices.list on /machine.slice: Invalid argument
Dec 22 23:01:03 compute2 journal: metadata not found: Requested metadata element is not present
...
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: pending frames:
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(FSTAT)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(READ)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(READ)
...
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(OPEN)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(0) op(0)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: patchset: git://git.gluster.com/glusterfs.git
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: signal received: 6
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: time of crash:
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: 2015-12-22 23:04:00
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: configuration details:
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: argp 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: backtrace 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: dlfcn 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: libpthread 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: llistxattr 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: setfsid 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: spinlock 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: epoll.h 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: xattr.h 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: st_atim.tv_nsec 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: package-string: glusterfs 3.6.7
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: ---------
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944341 [948]: open error -107 /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [944]: ci 8 fd 17 pid -1 recv errno 104
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [949]: open error -107 /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [944]: ci 9 fd 18 pid -1 recv errno 104
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [948]: open error -1 /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [948]: r1 release_token open error -107
Dec 22 23:04:03 compute2 journal: vdsm root ERROR Panic: Unrecoverable errors during SPM stop process.
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 414, in stopSpm
    self.masterDomain.releaseClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 512, in releaseClusterLock
    self._clusterLock.release()
  File "/usr/share/vdsm/storage/clusterlock.py", line 342, in release
    raise se.ReleaseLockFailure(self._sdUUID, e)
ReleaseLockFailure: Cannot release lock: (u'a5a83df1-47e2-4927-9add-079199ca7ef8', SanlockException(107, 'Sanlock resource not released', 'Transport endpoint is not connected'))
Dec 22 23:04:03 compute2 journal: End of file while reading data: Input/output error
Dec 22 23:04:03 compute2 journal: End of file while reading data: Input/output error
Dec 22 23:04:03 compute2 systemd: vdsmd.service: main process exited, code=killed, status=9/KILL
Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on /machine.slice: Invalid argument
Dec 22 23:04:03 compute2 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Dec 22 23:04:03 compute2 systemd: Unit vdsmd.service entered failed state.
Dec 22 23:04:03 compute2 systemd: vdsmd.service holdoff time over, scheduling restart.
Dec 22 23:04:03 compute2 systemd: Stopping Virtual Desktop Server Manager...
Dec 22 23:04:03 compute2 systemd: Starting Virtual Desktop Server Manager...
Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on /machine.slice: Invalid argument

At this point I tried restarting vdsmd but ended up having to restart the host. There was network connectivity to gluster before I restarted, and I was able to manually mount the vm-storage volume on the host and ls without issue.

After restarting the host I'm up and running again but I'm sure I'll get hit by this again if I can't sort out what the issue is.
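(For anyone else hitting this: my understanding is that when the glusterfs fuse client dies like this, the mount point is left behind in that "Transport endpoint is not connected" state, and it can usually be cleared with a lazy unmount and a remount instead of a full host reboot. Roughly something like the below - the /mnt/test directory is only an example, and I haven't verified how safely this plays with vdsm still holding the old mount:)

# is the fuse client for the volume still running, and is the mount still listed?
ps aux | grep '[g]lusterfs.*vm-storage'
mount | grep vm-storage

# detach the dead mount point left behind by the crashed client
umount -l /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage

# confirm the volume itself is still reachable by mounting it somewhere else
mkdir -p /mnt/test
mount -t glusterfs 10.0.231.50:/vm-storage /mnt/test
ls /mnt/test
umount /mnt/test

After that I'd still expect to have to re-activate the host or storage domain from the engine so vdsm re-creates its own mount, but that part is a guess on my end.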

[+ Ravi, Pranith]
Could you provide the gluster mount logs from the client, and the gluster brick logs from the gluster servers?
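(In case it saves a lookup: on a default install these should be under /var/log/glusterfs/. The client mount log is named after the mount point - judging from the process name in your /var/log/messages it should be the first path below - and the brick logs live on each gluster server; <brick-path> is a placeholder for your actual brick directory:)

# on the oVirt host (fuse client):
/var/log/glusterfs/rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage.log
# on each gluster server:
/var/log/glusterfs/bricks/<brick-path>.log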

On 12/23/2015 11:44 AM, Sahina Bose wrote:
Could you provide the gluster mount logs from the client, and the gluster brick logs from the gluster servers? Also, do you have a core file of the crash?
-Ravi

I've attached the client gluster log starting at the first log of the same day as the failure.

The brick logs are all 0 file size on all of the replica 3 nodes... I just set 'gluster volume set vm-storage diagnostics.brick-log-level WARNING' but I'm not immediately seeing any logging to disk (see the commands at the end of this mail).

I've also attached the compute1 vdsm.log file, which over the same time period is able to dd successfully, so perhaps this discounts a storage side issue? I've also attached compute2 (failed node) for comparison.

Ravi - I'm not familiar with core files, would this be in a non-devel version of gluster? Or is this something I can enable? I don't mind enabling it now if it could help diagnose a future issue.
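(Re the brick logs above: to double-check that the log-level option actually took, and where the brick logs are supposed to show up, I'll look at something like the following on one of the gluster servers - the brick log filename in the last line is only an example from my setup:)

# the reconfigured option should show up under "Options Reconfigured":
gluster volume info vm-storage

# brick logs are written on the gluster servers, not on the oVirt hosts:
ls -l /var/log/glusterfs/bricks/
tail /var/log/glusterfs/bricks/gluster-brick1-vm-storage.log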

On 12/23/2015 11:00 PM, Steve Dainard wrote:
I've attached the client gluster log starting at the first log of the same day as failure.

Nothing significant in the client log after the crash and subsequent remount. The ENODATA warnings can be ignored. There was a patch (http://review.gluster.org/#/c/12015/) to change the log level; let me check if it has made it to a recent release version.
The brick logs are all 0 file size on all of the replica 3 nodes... I just set 'gluster volume set vm-storage diagnostics.brick-log-level WARNING' but I'm not immediately seeing any logging to disk.
I've also attached the compute1 vdsm.log file, which over the same time period is able to dd successfully so perhaps this discounts a storage side issue? I've also attached compute2 (failed node) for comparison.
Ravi - I'm not familiar with core files, would this be in a non-devel version of gluster? Or is this something I can enable? I don't mind enabling it now if it could help diagnose a future issue.
You should get the core files even on the non-devel versions. I think the method to enable core files and their location is distribution specific (depending on whether it uses abrt, systemd etc.), but you can check the ulimit and /proc/sys/kernel/core_pattern settings in general. On my Fedora 23 machine, I get them at /core.<pid of the process>. The timestamp from the logs should also help in identifying the core file:

frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2015-12-22 23:04:00

Regards,
Ravi
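P.S. On a CentOS 7 host I would roughly check the following (from memory, so treat it as a sketch rather than exact steps):

# what the kernel does with cores; on EL7 with abrt installed this usually points at abrt-hook-ccpp
cat /proc/sys/kernel/core_pattern

# if abrt is catching them, the crash should be listed here
abrt-cli list

# current core size limit for your shell (does not apply to already-running daemons,
# which inherit their limit from whatever started them, e.g. systemd's LimitCORE)
ulimit -c

# to collect plain core files instead, allow cores and pick a location, e.g.:
ulimit -c unlimited
echo '/var/tmp/core.%e.%p.%t' > /proc/sys/kernel/core_pattern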
participants (3)
- Ravishankar N
- Sahina Bose
- Steve Dainard