
[+ Ravi, Pranith] On 12/23/2015 06:00 AM, Steve Dainard wrote:
I have two hosts; only one of them was running VMs at the time of this crash, so I can't tell whether this is a node-specific problem.
rpm -qa | egrep -i 'gluster|vdsm|libvirt' |sort
glusterfs-3.6.7-1.el7.x86_64
glusterfs-api-3.6.7-1.el7.x86_64
glusterfs-cli-3.6.7-1.el7.x86_64
glusterfs-fuse-3.6.7-1.el7.x86_64
glusterfs-libs-3.6.7-1.el7.x86_64
glusterfs-rdma-3.6.7-1.el7.x86_64
libvirt-client-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.5.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.5.x86_64
libvirt-lock-sanlock-1.2.8-16.el7_1.5.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
vdsm-4.16.30-0.el7.centos.x86_64
vdsm-cli-4.16.30-0.el7.centos.noarch
vdsm-jsonrpc-4.16.30-0.el7.centos.noarch
vdsm-python-4.16.30-0.el7.centos.noarch
vdsm-python-zombiereaper-4.16.30-0.el7.centos.noarch
vdsm-xmlrpc-4.16.30-0.el7.centos.noarch
vdsm-yajsonrpc-4.16.30-0.el7.centos.noarch
VMs were in a paused state, with these errors in the UI:
2015-Dec-22, 15:06 VM pcic-apps has paused due to unknown storage error.
2015-Dec-22, 15:06 Host compute2 is not responding. It will stay in Connecting state for a grace period of 82 seconds and after that an attempt to fence the host will be issued.
2015-Dec-22, 15:03 Invalid status on Data Center EDC2. Setting Data Center status to Non Responsive (On host compute2, Error: General Exception).
2015-Dec-22, 15:03 VM pcic-storage has paused due to unknown storage error.
2015-Dec-22, 15:03 VM docker1 has paused due to unknown storage error.
VDSM logs look normal until:

Dummy-99::DEBUG::2015-12-22 23:03:58,949::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22 23:03:58,963::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00350501 s, 292 MB/s\n'; <rc> = 0
VM Channels Listener::INFO::2015-12-22 23:03:59,527::guestagent::180::vm.Vm::(_handleAPIVersion) vmId=`7067679e-43aa-43c0-b263-b0a711ade2e2`::Guest API version changed from 2 to 1
Thread-245428::DEBUG::2015-12-22 23:03:59,718::libvirtconnection::151::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present
libvirtEventLoop::INFO::2015-12-22 23:04:00,447::vm::4982::vm.Vm::(_onIOError) vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device virtio-disk0 error eother
libvirtEventLoop::DEBUG::2015-12-22 23:04:00,447::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::event Suspended detail 2 opaque None
libvirtEventLoop::INFO::2015-12-22 23:04:00,447::vm::4982::vm.Vm::(_onIOError) vmId=`376e98b7-7798-46e8-be03-5dddf6cfb54f`::abnormal vm stop device virtio-disk0 error eother
...
libvirtEventLoop::INFO::2015-12-22 23:04:00,843::vm::4982::vm.Vm::(_onIOError) vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::abnormal vm stop device virtio-disk1 error eother
libvirtEventLoop::DEBUG::2015-12-22 23:04:00,844::vm::5666::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`97fbbf97-944b-4b77-b0bf-6a831f9090d8`::event Suspended detail 2 opaque None
Dummy-99::DEBUG::2015-12-22 23:04:00,973::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22 23:04:00,983::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) FAILED: <err> = "dd: failed to open '/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox': Transport endpoint is not connected\n"; <rc> = 1
Dummy-99::ERROR::2015-12-22 23:04:00,983::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
Dummy-99::DEBUG::2015-12-22 23:04:02,987::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-99::DEBUG::2015-12-22 23:04:02,994::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) FAILED: <err> = "dd: failed to open '/rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox': Transport endpoint is not connected\n"; <rc> = 1
Dummy-99::ERROR::2015-12-22 23:04:02,994::storage_mailbox::787::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 785, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 734, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/f72ec125-69a1-4c1b-a5e1-313fcb70b6ff/mastersd/dom_md/inbox
Thread-563692::DEBUG::2015-12-22 23:04:03,539::__init__::481::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.getSpmStatus' in bridge with {u'storagepoolID': u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff'}
Thread-563692::DEBUG::2015-12-22 23:04:03,540::task::595::Storage.TaskManager.Task::(_updateState) Task=`6121360f-af29-48cf-8a8c-957480472a9d`::moving from state init -> state preparing
Thread-563692::INFO::2015-12-22 23:04:03,540::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID=u'f72ec125-69a1-4c1b-a5e1-313fcb70b6ff', options=None)
Thread-563692::ERROR::2015-12-22 23:04:03,541::task::866::Storage.TaskManager.Task::(_setError) Task=`6121360f-af29-48cf-8a8c-957480472a9d`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 633, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/share/vdsm/storage/hsm.py", line 627, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/share/vdsm/storage/sp.py", line 126, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/share/vdsm/storage/spbackends.py", line 416, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 515, in inquireClusterLock
    return self._clusterLock.inquire()
  File "/usr/share/vdsm/storage/clusterlock.py", line 321, in inquire
    resource = sanlock.read_resource(self._leasesPath, SDM_LEASE_OFFSET)
SanlockException: (107, 'Sanlock resource read failure', 'Transport endpoint is not connected')
...
I noticed the gluster mount point was still listed in the output of 'mount', but when I tried to ls the mount point I got a "transport endpoint is not connected" error.
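That combination (still listed by 'mount' but every access returning "transport endpoint is not connected") is the usual sign that the FUSE client process has died while the kernel mount entry stays behind. A quick way to detect that state from a script, assuming nothing beyond the mount path itself, is to stat the mount root and look for errno 107 (ENOTCONN):

    #!/usr/bin/env python
    # Sketch: detect a stale ("transport endpoint is not connected") FUSE mount.
    # The mount path matches the one in the logs; adjust for your environment.
    import errno
    import os

    MOUNT = "/rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage"

    def mount_is_healthy(path=MOUNT):
        try:
            os.stat(path)
            return True
        except OSError as e:
            if e.errno == errno.ENOTCONN:   # 107: the fuse client is gone
                return False
            raise  # some other problem (permissions, missing path, ...)

    if __name__ == "__main__":
        print("healthy" if mount_is_healthy()
              else "stale mount: fuse client has exited")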
I haven't been able to find anything more interesting in the logs, and the journal has nothing from before the recent host restart.
In the gluster client logs from the host I see:

...
[2015-12-22 22:52:23.238369] W [fuse-bridge.c:1261:fuse_err_cbk] 0-glusterfs-fuse: 31263720: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1 (No data available)
[2015-12-22 22:57:23.778018] W [fuse-bridge.c:1261:fuse_err_cbk] 0-glusterfs-fuse: 31316031: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1 (No data available)
[2015-12-22 23:02:24.376491] W [fuse-bridge.c:1261:fuse_err_cbk] 0-glusterfs-fuse: 31473552: REMOVEXATTR() /__DIRECT_IO_TEST__ => -1 (No data available)
pending frames:
frame : type(1) op(FSTAT)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(1) op(READ)
frame : type(0) op(0)
frame : type(0) op(0)
...
frame : type(1) op(READ)
frame : type(0) op(0)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2015-12-22 23:04:00
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.7
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f0d091f6392]
/lib64/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f0d0920d88d]
/lib64/libc.so.6(+0x35650)[0x7f0d0820f650]
/lib64/libc.so.6(gsignal+0x37)[0x7f0d0820f5d7]
/lib64/libc.so.6(abort+0x148)[0x7f0d08210cc8]
/lib64/libc.so.6(+0x75e07)[0x7f0d0824fe07]
/lib64/libc.so.6(+0x7d1fd)[0x7f0d082571fd]
/usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client_local_wipe+0x39)[0x7f0cfe8acdf9]
/usr/lib64/glusterfs/3.6.7/xlator/protocol/client.so(client3_3_readv_cbk+0x487)[0x7f0cfe8c1197]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7f0d08fca100]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x174)[0x7f0d08fca374]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f0d08fc62c3]
/usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0x8790)[0x7f0d047f3790]
/usr/lib64/glusterfs/3.6.7/rpc-transport/socket.so(+0xaf84)[0x7f0d047f5f84]
/lib64/libglusterfs.so.0(+0x767c2)[0x7f0d0924b7c2]
/usr/sbin/glusterfs(main+0x502)[0x7f0d096a0fe2]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0d081fbaf5]
/usr/sbin/glusterfs(+0x6381)[0x7f0d096a1381]
Could you provide the gluster mount logs from the client, and the gluster brick logs from the gluster servers?
In /var/log/messages:

...
Dec 22 23:00:49 compute2 journal: metadata not found: Requested metadata element is not present
Dec 22 23:00:59 compute2 journal: metadata not found: Requested metadata element is not present
Dec 22 23:01:01 compute2 systemd: Created slice user-0.slice.
Dec 22 23:01:01 compute2 systemd: Starting Session 1316 of user root.
Dec 22 23:01:01 compute2 systemd: Started Session 1316 of user root.
Dec 22 23:01:01 compute2 systemd: Failed to reset devices.list on /machine.slice: Invalid argument
Dec 22 23:01:03 compute2 journal: metadata not found: Requested metadata element is not present
...
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: pending frames:
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(FSTAT)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(READ)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(READ)
...
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(1) op(OPEN)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: frame : type(0) op(0)
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: patchset: git://git.gluster.com/glusterfs.git
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: signal received: 6
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: time of crash:
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: 2015-12-22 23:04:00
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: configuration details:
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: argp 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: backtrace 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: dlfcn 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: libpthread 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: llistxattr 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: setfsid 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: spinlock 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: epoll.h 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: xattr.h 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: st_atim.tv_nsec 1
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: package-string: glusterfs 3.6.7
Dec 22 23:04:00 compute2 rhev-data-center-mnt-glusterSD-10.0.231.50:_vm-storage[14098]: ---------
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944341 [948]: open error -107 /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [944]: ci 8 fd 17 pid -1 recv errno 104
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [949]: open error -107 /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [944]: ci 9 fd 18 pid -1 recv errno 104
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [948]: open error -1 /rhev/data-center/mnt/glusterSD/10.0.231.50:_vm-storage/a5a83df1-47e2-4927-9add-079199ca7ef8/dom_md/leases
Dec 22 23:04:03 compute2 sanlock[944]: 2015-12-22 23:04:03+0000 944342 [948]: r1 release_token open error -107
Dec 22 23:04:03 compute2 journal: vdsm root ERROR Panic: Unrecoverable errors during SPM stop process.
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 414, in stopSpm
    self.masterDomain.releaseClusterLock()
  File "/usr/share/vdsm/storage/sd.py", line 512, in releaseClusterLock
    self._clusterLock.release()
  File "/usr/share/vdsm/storage/clusterlock.py", line 342, in release
    raise se.ReleaseLockFailure(self._sdUUID, e)
ReleaseLockFailure: Cannot release lock: (u'a5a83df1-47e2-4927-9add-079199ca7ef8', SanlockException(107, 'Sanlock resource not released', 'Transport endpoint is not connected'))
Dec 22 23:04:03 compute2 journal: End of file while reading data: Input/output error
Dec 22 23:04:03 compute2 journal: End of file while reading data: Input/output error
Dec 22 23:04:03 compute2 systemd: vdsmd.service: main process exited, code=killed, status=9/KILL
Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on /machine.slice: Invalid argument
Dec 22 23:04:03 compute2 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Dec 22 23:04:03 compute2 systemd: Unit vdsmd.service entered failed state.
Dec 22 23:04:03 compute2 systemd: vdsmd.service holdoff time over, scheduling restart.
Dec 22 23:04:03 compute2 systemd: Stopping Virtual Desktop Server Manager...
Dec 22 23:04:03 compute2 systemd: Starting Virtual Desktop Server Manager...
Dec 22 23:04:03 compute2 systemd: Failed to reset devices.list on /machine.slice: Invalid argument
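A small aside that may help when reading the sanlock lines: "open error -107" is the same failure as everywhere else in these logs, since errno 107 is ENOTCONN ("Transport endpoint is not connected"). A two-line check, just to confirm the mapping:

    import errno, os
    print(errno.errorcode[107], os.strerror(107))  # ENOTCONN, Transport endpoint is not connected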
At this point I tried restarting vdsmd, but ended up having to restart the host. There was network connectivity to gluster before I restarted, and I was able to manually mount the vm-storage volume on the host and run ls on it without issue.
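For what it's worth, that manual test can be scripted; this is only a rough sketch (server and volume name are taken from the mount path in the logs, the scratch directory is my own choice, and it needs to run as root):

    #!/usr/bin/env python
    # Sketch: mount the gluster volume on a scratch directory and list it,
    # i.e. the manual connectivity check described above. Needs root.
    import os
    import subprocess
    import tempfile

    SERVER = "10.0.231.50"   # from the mount path in the logs
    VOLUME = "vm-storage"    # from the mount path in the logs

    def test_mount():
        tmp = tempfile.mkdtemp(prefix="gluster-test-")
        try:
            subprocess.check_call(["mount", "-t", "glusterfs",
                                   "%s:/%s" % (SERVER, VOLUME), tmp])
            print("mounted OK, top-level entries: %s" % os.listdir(tmp))
        finally:
            subprocess.call(["umount", tmp])   # ignore errors if mount failed
            os.rmdir(tmp)

    if __name__ == "__main__":
        test_mount()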
After restarting the host I'm up and running again, but I'm sure I'll get hit by this again if I can't sort out what the issue is.

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users