
Hi Nir / Liron,

On 29-04-2015 16:41, Nir Soffer wrote:
> [...]
> You probably have storage issues, revealed by sanlock - it reads and
> writes to all storage domains every 10 seconds, so flaky storage will
> cause failure to acquire a host id.
>
> Please attach to the bug these logs:
>
> Hypervisor:
>   /var/log/sanlock.log
>   /var/log/messages
>   /var/log/glusterfs/<glusterhost>:_<volumename>.log
>
> Gluster server:
>   server logs showing what happened when sanlock failed to access the
>   gluster volume.
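As a client-side sanity check, the read sanlock keeps failing on can be approximated by hand - a rough sketch only (it reuses the "ids" path from the warnings below and does a direct, aligned 512-byte read at offset 512 via dd; this only approximates sanlock's I/O pattern, it is not exactly what sanlock does):

  # direct, aligned 512-byte read at offset 512 of the same "ids" file that
  # sanlock complains about, to see whether the gluster mount returns an
  # I/O error or simply stalls
  IDS=/rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
  time dd if="$IDS" of=/dev/null bs=512 skip=1 count=1 iflag=direct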
We have 3 hosts with a 1 GbE LAN connection (idle), so it's probably software related.

On host H5, we are constantly receiving sanlock warnings:

  May 1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118 [643]: s909 add_lockspace fail result -5
  May 1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128 [12892]: read_sectors delta_leader offset 512 rv -5 /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids

I thought this was normal, but want to be sure.
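If I read sanlock's convention right, the "result -5" / "rv -5" above is the negated errno, i.e. EIO, which would make these real I/O errors from the mount rather than lock contention - a quick way to confirm the errno mapping (assuming python is installed on the host):

  # errno 5 is EIO ("Input/output error"); works with python 2 or 3
  python -c 'import errno, os; print("%s - %s" % (errno.errorcode[5], os.strerror(5)))'
  # prints: EIO - Input/output error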
A Gluster statedump reveals these locks:

  [xlator.features.locks.vdisks-locks.inode]
  path=/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
  mandatory=0
  conn.1.id=<Host H5>-3016-2015/05/01-17:54:57:109200-vdisks-client-0-0-0
  conn.1.ref=1
  conn.1.bound_xl=/mnt/disk1/gluster-bricks/vdisks
  conn.2.id=<Host H6>-3369-2015/04/30-05:40:59:928550-vdisks-client-0-0-0
  conn.2.ref=1
  conn.2.bound_xl=/mnt/disk1/gluster-bricks/vdisks
  conn.3.id=<Host H4>-31780-2015/04/30-05:57:15:152009-vdisks-client-0-0-0
  conn.3.ref=1
  conn.3.bound_xl=/mnt/disk1/gluster-bricks/vdisks
  conn.4.id=<Host H6>-16034-2015/04/30-16:40:26:355759-vdisks-client-0-0-0
  conn.4.ref=1
  conn.4.bound_xl=/mnt/disk1/gluster-bricks/vdisks

The first lease is held by sanlock:

  # ps aux | grep 637
  sanlock 637 0.0 0.0 545240 19172 ? SLsl 14:54 0:01 sanlock daemon -U sanlock -G sanlock

The host has been up since 14:54:39, so this lease was taken after boot.

Threads:

  |-sanlock(637)-+-sanlock-helper(641)
  |              |-{sanlock}(642)
  |              |-{sanlock}(643)
  |              |-{sanlock}(644)
  |              |-{sanlock}(645)
  |              |-{sanlock}(3281)
  |              `-{sanlock}(3428)

Strace:

  h5:~# strace -p 637
  Process 637 attached
  restart_syscall(<... resuming interrupted call ...>) = 0
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
  accept(7, 0, NULL) = 11
  setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}])
  recvfrom(11, "\20 (\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_WAITALL, NULL, NULL) = 32
  futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
  futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN|POLLHUP}])
  recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
  close(11) = 0
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
  accept(7, 0, NULL) = 11
  setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}])
  recvfrom(11, "\20 (\4\1\0\0\0\2\0\0\0\1\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_WAITALL, NULL, NULL) = 32
  futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
  futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  futex(0x7f0f3fa97640, FUTEX_WAKE_PRIVATE, 1) = 1
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN|POLLHUP}])
  recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
  close(11) = 0
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
  accept(7, 0, NULL) = 11
  setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11, revents=POLLIN}])
  recvfrom(11, "\20 (\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_WAITALL, NULL, NULL) = 32
  futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
  futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11, revents=POLLIN|POLLHUP}])
  recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
  close(11) = 0
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 824) = 0 (Timeout)
  poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=-1}], 4, 1000^CProcess 637 detached

Maybe I'm doing something wrong? Do you have any idea?

The funny thing is that this CentOS 7 host constantly ends up in a state where the whole root FS is no longer accessible. All you have is a bash shell returning I/O errors for any on-disk command (e.g. "ls"). I thought it was hardware related (2 x 500 GB SSD disks), but maybe we got lucky and found something kernel related.

Best regards,

Christopher.