Hi Nir / Liron,
On 29-04-2015 16:41, Nir Soffer wrote:
[...]
You probably have storage issues, revealed by sanlock - it reads and
writes to all storage domains every 10 seconds, so flaky storage will
cause a failure to acquire a host id. Please attach these logs to the
bug:
Hypervisor:
  /var/log/sanlock.log
  /var/log/messages
  /var/log/glusterfs/<glusterhost>:_<volumename>.log
Gluster server:
  server logs showing what happened when sanlock failed to access the
  gluster volume.
We have 3 hosts with a 1 GbE LAN connection (currently idle), so the
problem is probably software-related.
On host H5, we are constantly receiving sanlock warnings:
May 1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118
[643]: s909 add_lockspace fail result -5
May 1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128
[12892]: read_sectors delta_leader offset 512 rv -5
/rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
I thought this was normal, but I want to be sure.
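If it helps, I can also try reading the delta leader block directly from
the gluster mount, outside sanlock (a rough sketch, assuming sanlock is
doing direct I/O on the 512-byte sector at offset 512 that the error
above points at):

  h5:~# dd if=/rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids \
          of=/dev/null bs=512 skip=1 count=1 iflag=direct

If that also fails with an I/O error (rv -5 above is -EIO), it would
point at the gluster mount rather than at sanlock itself.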
A Gluster statedump reveals these locks:
[xlator.features.locks.vdisks-locks.inode]
path=/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
mandatory=0
conn.1.id=<Host H5>-3016-2015/05/01-17:54:57:109200-vdisks-client-0-0-0
conn.1.ref=1
conn.1.bound_xl=/mnt/disk1/gluster-bricks/vdisks
conn.2.id=<Host H6>-3369-2015/04/30-05:40:59:928550-vdisks-client-0-0-0
conn.2.ref=1
conn.2.bound_xl=/mnt/disk1/gluster-bricks/vdisks
conn.3.id=<Host H4>-31780-2015/04/30-05:57:15:152009-vdisks-client-0-0-0
conn.3.ref=1
conn.3.bound_xl=/mnt/disk1/gluster-bricks/vdisks
conn.4.id=<Host H6>-16034-2015/04/30-16:40:26:355759-vdisks-client-0-0-0
conn.4.ref=1
conn.4.bound_xl=/mnt/disk1/gluster-bricks/vdisks
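(In case the full dump is useful, I can regenerate and attach it; as far
as I know it is taken on the gluster server with

  gluster volume statedump vdisks

and the per-brick dump files land in the statedump directory,
/var/run/gluster by default.)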
The first lease is held by sanlock:
# ps aux | grep 637
sanlock 637 0.0 0.0 545240 19172 ? SLsl 14:54 0:01
sanlock daemon -U sanlock -G sanlock
The host has been up since 14:54:39, so this lease was taken after boot.
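If sanlock's own view of the lockspaces is useful, I can also attach the
output of the following (assuming I have the syntax right):

  h5:~# sanlock client status
  h5:~# sanlock direct dump /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids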
Threads:
|-sanlock(637)-+-sanlock-helper(641)
| |-{sanlock}(642)
| |-{sanlock}(643)
| |-{sanlock}(644)
| |-{sanlock}(645)
| |-{sanlock}(3281)
| `-{sanlock}(3428)
Strace:
h5:~# strace -p 637
Process 637 attached
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
accept(7, 0, NULL) = 11
setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}])
recvfrom(11, "\20
(\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32,
MSG_WAITALL, NULL, NULL) = 32
futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11,
revents=POLLIN|POLLHUP}])
recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
close(11) = 0
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
accept(7, 0, NULL) = 11
setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}])
recvfrom(11, "\20
(\4\1\0\0\0\2\0\0\0\1\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32,
MSG_WAITALL, NULL, NULL) = 32
futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
futex(0x7f0f3fa97640, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11,
revents=POLLIN|POLLHUP}])
recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
close(11) = 0
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
accept(7, 0, NULL) = 11
setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11, revents=POLLIN}])
recvfrom(11, "\20
(\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32,
MSG_WAITALL, NULL, NULL) = 32
futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11,
revents=POLLIN|POLLHUP}])
recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
close(11) = 0
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 824) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
{fd=-1}], 4, 1000^CProcess 637 detached
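I realize that strace -p only attaches to the main thread, which just
polls the control sockets here; the reads of the ids file should be
happening in the worker threads. If it helps I can re-run the trace
following all threads, something like:

  h5:~# strace -f -tt -p 637

so that the failing reads against the gluster mount actually show up.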
Am I doing something wrong?
Do you have any ideas?
The funny thing is that this CentOS 7 host constantly gets into a state
where the whole root FS is no longer accessible.
All you are left with is a bash shell that returns I/O errors for any
command that has to be loaded from disk (e.g. "ls").
I thought it was hardware-related (2 x 500 GB SSD disks), but maybe we
got lucky and found something kernel-related.
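The next time it happens I will try to capture the kernel log, either
from a console that is still alive or right after rebooting, for
example:

  h5:~# dmesg -T | tail -100
  h5:~# grep -iE 'i/o error|sd[a-z]' /var/log/messages

to see whether the SSDs or the filesystem layer logged errors before the
root FS went away; that should separate a local disk problem from the
gluster/sanlock one.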
Best regards,
Christopher.