[ovirt-devel] Sanlock - "Add_lockspace fail" - WAS: "Please activate the master Storage Domain first"

Christopher Pereira kripper at imatronix.cl
Fri May 1 20:51:40 UTC 2015


Hi Nir / Liron,

On 29-04-2015 16:41, Nir Soffer wrote:
> [...]
> Your probably have storage issues, revealed by sanlock - it reads and 
> write to all storage domains every 10 seconds, so flaky storage will 
> cause failure to acquire a host id. Please attach to the bug these 
> logs: Hypervisor: /var/log/sanlock.log /var/log/messages 
> /var/log/glusterfs/<glusterhost>:_<volumename>.log Gluster server: 
> server logs showing what happened when sanlock failed access the 
> gluster volume. 

We have 3 hosts with a 1 GbE LAN connection (idle) so it's probably soft 
related.

On host H5, we are constantly receiving sanlock warnings:

    May  1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118
    [643]: s909 add_lockspace fail result -5
    May  1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128
    [12892]: read_sectors delta_leader offset 512 rv -5
    /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids

I thought it was normal, but want to be sure.

A Gluster Statedump reveals this locks:

    [xlator.features.locks.vdisks-locks.inode]
    path=/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
    mandatory=0
    conn.1.id=<Host H5>-3016-2015/05/01-17:54:57:109200-vdisks-client-0-0-0
    conn.1.ref=1
    conn.1.bound_xl=/mnt/disk1/gluster-bricks/vdisks
    conn.2.id=<Host H6>-3369-2015/04/30-05:40:59:928550-vdisks-client-0-0-0
    conn.2.ref=1
    conn.2.bound_xl=/mnt/disk1/gluster-bricks/vdisks
    conn.3.id=<Host H4>-31780-2015/04/30-05:57:15:152009-vdisks-client-0-0-0
    conn.3.ref=1
    conn.3.bound_xl=/mnt/disk1/gluster-bricks/vdisks
    conn.4.id=<Host H6>-16034-2015/04/30-16:40:26:355759-vdisks-client-0-0-0
    conn.4.ref=1
    conn.4.bound_xl=/mnt/disk1/gluster-bricks/vdisks

The first lease is hold by sanlock:

    # ps aux | grep 637
    sanlock    637  0.0  0.0 545240 19172 ?        SLsl 14:54   0:01
    sanlock daemon -U sanlock -G sanlock

Host is up since 14:54:39, so this lease was taken after the boot.

Threads:

            |-sanlock(637)-+-sanlock-helper(641)
            |              |-{sanlock}(642)
            |              |-{sanlock}(643)
            |              |-{sanlock}(644)
            |              |-{sanlock}(645)
            |              |-{sanlock}(3281)
            |              `-{sanlock}(3428)

Strace:

    h5:~# strace -p 637
    Process 637 attached
    restart_syscall(<... resuming interrupted call ...>) = 0
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
    accept(7, 0, NULL)                      = 11
    setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
    events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}])
    recvfrom(11, "\20
    (\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32,
    MSG_WAITALL, NULL, NULL) = 32
    futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428,
    {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
    events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11,
    revents=POLLIN|POLLHUP}])
    recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
    close(11)                               = 0
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
    accept(7, 0, NULL)                      = 11
    setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
    events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11, revents=POLLIN}])
    recvfrom(11, "\20
    (\4\1\0\0\0\2\0\0\0\1\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32,
    MSG_WAITALL, NULL, NULL) = 32
    futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428,
    {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    futex(0x7f0f3fa97640, FUTEX_WAKE_PRIVATE, 1) = 1
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
    events=POLLIN}, {fd=-1}], 4, 1000) = 1 ([{fd=11,
    revents=POLLIN|POLLHUP}])
    recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
    close(11)                               = 0
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000) = 1 ([{fd=7, revents=POLLIN}])
    accept(7, 0, NULL)                      = 11
    setsockopt(11, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
    events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11, revents=POLLIN}])
    recvfrom(11, "\20
    (\4\1\0\0\0\20\0\0\0\0\0\0\0p\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32,
    MSG_WAITALL, NULL, NULL) = 32
    futex(0x7f0f3f99442c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f0f3f994428,
    {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7f0f3f994400, FUTEX_WAKE_PRIVATE, 1) = 1
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11,
    events=POLLIN}, {fd=-1}], 4, 824) = 1 ([{fd=11,
    revents=POLLIN|POLLHUP}])
    recvfrom(11, "", 32, MSG_WAITALL, NULL, NULL) = 0
    close(11)                               = 0
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 824) = 0 (Timeout)
    poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1},
    {fd=-1}], 4, 1000^CProcess 637 detached

Maybe I'm doing something wrong?
Do you have any idea?

The funny thing is that this CentOS 7 host is getting constantly in a 
state where the whole root FS is not accesible any more.
All you have is a bash returning I/O errors for any stored command (eg: 
"ls").
I thought it was hardware related (2 x 500 GB SSD disks), but maybe we 
are lucky and found something kernel related.

Best regards,

Christopher.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ovirt.org/pipermail/devel/attachments/20150501/69f66740/attachment-0001.html>


More information about the Devel mailing list