On 29-04-2015 16:41, Nir Soffer wrote:
> [...]
> You probably have storage issues, revealed by sanlock - it reads from and
> writes to all storage domains every 10 seconds, so flaky storage will
> cause a failure to acquire a host id. Please attach these logs to the
> bug: Hypervisor: /var/log/sanlock.log /var/log/messages
> /var/log/glusterfs/<glusterhost>:_<volumename>.log Gluster server:
> server logs showing what happened when sanlock failed to access the
> gluster volume.
Do you plan to provide the requested logs? Or maybe explain why they are
not needed?
We have 3 hosts with a 1 GbE LAN connection (idle), so it's probably
software related.
On host H5, we are constantly receiving sanlock warnings:
May 1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118
[643]: s909 add_lockspace fail result -5
May 1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128
[12892]: read_sectors delta_leader offset 512 rv -5
/rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
David, can you explain the meaning of "rv -5"?
I thought it was normal, but want to be sure.
Any error in sanlock is not normal, and getting one constantly is a bad sign.
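For what it's worth, if sanlock follows the common convention of returning negative errno values on failure, then "rv -5" would be -EIO, an I/O error on the storage. A minimal Python sketch to decode such values (the helper name is mine, not part of sanlock):

```python
# Sketch: interpreting sanlock's "rv -5" as a negative errno value.
# Assumption (not confirmed in this thread): sanlock returns -errno on failure.
import errno
import os

def describe_rv(rv):
    """Map a negative return value to its errno name and message."""
    if rv >= 0:
        return "success"
    code = -rv
    return "%s: %s" % (errno.errorcode.get(code, "unknown"), os.strerror(code))

print(describe_rv(-5))  # on Linux: EIO: Input/output error
```

An I/O error on the dom_md/ids file would be consistent with the flaky-storage theory above.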
A Gluster statedump reveals these locks:
[xlator.features.locks.vdisks-locks.inode]
path=/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
mandatory=0
conn.1.id=<Host H5>-3016-2015/05/01-17:54:57:109200-vdisks-client-0-0-0
conn.1.ref=1
conn.1.bound_xl=/mnt/disk1/gluster-bricks/vdisks
conn.2.id=<Host H6>-3369-2015/04/30-05:40:59:928550-vdisks-client-0-0-0
conn.2.ref=1
conn.2.bound_xl=/mnt/disk1/gluster-bricks/vdisks
conn.3.id=<Host H4>-31780-2015/04/30-05:57:15:152009-vdisks-client-0-0-0
conn.3.ref=1
conn.3.bound_xl=/mnt/disk1/gluster-bricks/vdisks
conn.4.id=<Host H6>-16034-2015/04/30-16:40:26:355759-vdisks-client-0-0-0
conn.4.ref=1
conn.4.bound_xl=/mnt/disk1/gluster-bricks/vdisks
All hosts are maintaining a host lease on the dom_md/ids file, so it is
expected that gluster detects all hosts accessing the ids file.
The host has been up since 14:54:39, so this lease was taken after boot.
A lease is lost after a reboot, or if the host cannot access the storage for
about 80 seconds (depending on sanlock configuration).
sanlock updates the lease every 10 seconds. If a lease is not updated, other
hosts consider the lease DEAD. On the machine holding the lease, sanlock
will terminate the process holding the lease, or reboot the machine if the
process cannot be terminated.
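The lifecycle described above can be sketched roughly like this (the constants come from the numbers in this thread; the function and names are hypothetical, not sanlock's actual implementation, which derives its timeouts from the configured io_timeout):

```python
# Rough sketch of the host-lease lifecycle described above.
# RENEWAL_INTERVAL and LEASE_EXPIRY are taken from this thread, not from sanlock source.
RENEWAL_INTERVAL = 10  # seconds between lease renewal writes to dom_md/ids
LEASE_EXPIRY = 80      # approx. age after which other hosts treat the lease as DEAD

def lease_state(last_renewal, now):
    """Classify a host lease by the age of its last successful renewal."""
    age = now - last_renewal
    if age < LEASE_EXPIRY:
        return "LIVE"
    # Other hosts may now take over the lease; on the holder, sanlock
    # kills the owning process, or reboots the host if the kill fails.
    return "DEAD"
```

So a few failed 10-second renewals are survivable, but roughly 80 seconds without storage access costs the host its lease.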
The funny thing is that this CentOS 7 host constantly gets into a state
where the whole root FS is no longer accessible.
All you have is a bash shell returning I/O errors for any command stored on
disk (e.g. "ls").
I thought it was hardware related (2 x 500 GB SSD disks), but maybe we
are lucky and found something kernel related.
And the local storage on this host is used for gluster brick - right?
If you are using replica 3, I would expect gluster to work reliably even when this
host loses access to its storage, but I guess that gluster is not designed for such
flaky storage.
Did you open a glusterfs bug for this?
You should fix this host so it does not lose access to its storage. I don't think
a setup where storage is constantly lost is useful for anything.
Nir