[ovirt-devel] Sanlock - "Add_lockspace fail" - WAS: "Please activate the master Storage Domain first"

Nir Soffer nsoffer at redhat.com
Sat May 2 12:27:57 UTC 2015


> On 29-04-2015 16:41, Nir Soffer wrote:
> > [...]
> > You probably have storage issues, revealed by sanlock - it reads from and
> > writes to all storage domains every 10 seconds, so flaky storage will
> > cause failures to acquire a host id. Please attach these logs to the
> > bug. Hypervisor: /var/log/sanlock.log, /var/log/messages,
> > /var/log/glusterfs/<glusterhost>:_<volumename>.log. Gluster server:
> > the server logs showing what happened when sanlock failed to access the
> > gluster volume.

Do you plan to provide the requested logs? Or maybe explain why they are
not needed?
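
In the meantime, you can rule out flaky storage independently of sanlock with
a minimal probe like the sketch below. This is not sanlock's actual code -
sanlock uses direct I/O, which this sketch skips - it only mimics the idea of
reading one 512-byte sector of the lockspace "ids" file every 10 seconds and
reporting latency or I/O errors. The path is a placeholder; substitute your
own mount point and storage domain UUID.

    # Minimal sketch, not sanlock's implementation: periodically read the
    # first sector of the lockspace "ids" file and report latency or errors.
    # The path below is a placeholder - fill in your mount and domain UUID.
    import os
    import time

    IDS = "/rhev/data-center/mnt/glusterSD/<server>:<volume>/<domain-uuid>/dom_md/ids"

    while True:
        start = time.time()
        try:
            fd = os.open(IDS, os.O_RDONLY)
            try:
                # One 512-byte sector, as in the read_sectors error below.
                data = os.read(fd, 512)
            finally:
                os.close(fd)
            print("read %d bytes in %.3f seconds" % (len(data), time.time() - start))
        except OSError as e:
            print("I/O error after %.3f seconds: %s" % (time.time() - start, e))
        time.sleep(10)

If this probe shows errors or long delays while sanlock is reporting
failures, the problem is in the storage path, not in sanlock.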

> We have 3 hosts with a 1 GbE LAN connection (idle), so it's probably
> software related.
> 
> On host H5, we are constantly receiving sanlock warnings:
> 
>     May  1 17:26:37 h5 sanlock[637]: 2015-05-01 17:26:37-0300 9118
>     [643]: s909 add_lockspace fail result -5
>     May  1 17:26:46 h5 sanlock[637]: 2015-05-01 17:26:46-0300 9128
>     [12892]: read_sectors delta_leader offset 512 rv -5
>     /rhev/data-center/mnt/glusterSD/h4.imatronix.com:vdisks/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids

David, can you explain the meaning of "rv -5"?
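
My guess - and it is only a guess until David confirms - is that rv is a
negative errno value, in which case -5 would be EIO (an I/O error), which
would match a storage problem. A quick check of that mapping:

    # Assumption only: treating rv as a negative errno value.
    import errno, os

    rv = -5
    print(errno.errorcode[-rv])   # EIO
    print(os.strerror(-rv))       # Input/output error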

> 
> I thought it was normal, but I want to be sure.

Any error in sanlock is not normal, and getting errors constantly is a bad sign.

> A Gluster statedump reveals these locks:
> 
>     [xlator.features.locks.vdisks-locks.inode]
>     path=/ba7be27f-aee5-4436-ae9a-0764f551f9a7/dom_md/ids
>     mandatory=0
>     conn.1.id=<Host H5>-3016-2015/05/01-17:54:57:109200-vdisks-client-0-0-0
>     conn.1.ref=1
>     conn.1.bound_xl=/mnt/disk1/gluster-bricks/vdisks
>     conn.2.id=<Host H6>-3369-2015/04/30-05:40:59:928550-vdisks-client-0-0-0
>     conn.2.ref=1
>     conn.2.bound_xl=/mnt/disk1/gluster-bricks/vdisks
>     conn.3.id=<Host H4>-31780-2015/04/30-05:57:15:152009-vdisks-client-0-0-0
>     conn.3.ref=1
>     conn.3.bound_xl=/mnt/disk1/gluster-bricks/vdisks
>     conn.4.id=<Host H6>-16034-2015/04/30-16:40:26:355759-vdisks-client-0-0-0
>     conn.4.ref=1
>     conn.4.bound_xl=/mnt/disk1/gluster-bricks/vdisks

All hosts are maintaining their host lease on the dom_md/ids file, so it is
expected that gluster detects all hosts accessing the ids file.

> The host has been up since 14:54:39, so this lease was taken after the boot.

A lease is lost after a reboot, or if the host cannot access the storage for about
80 seconds (depending on the sanlock configuration).

sanlock updates the lease every 10 seconds. If a lease is not updated, other
hosts consider it DEAD. On the machine holding the lease, sanlock will
terminate the process holding the lease, or reboot the machine if the process
cannot be terminated.
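
To make the timing concrete, here is an illustrative sketch of how another
host judges a lease. It is not sanlock's implementation; the 10 and 80 second
values are the ones mentioned above and depend on the configured io_timeout,
so treat them as assumptions.

    # Illustrative timing model, not sanlock code. Values are assumptions
    # taken from this thread and depend on the configured io_timeout.
    import time

    RENEWAL_INTERVAL = 10                  # seconds between renewal writes
    MAX_MISSED_RENEWALS = 8                # assumed number of missed renewals
    LEASE_EXPIRY = RENEWAL_INTERVAL * MAX_MISSED_RENEWALS   # ~80 seconds

    def lease_state(last_renewal, now=None):
        """How other hosts would judge this lease right now."""
        now = time.time() if now is None else now
        age = now - last_renewal
        if age < LEASE_EXPIRY:
            return "LIVE (renewed %.0f seconds ago)" % age
        return "DEAD (no renewal for %.0f seconds)" % age

    # A host that last renewed 95 seconds ago is already considered DEAD by
    # the other hosts, so sanlock on that host must have killed the lease
    # holder (or rebooted the machine) before this point.
    print(lease_state(time.time() - 95))
    print(lease_state(time.time() - 15))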

> The funny thing is that this CentOS 7 host constantly gets into a state
> where the whole root FS is not accessible any more.
> All you have is a bash shell returning I/O errors for any command stored
> on disk (e.g. "ls").
> I thought it was hardware related (2 x 500 GB SSD disks), but maybe we
> are lucky and have found something kernel related.

And the local storage on this host is used for a gluster brick - right?

If you are using replica 3, I would expect gluster to work reliably even when this
host loses access to its storage, but I guess that gluster is not designed for such
flaky storage.

Did you open a glusterfs bug for this?

You should fix this host so it does not lose access to its storage. I don't think
that a setup where storage is constantly lost is useful for anything.

Nir
