On Sun, May 29, 2022 at 7:50 PM Jonathan Baecker <jonbae77(a)gmail.com> wrote:
Hello everybody,
we run a 3-node self-hosted cluster with GlusterFS. I had a lot of problems upgrading
oVirt from 4.4.10 to 4.5.0.2, and now we have cluster instability.
First I will write down the problems I had with upgrading, so you get the bigger picture:
The engine update went fine.
But I could not update the nodes because of a wrong imgbase version, so I did a manual
update to 4.5.0.1 and later to 4.5.0.2. The first time after updating, the node still booted
into 4.4.10, so I did a reinstall.
Then after the second reboot I ended up in emergency mode. After a long search I
figured out that lvm.conf now uses use_devicesfile, but with the wrong filters. So
I commented this out and added the old filters back. I did this on all 3
nodes.
When use_devicesfile (default in 4.5) is enabled, the lvm filter is not used. During
installation the old lvm filter is removed.
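To see what lvm actually uses on a node, you can check the devices file directly;
something like this should work (assuming lvm2 with devices file support, output
depends on your version):
$ lvmconfig --typeconfig full devices/use_devicesfile
$ lvmdevices
$ cat /etc/lvm/devices/system.devices
The first command shows the effective setting, lvmdevices lists the entries lvm
will use, and system.devices is the devices file itself.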
Can you share more info on why it does not work for you?
- output of lsblk
- The old lvm filter used, and why it was needed
- output of vdsm-tool config-lvm-filter
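For example, something like this on one of the nodes should be enough
(default paths assumed):
$ lsblk
$ grep -E '^[[:space:]]*filter' /etc/lvm/lvm.conf
$ vdsm-tool config-lvm-filter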
If using lvm devices does not work for you, you can enable the lvm filter in the
vdsm configuration by adding a drop-in file:
$ cat /etc/vdsm/vdsm.conf.d/99-local.conf
[lvm]
config_method = filter
And run:
vdsm-tool config-lvm-filter
to configure the lvm filter in the best way for vdsm. If this does not create
the right filter we would like to know why, but in general you should use
lvm devices, since it avoids the trouble of maintaining the filter and dealing
with upgrades and user-edited lvm filters.
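For example, on each node (only a sketch; check the filter the tool suggests
before accepting it, -y just answers the confirmation prompt):
$ mkdir -p /etc/vdsm/vdsm.conf.d
$ cat > /etc/vdsm/vdsm.conf.d/99-local.conf <<'EOF'
[lvm]
config_method = filter
EOF
$ vdsm-tool config-lvm-filter -y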
If you disable use_devicesfile, the next vdsm upgrade will enable it again
unless you change the configuration.
Also, even if you disable use_devicesfile in lvm.conf, vdsm still uses --devices
instead of a filter when running lvm commands, and lvm commands run by vdsm ignore
your lvm filter, since the --devices option overrides the system settings.
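To illustrate, when an lvm command is given an explicit device list, your
configured filter is not applied, as described above. The device below is only
an example; vdsm builds the list itself:
$ lvs --devices /dev/sdb -o vg_name,lv_name,lv_tags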
...
I noticed some unsynced volume warnings, but because I had this in the
past too after upgrading, I thought they would disappear after some time. The next day they
were still there, so I decided to put the nodes in maintenance mode again and restart
the glusterd service. After some time the sync warnings were gone.
It is not clear what these warnings are; Gluster warnings, I guess?
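If these are Gluster "unsynced entries" warnings, the heal status on one of the
nodes should show whether anything is still pending, e.g. (volume names taken
from your mount paths; use plain "info" if your Gluster version has no
"info summary"):
$ gluster volume heal vmstore info summary
$ gluster volume heal data info summary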
So now the actual problem:
Since then the cluster has been unstable. I get different errors and warnings, like:
VM [name] is not responding
an HA VM gets migrated out of nowhere
VM migration can fail
VM backup with snapshotting and export takes very long
How do you back up the VMs? Do you use a backup application? How is it
configured?
VMs sometimes get very slow
Storage domain vmstore experienced a high latency of 9.14251
ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record
"." column other_config
489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249
444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
many of: 424035 [2243175]: s27 delta_renew long write time XX sec
All these issues tell us that your storage is not working correctly.
sanlock.log is full of renewal errors from May:
$ grep 2022-05- sanlock.log | wc -l
4844
$ grep 2022-05- sanlock.log | grep 'renewal error' | wc -l
631
But there is a lot of trouble from earlier months:
$ grep 2022-04- sanlock.log | wc -l
844
$ grep 2022-04- sanlock.log | grep 'renewal error' | wc -l
29
$ grep 2022-03- sanlock.log | wc -l
1609
$ grep 2022-03- sanlock.log | grep 'renewal error' | wc -l
483
$ grep 2022-02- sanlock.log | wc -l
826
$ grep 2022-02- sanlock.log | grep 'renewal error' | wc -l
242
Here the sanlock log looks healthy:
$ grep 2022-01- sanlock.log | wc -l
3
$ grep 2022-01- sanlock.log | grep 'renewal error' | wc -l
0
$ grep 2021-12- sanlock.log | wc -l
48
$ grep 2021-12- sanlock.log | grep 'renewal error' | wc -l
0
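These renewal errors mean sanlock could not read or write the lockspace (the
dom_md/ids file) within the timeout. To get a rough feeling for the read latency
sanlock sees, you can time a direct read of that file (only a quick check, not
exactly what sanlock does):
$ time dd if=/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids \
    of=/dev/null bs=1M count=1 iflag=direct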
The vdsm log shows that 2 domains are not accessible:
$ grep ERROR vdsm.log
2022-05-29 15:07:19,048+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 16:33:59,049+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 16:34:39,049+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 17:21:39,050+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 17:55:59,712+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata
(monitor:511)
2022-05-29 17:56:19,711+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata
(monitor:511)
2022-05-29 17:56:39,050+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
(monitor:511)
2022-05-29 17:56:39,711+0200 ERROR (check/loop) [storage.monitor]
Error checking path
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata
(monitor:511)
You need to find out what the issue is with your Gluster storage.
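Some starting points on the Gluster side, assuming the standard gluster CLI:
$ gluster volume status vmstore detail
$ gluster volume heal vmstore info
$ gluster volume profile vmstore start
... let it run for a while under normal load ...
$ gluster volume profile vmstore info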
I hope that Ritesh can help debug the issue with Gluster.
Nir