Usually I would blacklist the Gluster devices by creating the necessary stanzas in /etc/multipath/conf.d/blacklist.conf
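For example, a stanza like this (the WWIDs are placeholders - take the real ones from 'multipath -ll' for your Gluster disks, then apply with 'multipathd reconfigure'):

    # /etc/multipath/conf.d/blacklist.conf  (example only)
    blacklist {
        wwid "<wwid-of-gluster-disk-1>"
        wwid "<wwid-of-gluster-disk-2>"
    }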

This way you will keep the situation simple.


As for your problem, it's hard to identify the cause based on the e-mails. What are your symptoms?

To debug GlusterFS, it is good to start from the brick logs (/var/log/glusterfs/bricks) and the current heal status. On a 3-way replica volume, heals should be resolved by GlusterFS itself - if they are not, there is a bug.
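For example, something along these lines (the volume names match your setup - data, vmstore, engine - and the brick log file name is a placeholder, adjust to your bricks):

    # pending heals per volume
    gluster volume heal data info summary
    gluster volume heal vmstore info summary
    gluster volume heal engine info summary

    # brick / self-heal daemon state
    gluster volume status data

    # follow a brick log (file name depends on the brick path)
    tail -f /var/log/glusterfs/bricks/<brick-path>.log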

Best Regards,
Strahil Nikolov

On Tue, May 31, 2022 at 16:32, jb
<jonbae77@gmail.com> wrote:
I still have the same problems, but it looks like the errors come a bit
less often.

I'm now starting to migrate the disk images to an NFS storage. If there
is no other way, I will recreate the GlusterFS cluster.

The problem I have is that I don't know the root cause of this issue, and
whether recreating would fix it in the long term.

On 29.05.22 at 20:26, Nir Soffer wrote:
> On Sun, May 29, 2022 at 9:03 PM Jonathan Baecker <jonbae77@gmail.com> wrote:
>> On 29.05.22 at 19:24, Nir Soffer wrote:
>>
>> On Sun, May 29, 2022 at 7:50 PM Jonathan Baecker <jonbae77@gmail.com> wrote:
>>
>> Hello everybody,
>>
>> we run a 3-node self-hosted cluster with GlusterFS. I had a lot of problems upgrading oVirt from 4.4.10 to 4.5.0.2 and now we have cluster instability.
>>
>> First I will write down the problems I had with upgrading, so you get a bigger picture:
>>
>> The engine update went fine.
>> But I could not update the nodes because of a wrong imgbase version, so I did a manual update to 4.5.0.1 and later to 4.5.0.2. The first time after updating, the node was still booting into 4.4.10, so I did a reinstall.
>> Then after the second reboot I ended up in emergency mode. After a long search I figured out that lvm.conf now uses use_devicesfile, but with the wrong filters. So I commented this out and added the old filters back. I did this procedure on all 3 nodes.
>>
>> When use_devicesfile (default in 4.5) is enabled, the lvm filter is not used.
>> During installation the old lvm filter is removed.
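>> (For reference - a minimal sketch of the relevant lvm.conf setting; the
>> value shown is the 4.5 default:)
>>
>>      # /etc/lvm/lvm.conf, devices section
>>      devices {
>>          use_devicesfile = 1   # devices come from /etc/lvm/devices/system.devices
>>      }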
>>
>> Can you share more info on why it does not work for you?
>>
>> The problem was that the node could not mount the gluster volumes anymore and ended up in emergency mode.
>>
>> - output of lsblk
>>
>> NAME                                                      MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
>> sda                                                          8:0    0  1.8T  0 disk
>> `-XA1920LE10063_HKS028AV                                  253:0    0  1.8T  0 mpath
>>    |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tmeta  253:16  0    9G  0 lvm
>>    | `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool 253:18  0  1.7T  0 lvm
>>    |  |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda    253:19  0  1.7T  1 lvm
>>    |  |-gluster_vg_sda-gluster_lv_data                    253:20  0  100G  0 lvm  /gluster_bricks/data
>>    |  `-gluster_vg_sda-gluster_lv_vmstore                  253:21  0  1.6T  0 lvm  /gluster_bricks/vmstore
>>    `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tdata  253:17  0  1.7T  0 lvm
>>      `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool 253:18  0  1.7T  0 lvm
>>        |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda    253:19  0  1.7T  1 lvm
>>        |-gluster_vg_sda-gluster_lv_data                    253:20  0  100G  0 lvm  /gluster_bricks/data
>>        `-gluster_vg_sda-gluster_lv_vmstore                  253:21  0  1.6T  0 lvm  /gluster_bricks/vmstore
>> sr0                                                        11:0    1  1024M  0 rom
>> nvme0n1                                                    259:0    0 238.5G  0 disk
>> |-nvme0n1p1                                                259:1    0    1G  0 part  /boot
>> |-nvme0n1p2                                                259:2    0  134G  0 part
>> | |-onn-pool00_tmeta                                      253:1    0    1G  0 lvm
>> | | `-onn-pool00-tpool                                    253:3    0    87G  0 lvm
>> | |  |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1      253:4    0    50G  0 lvm  /
>> | |  |-onn-pool00                                        253:7    0    87G  1 lvm
>> | |  |-onn-home                                          253:8    0    1G  0 lvm  /home
>> | |  |-onn-tmp                                            253:9    0    1G  0 lvm  /tmp
>> | |  |-onn-var                                            253:10  0    15G  0 lvm  /var
>> | |  |-onn-var_crash                                      253:11  0    10G  0 lvm  /var/crash
>> | |  |-onn-var_log                                        253:12  0    8G  0 lvm  /var/log
>> | |  |-onn-var_log_audit                                  253:13  0    2G  0 lvm  /var/log/audit
>> | |  |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1      253:14  0    50G  0 lvm
>> | |  `-onn-var_tmp                                        253:15  0    10G  0 lvm  /var/tmp
>> | |-onn-pool00_tdata                                      253:2    0    87G  0 lvm
>> | | `-onn-pool00-tpool                                    253:3    0    87G  0 lvm
>> | |  |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1      253:4    0    50G  0 lvm  /
>> | |  |-onn-pool00                                        253:7    0    87G  1 lvm
>> | |  |-onn-home                                          253:8    0    1G  0 lvm  /home
>> | |  |-onn-tmp                                            253:9    0    1G  0 lvm  /tmp
>> | |  |-onn-var                                            253:10  0    15G  0 lvm  /var
>> | |  |-onn-var_crash                                      253:11  0    10G  0 lvm  /var/crash
>> | |  |-onn-var_log                                        253:12  0    8G  0 lvm  /var/log
>> | |  |-onn-var_log_audit                                  253:13  0    2G  0 lvm  /var/log/audit
>> | |  |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1      253:14  0    50G  0 lvm
>> | |  `-onn-var_tmp                                        253:15  0    10G  0 lvm  /var/tmp
>> | `-onn-swap                                              253:5    0    20G  0 lvm  [SWAP]
>> `-nvme0n1p3                                                259:3    0    95G  0 part
>>    `-gluster_vg_nvme0n1p3-gluster_lv_engine                253:6    0    94G  0 lvm  /gluster_bricks/engine
>  >
>> - The old lvm filter used, and why it was needed
>>
>> filter = ["a|^/dev/disk/by-id/lvm-pv-uuid-Nn7tZl-TFdY-BujO-VZG5-EaGW-5YFd-Lo5pwa$|", "a|^/dev/disk/by-id/lvm-pv-uuid-Wcbxnx-2RhC-s1Re-s148-nLj9-Tr3f-jj4VvE$|", "a|^/dev/disk/by-id/lvm-pv-uuid-lX51wm-H7V4-3CTn-qYob-Rkpx-Tptd-t94jNL$|", "r|.*|"]
>>
>> I don't remember exactly anymore why it was needed, but without it the node was not working correctly. I think I even used vdsm-tool config-lvm-filter.
> I think that if you list the devices in this filter:
>
>      ls -lh /dev/disk/by-id/lvm-pv-uuid-Nn7tZl-TFdY-BujO-VZG5-EaGW-5YFd-Lo5pwa \
>              /dev/disk/by-id/lvm-pv-uuid-Wcbxnx-2RhC-s1Re-s148-nLj9-Tr3f-jj4VvE
> \
>              /dev/disk/by-id/lvm-pv-uuid-lX51wm-H7V4-3CTn-qYob-Rkpx-Tptd-t94jNL
>
> You will see that these are the devices used by these vgs:
>
>      gluster_vg_sda, gluster_vg_nvme0n1p3, onn
>
>> - output of vdsm-tool config-lvm-filter
>>
>> Analyzing host...
>> Found these mounted logical volumes on this host:
>>
>>    logical volume:  /dev/mapper/gluster_vg_nvme0n1p3-gluster_lv_engine
>>    mountpoint:      /gluster_bricks/engine
>>    devices:        /dev/nvme0n1p3
>>
>>    logical volume:  /dev/mapper/gluster_vg_sda-gluster_lv_data
>>    mountpoint:      /gluster_bricks/data
>>    devices:        /dev/mapper/XA1920LE10063_HKS028AV
>>
>>    logical volume:  /dev/mapper/gluster_vg_sda-gluster_lv_vmstore
>>    mountpoint:      /gluster_bricks/vmstore
>>    devices:        /dev/mapper/XA1920LE10063_HKS028AV
>>
>>    logical volume:  /dev/mapper/onn-home
>>    mountpoint:      /home
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1
>>    mountpoint:      /
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-swap
>>    mountpoint:      [SWAP]
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-tmp
>>    mountpoint:      /tmp
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-var
>>    mountpoint:      /var
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-var_crash
>>    mountpoint:      /var/crash
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-var_log
>>    mountpoint:      /var/log
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-var_log_audit
>>    mountpoint:      /var/log/audit
>>    devices:        /dev/nvme0n1p2
>>
>>    logical volume:  /dev/mapper/onn-var_tmp
>>    mountpoint:      /var/tmp
>>    devices:        /dev/nvme0n1p2
>>
>> Configuring LVM system.devices.
>> Devices for following VGs will be imported:
>>
>>  gluster_vg_sda, gluster_vg_nvme0n1p3, onn
>>
>> To properly configure the host, we need to add multipath
>> blacklist in /etc/multipath/conf.d/vdsm_blacklist.conf:
>>
>>    blacklist {
>>        wwid "eui.0025388901b1e26f"
>>    }
>>
>>
>> Configure host? [yes,NO]
> If you run "vdsm-tool config-lvm-filter" and confirm with "yes", I think
> all the vgs will be imported properly into the lvm devices file.
>
> I don't think it will solve the storage issues you have had since Feb 2022,
> but at least you will have a standard configuration and the next upgrade
> will not revert your local settings.
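> (To verify afterwards - a sketch, the VG names are taken from your lsblk
> output:)
>
>      # list the entries in /etc/lvm/devices/system.devices
>      lvmdevices
>
>      # confirm all three VGs are visible
>      vgs gluster_vg_sda gluster_vg_nvme0n1p3 onn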
>
>> If using lvm devices does not work for you, you can enable the lvm
>> filter in vdsm configuration
>> by adding a drop-in file:
>>
>> $ cat /etc/vdsm/vdsm.conf.d/99-local.conf
>> [lvm]
>> config_method = filter
>>
>> And run:
>>
>>      vdsm-tool config-lvm-filter
>>
>> to configure the lvm filter in the best way for vdsm. If this does not create
>> the right filter we would like to know why, but in general you should use
>> lvm devices since it avoids the trouble of maintaining the filter and dealing
>> with upgrades and user-edited lvm filters.
>>
>> If you disable use_devicesfile, the next vdsm upgrade will enable it back
>> unless you change the configuration.
>>
>> I would be happy to just use the default, if there is a way to make use_devicesfile work.
>>
>> Also, even if you disable use_devicesfile in lvm.conf, vdsm still uses
>> --devices instead of a filter when running lvm commands, and lvm commands
>> run by vdsm ignore your lvm filter since the --devices option overrides
>> the system settings.
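>> (As an illustration only - not the exact command vdsm runs - an lvm command
>> can be pointed at an explicit device list, which overrides both the filter
>> and the devices file:)
>>
>>      pvs --devices /dev/mapper/XA1920LE10063_HKS028AV,/dev/nvme0n1p3 -o pv_name,vg_name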
>>
>> ...
>>
>> I noticed some unsynced-volume warnings, but because I had these in the past too, after upgrading, I thought they would disappear after some time. The next day they were still there, so I decided to put the nodes into maintenance mode again and restart the glusterd service. After some time the sync warnings were gone.
>>
>> Not clear what these warnings are, I guess Gluster warnings?
>>
>> Yes, those were Gluster warnings; under Storage -> Volumes it was saying that some entries are unsynced.
>>
>> So now the actual problem:
>>
>> Since then the cluster has been unstable. I get different errors and warnings, like:
>>
>> VM [name] is not responding
>> out of nowhere an HA VM gets migrated
>> VM migration can fail
>> VM backup with snapshotting and export takes very long
>>
>> How do you back up the VMs? Do you use a backup application? How is it
>> configured?
>>
>> I use a self-made Python script which uses the REST API. I create a snapshot of the VM, build a new VM from that snapshot and move the new one to the export domain.
> This is not very efficient - it copies the entire vm at the point in time
> of the snapshot and then copies it again to the export domain.
>
> If you use a backup application supporting the incremental backup API,
> the first full backup will copy the entire vm once, but later incremental
> backups will copy only the changes since the last backup.
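> (A minimal sketch of what starting an incremental backup through the Python
> SDK looks like - the connection details, VM id, disk id and checkpoint id
> below are placeholders:)
>
>      import ovirtsdk4 as sdk
>      import ovirtsdk4.types as types
>
>      connection = sdk.Connection(
>          url="https://engine.example.org/ovirt-engine/api",
>          username="admin@internal",
>          password="PASSWORD",
>          ca_file="ca.pem",
>      )
>
>      vm_service = connection.system_service().vms_service().vm_service("VM-UUID")
>      backups_service = vm_service.backups_service()
>
>      # from_checkpoint_id=None would start a full backup instead
>      backup = backups_service.add(
>          types.Backup(
>              disks=[types.Disk(id="DISK-UUID")],
>              from_checkpoint_id="LAST-CHECKPOINT-UUID",
>          )
>      )
>
>      # poll backups_service.backup_service(backup.id).get() until the backup
>      # phase is READY, then download the disk data via an image transfer
>      # (ovirt-imageio), and finally finalize the backup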
>
>> VMs are getting very slow sometimes
>> Storage domain vmstore experienced a high latency of 9.14251
>> ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record "." column other_config
>> 489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249
>> 444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
>> 471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
>> many of: 424035 [2243175]: s27 delta_renew long write time XX sec
>>
>> All these issues tell us that your storage is not working correctly.
>>
>> sanlock.log is full of renewal errors from May:
>>
>> $ grep 2022-05- sanlock.log | wc -l
>> 4844
>>
>> $ grep 2022-05- sanlock.log | grep 'renewal error' | wc -l
>> 631
>>
>> But there is a lot of trouble from earlier months:
>>
>> $ grep 2022-04- sanlock.log | wc -l
>> 844
>> $ grep 2022-04- sanlock.log | grep 'renewal error' | wc -l
>> 29
>>
>> $ grep 2022-03- sanlock.log | wc -l
>> 1609
>> $ grep 2022-03- sanlock.log | grep 'renewal error' | wc -l
>> 483
>>
>> $ grep 2022-02- sanlock.log | wc -l
>> 826
>> $ grep 2022-02- sanlock.log | grep 'renewal error' | wc -l
>> 242
>>
>> Here sanlock log looks healthy:
>>
>> $ grep 2022-01- sanlock.log | wc -l
>> 3
>> $ grep 2022-01- sanlock.log | grep 'renewal error' | wc -l
>> 0
>>
>> $ grep 2021-12- sanlock.log | wc -l
>> 48
>> $ grep 2021-12- sanlock.log | grep 'renewal error' | wc -l
>> 0
>>
>> vdsm log shows that 2 domains are not accessible:
>>
>> $ grep ERROR vdsm.log
>> 2022-05-29 15:07:19,048+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
>> (monitor:511)
>> 2022-05-29 16:33:59,049+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
>> (monitor:511)
>> 2022-05-29 16:34:39,049+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
>> (monitor:511)
>> 2022-05-29 17:21:39,050+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
>> (monitor:511)
>> 2022-05-29 17:55:59,712+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata
>> (monitor:511)
>> 2022-05-29 17:56:19,711+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata
>> (monitor:511)
>> 2022-05-29 17:56:39,050+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata
>> (monitor:511)
>> 2022-05-29 17:56:39,711+0200 ERROR (check/loop) [storage.monitor]
>> Error checking path
>> /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata
>> (monitor:511)
>>
>> You need to find out what is wrong with your Gluster storage.
>>
>> I hope that Ritesh can help debug the issue with Gluster.
>>
>> Nir
>>
>> I'm worried that I will do something that makes it even worse, and I have no idea what the problem is. To me it does not look exactly like a problem with data inconsistencies.
> The problem is that your Gluster storage is not healthy, and reading
> and writing to it times out.
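> (One way to see where the latency comes from is Gluster's built-in profiling
> - a sketch, using the vmstore volume as an example:)
>
>      gluster volume profile vmstore start
>      # let it run while the problem is happening, then:
>      gluster volume profile vmstore info
>      gluster volume profile vmstore stop
>
>      # per-brick health and I/O overview
>      gluster volume status vmstore detail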
>
> Please keep users@ovirt.org in CC when you reply. Gluster storage is very
> popular on this mailing list and you may get useful help from other users.
>
> Nir
>
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/VBYRVRQPXXDZTDFG46LEECHLRDWDWZ37/