The problem was that the node could not mount the gluster volumes anymore and ended up in emergency mode.

On Sun, May 29, 2022 at 7:50 PM Jonathan Baecker <jonbae77@gmail.com> wrote:

Hello everybody,

We run a 3-node self-hosted cluster with GlusterFS. I had a lot of problems upgrading oVirt from 4.4.10 to 4.5.0.2, and now we have cluster instability. First I will write down the problems I had while upgrading, so you get a bigger picture:

The engine update went fine, but I could not update the nodes because of a wrong imgbase version, so I did a manual update to 4.5.0.1 and later to 4.5.0.2. The first time after updating, the node was still booting into 4.4.10, so I did a reinstall. After the second reboot I ended up in emergency mode. After a long search I figured out that lvm.conf now uses use_devicesfile, but there it used the wrong filters. So I commented this out and added the old filters back. I did this procedure on all 3 nodes.

When use_devicesfile (default in 4.5) is enabled, the lvm filter is not used. During installation the old lvm filter is removed. Can you share more info on why it does not work for you?
- output of lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
`-XA1920LE10063_HKS028AV 253:0 0 1.8T 0 mpath
|-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tmeta 253:16 0 9G 0 lvm
| `-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool 253:18 0 1.7T 0 lvm
| |-gluster_vg_sda-gluster_thinpool_gluster_vg_sda 253:19 0 1.7T 1 lvm
| |-gluster_vg_sda-gluster_lv_data 253:20 0 100G 0 lvm /gluster_bricks/data
| `-gluster_vg_sda-gluster_lv_vmstore 253:21 0 1.6T 0 lvm /gluster_bricks/vmstore
`-gluster_vg_sda-gluster_thinpool_gluster_vg_sda_tdata 253:17 0 1.7T 0 lvm
`-gluster_vg_sda-gluster_thinpool_gluster_vg_sda-tpool 253:18 0 1.7T 0 lvm
|-gluster_vg_sda-gluster_thinpool_gluster_vg_sda 253:19 0 1.7T 1 lvm
|-gluster_vg_sda-gluster_lv_data 253:20 0 100G 0 lvm /gluster_bricks/data
`-gluster_vg_sda-gluster_lv_vmstore 253:21 0 1.6T 0 lvm /gluster_bricks/vmstore
sr0 11:0 1 1024M 0 rom
nvme0n1 259:0 0 238.5G 0 disk
|-nvme0n1p1 259:1 0 1G 0 part /boot
|-nvme0n1p2 259:2 0 134G 0 part
| |-onn-pool00_tmeta 253:1 0 1G 0 lvm
| | `-onn-pool00-tpool 253:3 0 87G 0 lvm
| | |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1 253:4 0 50G 0 lvm /
| | |-onn-pool00 253:7 0 87G 1 lvm
| | |-onn-home 253:8 0 1G 0 lvm /home
| | |-onn-tmp 253:9 0 1G 0 lvm /tmp
| | |-onn-var 253:10 0 15G 0 lvm /var
| | |-onn-var_crash 253:11 0 10G 0 lvm /var/crash
| | |-onn-var_log 253:12 0 8G 0 lvm /var/log
| | |-onn-var_log_audit 253:13 0 2G 0 lvm /var/log/audit
| | |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1 253:14 0 50G 0 lvm
| | `-onn-var_tmp 253:15 0 10G 0 lvm /var/tmp
| |-onn-pool00_tdata 253:2 0 87G 0 lvm
| | `-onn-pool00-tpool 253:3 0 87G 0 lvm
| | |-onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1 253:4 0 50G 0 lvm /
| | |-onn-pool00 253:7 0 87G 1 lvm
| | |-onn-home 253:8 0 1G 0 lvm /home
| | |-onn-tmp 253:9 0 1G 0 lvm /tmp
| | |-onn-var 253:10 0 15G 0 lvm /var
| | |-onn-var_crash 253:11 0 10G 0 lvm /var/crash
| | |-onn-var_log 253:12 0 8G 0 lvm /var/log
| | |-onn-var_log_audit 253:13 0 2G 0 lvm /var/log/audit
| | |-onn-ovirt--node--ng--4.5.0.1--0.20220511.0+1 253:14 0 50G 0 lvm
| | `-onn-var_tmp 253:15 0 10G 0 lvm /var/tmp
| `-onn-swap 253:5 0 20G 0 lvm [SWAP]
`-nvme0n1p3 259:3 0 95G 0 part
`-gluster_vg_nvme0n1p3-gluster_lv_engine 253:6 0 94G 0 lvm /gluster_bricks/engine
- The old lvm filter used, and why it was needed
filter = ["a|^/dev/disk/by-id/lvm-pv-uuid-Nn7tZl-TFdY-BujO-VZG5-EaGW-5YFd-Lo5pwa$|", "a|^/dev/disk/by-id/lvm-pv-uuid-Wcbxnx-2RhC-s1Re-s148-nLj9-Tr3f-jj4VvE$|", "a|^/dev/disk/by-id/lvm-pv-uuid-lX51wm-H7V4-3CTn-qYob-Rkpx-Tptd-t94jNL$|", "r|.*|"]I don't remember exactly any more why it was needed, but without the node was not working correctly. I think I even used vdsm-tool config-lvm-filter.
- output of vdsm-tool config-lvm-filter
Analyzing host...
Found these mounted logical volumes on this host:
logical volume: /dev/mapper/gluster_vg_nvme0n1p3-gluster_lv_engine
mountpoint: /gluster_bricks/engine
devices: /dev/nvme0n1p3
logical volume: /dev/mapper/gluster_vg_sda-gluster_lv_data
mountpoint: /gluster_bricks/data
devices: /dev/mapper/XA1920LE10063_HKS028AV
logical volume: /dev/mapper/gluster_vg_sda-gluster_lv_vmstore
mountpoint: /gluster_bricks/vmstore
devices: /dev/mapper/XA1920LE10063_HKS028AV
logical volume: /dev/mapper/onn-home
mountpoint: /home
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-ovirt--node--ng--4.5.0.2--0.20220513.0+1
mountpoint: /
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-swap
mountpoint: [SWAP]
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-tmp
mountpoint: /tmp
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-var
mountpoint: /var
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-var_crash
mountpoint: /var/crash
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-var_log
mountpoint: /var/log
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-var_log_audit
mountpoint: /var/log/audit
devices: /dev/nvme0n1p2
logical volume: /dev/mapper/onn-var_tmp
mountpoint: /var/tmp
devices: /dev/nvme0n1p2
Configuring LVM system.devices.
Devices for following VGs will be imported:
gluster_vg_sda, gluster_vg_nvme0n1p3, onn
To properly configure the host, we need to add multipath
blacklist in /etc/multipath/conf.d/vdsm_blacklist.conf:
blacklist {
wwid "eui.0025388901b1e26f"
}
Configure host? [yes,NO]
I would be happy to just use the default, if there is a way to make use_devicesfile work.

If using lvm devices does not work for you, you can enable the lvm filter in the vdsm configuration by adding a drop-in file:

$ cat /etc/vdsm/vdsm.conf.d/99-local.conf
[lvm]
config_method = filter

And run:

vdsm-tool config-lvm-filter

to configure the lvm filter in the best way for vdsm. If this does not create the right filter we would like to know why, but in general you should use lvm devices, since it avoids the trouble of maintaining the filter and dealing with upgrades and user-edited lvm filters. If you disable use_devicesfile, the next vdsm upgrade will enable it back unless you change the configuration.
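As a side note for anyone debugging the devices-file path: the entries LVM actually uses can be inspected directly (this assumes an LVM version with devices-file support, which use_devicesfile already implies):

$ lvmdevices
$ cat /etc/lvm/devices/system.devices

If a PV that a volume group needs (for example the multipath device backing gluster_vg_sda) is missing from that list, the bricks would fail to activate at boot, which would match the emergency-mode symptom.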
Yes, there were Gluster warnings under Storage -> Volumes saying that some entries are unsynced.

Also, even if you disable use_devicesfile in lvm.conf, vdsm still uses --devices instead of a filter when running lvm commands, and lvm commands run by vdsm ignore your lvm filter, since the --devices option overrides the system settings.

...

I noticed some unsynced volume warnings, but because I had this in the past too after upgrading, I thought they would disappear after some time. The next day they were still there, so I decided to put the nodes into maintenance mode again and restart the glusterd service. After some time the sync warnings were gone.

Not clear what these warnings are, I guess Gluster warnings?
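For warnings like these, the pending-heal state can be checked per volume from any node with the Gluster CLI, for example (volume names taken from the bricks shown above):

$ gluster volume heal data info
$ gluster volume heal vmstore info
$ gluster volume heal engine info

If entries keep showing up there, the bricks really are out of sync; if the output is empty, the warning in the UI was probably stale.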
I use a self-made Python script which uses the REST API. I create a snapshot of the VM, build a new VM from that snapshot and move the new one to the export domain.

So now the actual problem: since this time the cluster is unstable. I get different errors and warnings, like:

VM [name] is not responding, out of nothing
HA VM gets migrated
VM migration can fail
VM backup with snapshotting and export takes very long

How do you back up the VMs? Do you use a backup application? How is it configured?
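Just to illustrate the kind of REST call such a script makes (a generic sketch with a placeholder engine host, credentials and VM id, not Jonathan's actual script), creating a snapshot looks roughly like:

$ curl -s -k -u admin@internal:PASSWORD \
    -H "Content-Type: application/xml" \
    -d "<snapshot><description>backup</description></snapshot>" \
    "https://engine.example.org/ovirt-engine/api/vms/VM_ID/snapshots"

The clone-from-snapshot and export steps are similar POST requests against the same API; each of them generates disk activity on the storage domain, which makes them sensitive to the latency problems described below.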
VMs are getting very slow sometimes
Storage domain vmstore experienced a high latency of 9.14251
ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record "." column other_config
489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249
444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
many of:
424035 [2243175]: s27 delta_renew long write time XX sec

All these issues tell us that your storage is not working correctly. sanlock.log is full of renewal errors from May:

$ grep 2022-05- sanlock.log | wc -l
4844
$ grep 2022-05- sanlock.log | grep 'renewal error' | wc -l
631

But there is a lot of trouble from earlier months:

$ grep 2022-04- sanlock.log | wc -l
844
$ grep 2022-04- sanlock.log | grep 'renewal error' | wc -l
29
$ grep 2022-03- sanlock.log | wc -l
1609
$ grep 2022-03- sanlock.log | grep 'renewal error' | wc -l
483
$ grep 2022-02- sanlock.log | wc -l
826
$ grep 2022-02- sanlock.log | grep 'renewal error' | wc -l
242

Here the sanlock log looks healthy:

$ grep 2022-01- sanlock.log | wc -l
3
$ grep 2022-01- sanlock.log | grep 'renewal error' | wc -l
0
$ grep 2021-12- sanlock.log | wc -l
48
$ grep 2021-12- sanlock.log | grep 'renewal error' | wc -l
0

vdsm log shows that 2 domains are not accessible:

$ grep ERROR vdsm.log
2022-05-29 15:07:19,048+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
2022-05-29 16:33:59,049+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
2022-05-29 16:34:39,049+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
2022-05-29 17:21:39,050+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
2022-05-29 17:55:59,712+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata (monitor:511)
2022-05-29 17:56:19,711+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata (monitor:511)
2022-05-29 17:56:39,050+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_data/de5f4123-0fac-4238-abcf-a329c142bd47/dom_md/metadata (monitor:511)
2022-05-29 17:56:39,711+0200 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/metadata (monitor:511)

You need to find out what the issue is with your Gluster storage. I hope that Ritesh can help debug the issue with Gluster.

Nir
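As a concrete starting point for that Gluster debugging, generic per-volume checks along these lines may help (a sketch only; vmstore is taken from the mount paths above, and profiling adds some overhead while it is enabled):

$ gluster volume status vmstore detail
$ gluster volume profile vmstore start
$ gluster volume profile vmstore info
$ gluster volume profile vmstore stop

Slow read/write latencies in the profile output would line up with the sanlock delta_renew timeouts quoted above.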
I'm worried that I will do something that makes it even worse, and I have no idea what the problem is. To me it does not look exactly like a problem with data inconsistencies.
Jonathan