
Hello everybody,

We run a 3-node self-hosted cluster with GlusterFS. I had a lot of problems upgrading oVirt from 4.4.10 to 4.5.0.2, and now the cluster is unstable. First I will write down the problems I had while upgrading, so you get the bigger picture:

* The engine update went fine.
* The nodes I could not update because of a wrong version of imgbase, so I did a manual update to 4.5.0.1 and later to 4.5.0.2. The first time after updating, a node still booted into 4.4.10, so I did a reinstall.
* After the second reboot I ended up in emergency mode. After a long search I figured out that lvm.conf now uses use_devicesfile, but it picks the wrong filters. So I commented this out and added the old filters back. I did this on all 3 nodes.
* Then in Cockpit on all nodes I saw errors like: |ovs|00077|stream_ssl|ERR|Private key must be configured to use SSL|. To fix that I ran "vdsm-tool ovn-config [engine IP] ovirtmgmt", and later in the web interface I chose "Enroll Certificate" for every node.
* Between upgrading the nodes, I was a bit too fast migrating all running VMs, including the HostedEngine, from one host to another, and the hosted engine crashed once. But it came back after a few minutes, and since then the engine has run normally.
* Then I finished the installation by updating the cluster compatibility version to 4.7.
* I noticed some unsynced-volume warnings, but because I had seen those after upgrades in the past too, I thought they would disappear after some time. The next day they were still there, so I decided to put the nodes into maintenance mode again and restart the glusterd service. After some time the sync warnings were gone.

So now the actual problem: since then the cluster has been unstable.
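For reference, the change I made in /etc/lvm/lvm.conf on each node looked roughly like this (the filter line is only an example pattern; the actual filter is the old one from our 4.4 configuration and depends on each host's disk layout):

```
devices {
    # use_devicesfile = 1    # commented out: with the devices file enabled,
                             # the node no longer found its PVs at boot
    use_devicesfile = 0
    # old filter restored from the 4.4 config (example pattern only)
    filter = ["a|^/dev/disk/by-id/lvm-pv-uuid-.*|", "r|.*|"]
}
```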
I get different errors and warnings, like:

* VM [name] is not responding
* HA VMs get migrated out of nowhere
* VM migrations can fail
* VM backups with snapshotting and export take very long
* VMs sometimes become very slow
* Storage domain vmstore experienced a high latency of 9.14251
* ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record "." column other_config
* 489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249
* 444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* 471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* many of: 424035 [2243175]: s27 delta_renew long write time XX sec

I will attach the sanlock.log and vdsm.log messages here. Is there a way I can fix these issues?

Regards,
Jonathan