Hello everybody,
We run a 3-node self-hosted cluster with GlusterFS. I had a lot of
problems upgrading oVirt from 4.4.10 to 4.5.0.2, and now the cluster
is unstable.
First I will describe the problems I had during the upgrade, so you get
the bigger picture:
* The engine update went fine.
* The nodes, however, I could not update because of a wrong version of
imgbase, so I did a manual update to 4.5.0.1 and later to 4.5.0.2. The
first time after updating it was still booting into 4.4.10, so I did a
reinstall.
* Then after the second reboot I ended up in emergency mode. After a
long search I figured out that lvm.conf now uses *use_devicesfile*,
but with that it uses the wrong filters. So I commented this out and
added the old filters back. I did this on all 3 nodes.
* Then in Cockpit I saw this error on all nodes:
|ovs|00077|stream_ssl|ERR|Private key must be configured to use SSL|
To fix that I ran *vdsm-tool ovn-config [engine IP] ovirtmgmt*, and
later in the web interface I chose "enroll certificate" for every node.
* Between upgrading the nodes, I was a bit too fast in migrating all
running VMs, including the HostedEngine, from one host to another, and
the hosted engine crashed once. But it came back after a few minutes,
and since then the engine has been running normally.
* Then I finished the installation by updating the cluster
compatibility version to 4.7.
* I noticed some unsynced volume warnings, but because I had seen these
in the past too after upgrading, I thought they would disappear after
some time. The next day they were still there, so I decided to put
the nodes into maintenance mode again and restart the glusterd
service. After some time the sync warnings were gone.
So now the actual problem:
Since then the cluster has been unstable. I get different errors and
warnings, like:
* VM [name] is not responding
* HA VMs get migrated out of nowhere
* VM migration can fail
* VM backup with snapshotting and export takes very long
* VMs sometimes get very slow
* Storage domain vmstore experienced a high latency of 9.14251
* ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record "." column other_config
* 489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249
* 444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* 471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* many of: 424035 [2243175]: s27 delta_renew long write time XX sec
I will attach the sanlock.log and vdsm.log messages here.
Is there a way I can fix these issues?
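To get a rough overview of which lockspace is hit most often, the
sanlock messages quoted above could be counted with a small script.
This is only a sketch: the function name `summarize`, the regular
expression and the `SAMPLE` list are my own assumptions based on the
lines quoted in this mail, it is not an official sanlock tool, and the
real log may contain other message shapes that it ignores.

```python
import re
from collections import Counter

# Assumed line shape, taken from the messages quoted in this mail:
#   <monotonic seconds> [<pid>]: s<N> <message> ...
# re.search is used so that a leading date/time prefix does not matter.
LINE_RE = re.compile(
    r"(?P<mono>\d+) \[(?P<pid>\d+)\]: (?P<space>s\d+) "
    r"(?P<kind>renewal error"
    r"|delta_renew read timeout"
    r"|delta_renew long write time)"
)


def summarize(lines):
    """Count renewal-related messages per (lockspace id, message kind)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["space"], match["kind"])] += 1
    return counts


# Sample input copied from the messages above (hypothetical stand-in
# for reading /var/log/sanlock.log line by line).
SAMPLE = [
    "489279 [1064359]: s8 renewal error -202 delta_length 10 last_success 489249",
    "444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0 "
    "/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/"
    "3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids",
    "471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0 "
    "/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/"
    "3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids",
    "424035 [2243175]: s27 delta_renew long write time XX sec",
]

if __name__ == "__main__":
    for (space, kind), count in sorted(summarize(SAMPLE).items()):
        print(f"{space}  {kind}: {count}")
```

On a node, the same function could be fed the open log file instead of
`SAMPLE`, for example `summarize(open("/var/log/sanlock.log"))`,
assuming the log is at that path.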
Regards!
Jonathan