new cluster, 6 nodes
by bpbp@fastmail.com
Hi all, I'm planning a new 6-node hyper-converged cluster and have a couple of questions.
1) storage - I think we want replica 2 + arbiter, in the chained configuration shown here (example 5.7):
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5...
Any suggestions on how that looks from the bottom up? For example, does each host have all of its disks in a single hardware RAID 6 volume, with the bricks thinly provisioned via LVM on top, so that each node has two data bricks and one arbiter brick? Or is something else recommended?
2) setup - Do I start with a 3-node pool and extend it to 6, or use Ansible to set up all 6 from the start?
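For what it's worth, the chained layout in that example works out so that every host carries two data bricks and one arbiter brick. A rough sketch of what the volume create could look like on 6 hosts (host names, brick paths, and the volume name are all made up for illustration, not a validated command):

```shell
# Hypothetical hosts host1..host6, each brick a thin LV on the RAID-backed VG.
# Each replica-3-arbiter-1 subvolume takes its two data bricks from adjacent
# hosts and its arbiter from the next host in the ring, so every host ends up
# with 2 data bricks and 1 arbiter brick.
gluster volume create vmstore replica 3 arbiter 1 \
  host1:/gluster_bricks/data1/brick host2:/gluster_bricks/data1/brick host3:/gluster_bricks/arb1/brick \
  host2:/gluster_bricks/data2/brick host3:/gluster_bricks/data2/brick host4:/gluster_bricks/arb2/brick \
  host3:/gluster_bricks/data3/brick host4:/gluster_bricks/data3/brick host5:/gluster_bricks/arb3/brick \
  host4:/gluster_bricks/data4/brick host5:/gluster_bricks/data4/brick host6:/gluster_bricks/arb4/brick \
  host5:/gluster_bricks/data5/brick host6:/gluster_bricks/data5/brick host1:/gluster_bricks/arb5/brick \
  host6:/gluster_bricks/data6/brick host1:/gluster_bricks/data6/brick host2:/gluster_bricks/arb6/brick
```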
Thanks
1 year, 11 months
Re: Getting error on oVirt installation
by dwayne.morton@cment.com
I've installed oVirt Node (fresh 4.4) and am trying to deploy the engine, but it fails each time and seems to be stuck in a loop. I saw this in RHV 4.3, where it was due to IPv6 being enabled. Any help is appreciated.
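If it does turn out to be the same IPv6 issue as in 4.3, one thing to try before redeploying is disabling IPv6 on the management NIC. A sketch (the interface name eno1 is an assumption; substitute your own, and verify this matches your environment first):

```shell
# Hypothetical: disable IPv6 on the management NIC (here eno1)
# before retrying the hosted-engine deployment.
sysctl -w net.ipv6.conf.eno1.disable_ipv6=1
# Make it persistent across reboots:
echo 'net.ipv6.conf.eno1.disable_ipv6 = 1' > /etc/sysctl.d/90-disable-ipv6.conf
```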
1 year, 11 months
storage high latency, sanlock errors, cluster instability
by Jonathan Baecker
Hello everybody,
we run a 3-node self-hosted cluster with GlusterFS. I had a lot of problems upgrading oVirt from 4.4.10 to 4.5.0.2, and now we have cluster instability.
First I will write down the problems I had while upgrading, so you get the bigger picture:
* The engine update went fine.
* But I could not update the nodes because of a wrong imgbase version, so I did a manual update to 4.5.0.1 and later to 4.5.0.2. The first time after updating, a node still booted into 4.4.10, so I did a reinstall.
* Then after the second reboot I ended up in emergency mode. After a long search I figured out that lvm.conf now uses *use_devicesfile* but with the wrong filters. So I commented it out and added the old filters back. I did this on all 3 nodes.
* Then in Cockpit on all nodes I saw errors like: |ovs|00077|stream_ssl|ERR|Private key must be configured to use SSL| To fix that I ran *vdsm-tool ovn-config [engine IP] ovirtmgmt*, and later in the web interface I chose "Enroll Certificate" for every node.
* Between upgrading the nodes, I was a bit too fast migrating all running VMs, including the HostedEngine, from one host to another, and the hosted engine crashed once. But it came back after a few minutes, and since then the engine has been running normally.
* Then I finished the installation by updating the cluster compatibility version to 4.7.
* I noticed some unsynced volume warnings, but because I had also seen these in the past after upgrading, I thought they would disappear after some time. The next day they were still there, so I put the nodes into maintenance mode again and restarted the glusterd service. After some time the sync warnings were gone.
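The lvm.conf change described above (commenting out use_devicesfile and restoring an explicit filter) might look roughly like this; the filter pattern is an example, not the exact value from my nodes:

```
# /etc/lvm/lvm.conf (sketch; values are examples)
devices {
    # use_devicesfile = 1    # commented out, fall back to filter-based config
    filter = ["a|^/dev/disk/by-id/lvm-pv-uuid-.*|", "r|.*|"]
}
```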
So now the actual problem: since then the cluster has been unstable. I get various errors and warnings, like:
* VM [name] is not responding
* out of nowhere, an HA VM gets migrated
* VM migrations can fail
* VM backups with snapshotting and export take very long
* VMs sometimes get very slow
* Storage domain vmstore experienced a high latency of 9.14251
* ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record "." column other_config
* 489279 [1064359]: s8 renewal error -202 delta_length 10 last_success
489249
* 444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* 471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* many of: 424035 [2243175]: s27 delta_renew long write time XX sec
I will attach the sanlock.log and vdsm.log messages here.
Is there a way I can fix these issues?
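In case it helps others narrow this down: the sanlock renewal errors point at slow reads/writes on the dom_md/ids file, so checking the Gluster side first seems sensible. Some of the usual diagnostics (volume names are from my setup; adjust to yours):

```shell
# Heal backlog and brick status on the affected volume
gluster volume heal vmstore info summary
gluster volume status vmstore
# sanlock's view of its lockspaces and renewal state
sanlock client status
# Watch per-device storage latency on the host while the warnings occur
iostat -x 5
```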
Regards!
Jonathan
1 year, 11 months
Install OKD 4.10 with Custom oVirt Certificate
by Fredrik Arneving
Hi,
I've set up and run Installer-Provisioned Installations of OKD on several occasions, with OKD versions 4.4 - 4.8, on my oVirt (4.3?)/4.4 platform. However, after installing a custom certificate for my self-hosted ovirt engine, I've had problems getting the installation of OKD 4.10 (and 4.8) to complete. Is this a known problem with a known solution I can read up on somewhere?
The install takes three times as long as the working ones did before, and when I look at the pods and cluster operators, the "authentication" ones are in a bad state. I can use the KUBECONFIG environment variable to list pods and interact with the environment, but "oc login" fails with "unknown issuer".
I had the choice of a "full install" of my custom cert or just the GUI/Web, and I chose the latter. When installing the custom cert I followed the official RHV documentation that some oVirt user pointed to in a forum. Whatever certs I didn't change seemed to have worked before, so I would be surprised if the solution is to go for the "full install". In all other cases (like my Foreman server and my FreeIPA server) oVirt works just fine with its custom cert.
Since I've done it before, I'm pretty sure I've correctly followed the OKD installation instructions. What's new is the custom ovirt hosted-engine cert. Is there detailed documentation on exactly which certificates from my oVirt installation should be added to my "additionalTrustBundle" in OKD to make it work? In my previous working installations I added the custom root CA, since I needed it for other purposes, but maybe I need to add some other internal ovirt CA?
I'm currently running oVirt version "4.4.10.7-1.el8" on CentOS Stream release 8 and OKD version "4.10.0-0.okd-2022-03-07-131213". No hardware changes between working installations and failed ones.
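The way I currently understand additionalTrustBundle, the whole chain the installer meets when talking to the engine API has to be in there. A sketch of the relevant install-config.yaml fragment (certificate contents are obviously placeholders, and whether the internal CA is still needed is exactly my open question):

```yaml
# install-config.yaml (fragment) - include the custom root CA that signs the
# engine's HTTPS endpoint, and possibly also the oVirt internal CA.
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  ... custom root CA ...
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  ... oVirt internal CA (e.g. from /etc/pki/ovirt-engine/ca.pem) ...
  -----END CERTIFICATE-----
```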
Any hints on how to solve this would be appreciated
1 year, 11 months
why so many such logs ?
by tommy
In the new 4.5 release we see a lot of OVN synchronization entries in the engine logs, very frequently, which we did not see in previous versions.
Is it a new feature?
1 year, 11 months
about the bridge of the host
by tommy
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovirtmgmt state UP group default qlen 1000
link/ether 08:00:27:94:4d:e8 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 9e:5d:8f:94:00:86 brd ff:ff:ff:ff:ff:ff
4: br-int: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ea:20:e5:c3:d6:31 brd ff:ff:ff:ff:ff:ff
5: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 08:00:27:94:4d:e8 brd ff:ff:ff:ff:ff:ff
inet 10.1.1.7/24 brd 10.1.1.255 scope global noprefixroute ovirtmgmt
valid_lft forever preferred_lft forever
21: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
22: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1e:cb:bf:02:f7:33 brd ff:ff:ff:ff:ff:ff
What are items 3/4/5/21/22 used for? (I know item 5.)
Are they all bridges?
The output of brctl show suggests that only ovirtmgmt and ;vdsmdummy; are bridges.
[root@host1 ~]# brctl show
bridge name bridge id STP enabled interfaces
;vdsmdummy; 8000.000000000000 no
ovirtmgmt 8000.080027944de8 no enp0s3
[root@host1 ~]#
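Part of the confusion is that brctl only knows about kernel (Linux) bridges, while ovs-system and br-int belong to Open vSwitch. One way to see each device's actual type (command names only; output details vary by distro):

```shell
# Kernel view: "-d" prints the device type (bridge, openvswitch, ipip, ...)
ip -d link show br-int      # shows an openvswitch device
ip -d link show ovirtmgmt   # shows a plain Linux bridge
ip -d link show ip_vti0     # a vti/ipip tunnel device, unused by default
# OVS view: lists br-int as an OVS bridge (managed by OVN for VM networks)
ovs-vsctl show
```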
1 year, 11 months
infiniband for VM traffic
by Roberto Bertucci
Hi all,
I am trying to use a Mellanox 100G InfiniBand interface (EoIB) for VM traffic.
Actually, while trying to configure the hosts to use it, I get an error, and in vdsm.log I see:
The bridge ovirtib cannot use IP over InfiniBand interface ib0 as port. Please use RoCE interface instead.
ib0 is configured with an IP address and works correctly; it is used to mount NFS directories on the cluster nodes.
Did anybody face this issue?
Thank you all for the help.
1 year, 11 months
VM HostedEngine is down with error
by souvaliotimaria@mail.com
Hello everyone,
I have a replica 2 + arbiter installation, and this morning the Hosted Engine gave the following error in the UI and resumed on a different node (node3) than the one it was originally running on (node1). (The original node has more memory than the one it ended up on, but it had a better memory-usage percentage at the time.) Also, the only way I discovered that the migration had happened and that there was an Error in Events was that I logged in to the oVirt web interface for a routine inspection. Besides that, everything was working properly and still is.
The error that popped is the following:
VM HostedEngine is down with error. Exit message: internal error: qemu unexpectedly closed the monitor:
2020-09-01T06:49:20.749126Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2020-09-01T06:49:20.927274Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,id=ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,bootindex=1,write-cache=on: Failed to get "write" lock
Is another process using the image?.
From what I could gather, this concerns the following snippet from HostedEngine.xml; it is the virtio disk of the Hosted Engine:
<disk type='file' device='disk' snapshot='no'>
<driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
<source file='/var/run/vdsm/storage/80f6e393-9718-4738-a14a-64cf43c3d8c2/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7'>
<seclabel model='dac' relabel='no'/>
</source>
<target dev='vda' bus='virtio'/>
<serial>d5de54b6-9f8e-4fba-819b-ebf6780757d2</serial>
<alias name='ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
I've tried looking into the logs and the sar output, but I couldn't find anything to relate to the above errors or to determine why this happened. Is this a Gluster or a QEMU problem?
The Hosted Engine had been manually migrated onto node1 five days before.
Is there a standard practice I could follow to determine what happened and secure my system?
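To investigate the "Failed to get \"write\" lock" part, a starting point might be checking whether another qemu process (e.g. a leftover on the old host) still holds the engine disk image open, and whether the engine volume has pending heals. A sketch, using the image path from the XML above (the volume name "engine" is an assumption from the typical HCI setup):

```shell
# On each host: is any process still holding the engine disk image open?
lsof /var/run/vdsm/storage/80f6e393-9718-4738-a14a-64cf43c3d8c2/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
# Heal state of the engine storage volume
gluster volume heal engine info
# The HA agent's view of engine state across the hosts
hosted-engine --vm-status
```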
Thank you very much for your time,
Maria Souvalioti
1 year, 11 months