storage high latency, sanlock errors, cluster instability
by Jonathan Baecker
Hello everybody,
We run a 3-node self-hosted cluster with GlusterFS. I had a lot of
problems upgrading oVirt from 4.4.10 to 4.5.0.2, and now we have cluster
instability.
First I will write down the problems I had while upgrading, so you get the
bigger picture:
* The engine update went fine.
* But the nodes I could not update because of a wrong imgbase version,
  so I did a manual update to 4.5.0.1 and later to 4.5.0.2. The first
  time after updating they still booted into 4.4.10, so I did a reinstall.
* Then after the second reboot I ended up in emergency mode. After a
  long search I figured out that lvm.conf now uses *use_devicesfile*,
  but with the wrong filters. So I commented that out and added the old
  filters back. I did this on all 3 nodes (a sketch of the change is
  below, after this list).
* Then in Cockpit I saw errors on all nodes like:
  |ovs|00077|stream_ssl|ERR|Private key must be configured to use SSL|
  To fix that I ran *vdsm-tool ovn-config [engine IP] ovirtmgmt*, and
  later in the web interface I chose "Enroll Certificate" for every node.
* Between upgrading the nodes I was a bit too fast migrating all
  running VMs, including the HostedEngine, from one host to another,
  and the hosted engine crashed once. But it came back after a few
  minutes and since then the engine has been running normally.
* Then I finished the upgrade by updating the cluster compatibility
  version to 4.7.
* I noticed some unsynced volume warnings, but because I had seen these
  after upgrades in the past too, I thought they would disappear after
  some time. The next day they were still there, so I put the nodes
  into maintenance mode again and restarted the glusterd service. After
  some time the sync warnings were gone (the heal state can also be
  checked directly; see the sketch below).
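A minimal sketch of the lvm.conf change described in the item above; the device path in the filter is only an example and has to match the actual local disks and Gluster bricks:

# show the relevant settings
grep -nE 'use_devicesfile|^[[:space:]]*filter' /etc/lvm/lvm.conf
# in the devices { } section, roughly:
#   use_devicesfile = 0
#   filter = ["a|^/dev/sdb|", "a|^/dev/mapper/|", "r|.*|"]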
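And a hedged example of checking the unsynced entries directly instead of only waiting; the volume name vmstore is an assumption and would have to be repeated for the engine/data volumes:

# show pending heal entries per brick
gluster volume heal vmstore info summary
# kick off a heal if entries are pending
gluster volume heal vmstore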
So now the actual problem:
Since then the cluster has been unstable. I get different errors and
warnings, like:
* VM [name] is not responding
* HA VMs get migrated out of nowhere
* VM migrations can fail
* VM backups with snapshotting and export take very long
* VMs sometimes get very slow
* Storage domain vmstore experienced a high latency of 9.14251
* ovs|00001|db_ctl_base|ERR|no key "dpdk-init" in Open_vSwitch record
  "." column other_config
* 489279 [1064359]: s8 renewal error -202 delta_length 10 last_success
489249
* 444853 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* 471099 [2243175]: s27 delta_renew read timeout 10 sec offset 0
/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids
* many of: 424035 [2243175]: s27 delta_renew long write time XX sec
I will attach the sanlock.log messages and vdsm.log here; a few commands
to narrow down the storage-side latency are sketched below.
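To narrow things down on the storage side, something like the following could be run on the host that logs the renewal errors; the volume name vmstore is taken from the mount path above, the rest is only a sketch:

# sanlock's own view of its lockspaces and renewal state
sanlock client status
# per-brick latency statistics for the volume
gluster volume profile vmstore start
gluster volume profile vmstore info
# a direct read of the ids file shows whether plain reads really stall for >10 s
time dd if=/rhev/data-center/mnt/glusterSD/onode1.example.org:_vmstore/3cf83851-1cc8-4f97-8960-08a60b9e25db/dom_md/ids of=/dev/null bs=1M iflag=direct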
Is there a way I can fix these issues?
Regards!
Jonathan
Install OKD 4.10 with Custom oVirt Certificate
by Fredrik Arneving
Hi,
I've set up and run Installer-Provisioned Installations of OKD on several occasions with OKD versions 4.4 - 4.8 on my oVirt (4.3?)/4.4 platform. However, after installing a custom certificate for my self-hosted oVirt engine I have problems getting the installation of OKD 4.10 (and 4.8) to complete. Is this a known problem with a known solution I can read up on somewhere?
The install takes three times as long as the working ones did before, and when I look at the pods and cluster operators, the "authentication" ones are in a bad state. I can use the KUBECONFIG environment variable to list pods and interact with the environment, but "oc login" fails with "unknown issuer".
I had the choice of a "full install" of my custom cert or just the GUI/Web part, and I chose the latter. When installing the custom cert I followed the official RHV documentation that an oVirt user had pointed to in a forum. Whatever certs I didn't change seemed to have worked before, so I would be surprised if the solution is to go for the "full install". In all other cases (like my Foreman server and my FreeIPA server) oVirt works just fine with its custom cert.
Since I've done this before, I'm pretty sure I've correctly followed the OKD installation instructions. What's new is the custom oVirt hosted-engine cert. Is there detailed documentation on exactly which certificates from my oVirt installation should be added to the "additionalTrustBundle" in OKD to make it work? In my previous working installations I added the custom root CA since I needed it for other purposes, but maybe I need to add some other internal oVirt CA?
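For reference, a minimal sketch of what adding the oVirt internal CA to the trust bundle could look like; engine.example.org is a placeholder for the engine FQDN, and whether this particular CA is the missing piece is exactly the open question:

# fetch the engine's internal CA certificate in PEM form
curl -k 'https://engine.example.org/ovirt-engine/services/pki-resource?resource=ca-certificate&format=X509-PEM-CA' -o ovirt-internal-ca.pem
# then paste its contents (plus the custom root CA already in use) into
# install-config.yaml under:
#   additionalTrustBundle: |
#     -----BEGIN CERTIFICATE-----
#     ...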
I'm currently running oVirt version "4.4.10.7-1.el8" on CentOS Stream release 8 and OKD version "4.10.0-0.okd-2022-03-07-131213". No hardware changes between working installations and failed ones.
Any hints on how to solve this would be appreciated
why so many such logs ?
by tommy
In the new 4.5 version we see a lot of OVN synchronization entries in the engine log, very frequently, which we did not see in previous versions.
Is this a new feature?
about the bridge of the host
by tommy
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovirtmgmt state UP group default qlen 1000
link/ether 08:00:27:94:4d:e8 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 9e:5d:8f:94:00:86 brd ff:ff:ff:ff:ff:ff
4: br-int: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ea:20:e5:c3:d6:31 brd ff:ff:ff:ff:ff:ff
5: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 08:00:27:94:4d:e8 brd ff:ff:ff:ff:ff:ff
inet 10.1.1.7/24 brd 10.1.1.255 scope global noprefixroute ovirtmgmt
valid_lft forever preferred_lft forever
21: ip_vti0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
22: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1e:cb:bf:02:f7:33 brd ff:ff:ff:ff:ff:ff
What are items 3/4/5/21/22 used for? (I know item 5.)
Are they all bridges?
The output of brctl show indicates that only ovirtmgmt and ;vdsmdummy; are bridges.
[root@host1 ~]# brctl show
bridge name bridge id STP enabled interfaces
;vdsmdummy; 8000.000000000000 no
ovirtmgmt 8000.080027944de8 no enp0s3
[root@host1 ~]#
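For what it's worth, brctl only knows about kernel (Linux) bridges; ovs-system and br-int are Open vSwitch devices (br-int is OVN's integration bridge), ;vdsmdummy; is a dummy bridge created by VDSM, and ip_vti0 is just the base device that appears when the ip_vti tunnel module is loaded. The OVS side can be listed like this:

# list Open vSwitch bridges and their ports
ovs-vsctl show
# detailed type information for a single device
ip -d link show br-int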
infiniband for VM traffic
by Roberto Bertucci
Hi all,
I am trying to use a Mellanox 100G InfiniBand interface (EoIB) for VM traffic.
When trying to configure the hosts to use it, I get an error, and in vdsm.log I see:
The bridge ovirtib cannot use IP over InfiniBand interface ib0 as port. Please use RoCE interface instead.
ib0 is configured with an IP address and is working correctly; it is used to mount NFS directories on the cluster nodes.
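As the error above says, VDSM will not enslave an IPoIB device to a VM bridge, so for VM traffic the port would have to present itself as Ethernet (RoCE). A hedged sketch, assuming a ConnectX VPI adapter and the Mellanox firmware tools (mft) installed; the /dev/mst device name is hypothetical, and switching the link type also removes ib0, so the NFS mounts would need to move as well:

mst start
mlxconfig -d /dev/mst/mt4119_pciconf0 query | grep LINK_TYPE
# 1 = InfiniBand, 2 = Ethernet; the change takes effect after a reboot
mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2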
Did anybody face this issue?
Thank you all for help.
VM HostedEngine is down with error
by souvaliotimaria@mail.com
Hello everyone,
I have a replica 2 + arbiter installation, and this morning the Hosted Engine gave the following error in the UI and resumed on a different node (node3) than the one it was originally running on (node1). (The original node has more memory than the one it ended up on, but the latter had a better memory usage percentage at the time.) Also, the only way I discovered that the migration had happened and that there was an Error in Events was that I logged in to the oVirt web interface for a routine inspection. Besides that, everything was working properly and still is.
The error that popped is the following:
VM HostedEngine is down with error. Exit message: internal error: qemu unexpectedly closed the monitor:
2020-09-01T06:49:20.749126Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2020-09-01T06:49:20.927274Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,id=ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,bootindex=1,write-cache=on: Failed to get "write" lock
Is another process using the image?.
From what I could gather, this concerns the following snippet from HostedEngine.xml; it's the virtio disk of the Hosted Engine:
<disk type='file' device='disk' snapshot='no'>
<driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
<source file='/var/run/vdsm/storage/80f6e393-9718-4738-a14a-64cf43c3d8c2/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7'>
<seclabel model='dac' relabel='no'/>
</source>
<target dev='vda' bus='virtio'/>
<serial>d5de54b6-9f8e-4fba-819b-ebf6780757d2</serial>
<alias name='ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
I've tried looking into the logs and at sar output, but I couldn't find anything to relate to the above errors or to determine why this happened. Is this a Gluster or a QEMU problem?
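The "Failed to get "write" lock" message generally means some other process still had the disk image open when node3 tried to start the engine (often a qemu left behind on the original node). A hedged sketch of what could be checked on each host, using the image UUID from the XML snippet above:

# any qemu domains libvirt still knows about on this host
virsh -r list --all
# any process that still holds the HE disk image open
lsof 2>/dev/null | grep d5de54b6-9f8e-4fba-819b-ebf6780757d2
# sanlock's view of its lockspaces and resources
sanlock client status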
The Hosted Engine had been manually migrated to node1 five days earlier.
Is there a standard practice I could follow to determine what happened and secure my system?
Thank you very much for your time,
Maria Souvalioti
Local Disk Usage
by mert tuncsav
Hello All,
We have I/O performance issues for some systems on oVirt. The storage type is NFS in a shared data center. We would like to use a local disk as a secondary data domain to deploy VMs in the shared data center. Is there any chance to configure that? We couldn't find any solution. Do you have suggestions?
Regards
adding host to cluster with ovs switch failing
by ravi k
Hello all,
I'm facing a strange error. I was able to add a host to a Linux-bridge-based cluster. However, if I try adding the host to a cluster with the OVS switch type, it fails. I can see that nmstate was able to create the ovirtmgmt bridge as well. At that point both the ovirtmgmt and the bond0.vlan interfaces have the IP assigned. It then fails and rolls back the config. A workaround I found to work was to add the host to a Linux bridge cluster first and then change the cluster to an OVS cluster.
Here's some background about the setup. The host is an AMD EPYC with OEL 8.6 installed. The OLVM manager is a standalone VM at 4.4.8. We have a bond0, and the IP is assigned to the bond0.1222 interface. The interfaces are in an LACP bond on the switch as well. I enabled debug logging in NetworkManager in the hope of finding some clues, but couldn't.
I know 4.4 is EOL. As this is a user mailing list, I thought I'd reach out in the hope that someone has seen a similar issue.
supervdsm log
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,336::plugin::172::root::(apply_changes) Nispor: desired network state {'name': 'bond0', 'type': 'bond', 'state': 'up', 'mac-address': 'e4:3d:1a:82:9f:c0', 'link-aggregation': {'port': ['ens10f0np0', 'ens5f0np0'], 'options': {'ad_actor_sys_prio': 65535, 'ad_actor_system': '00:00:00:00:00:00', 'ad_select': 'stable', 'ad_user_port_key': 0, 'all_slaves_active': 'dropped', 'arp_all_targets': 'any', 'arp_interval': 0, 'arp_validate': 'none', 'downdelay': 0, 'lacp_rate': 'slow', 'miimon': 100, 'min_links': 0, 'updelay': 0, 'use_carrier': True, 'xmit_hash_policy': 'layer2', 'arp_ip_target': ''}, 'mode': '802.3ad'}, 'ipv4': {'enabled': False}, 'ipv6': {'enabled': False}, 'mtu': 1500, 'lldp': {'enabled': False}, 'accept-all-mac-addresses': False, '_brport_options': {'name': 'bond0'}, '_controller': 'vdsmbr_6SMdIi3B', '_controller_type': 'ovs-bridge'}
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,336::plugin::172::root::(apply_changes) Nispor: desired network state {'name': 'ovirtmgmt', 'type': 'ovs-interface', 'state': 'up', 'mtu': 1500, 'ipv4': {'enabled': True, 'address': [{'ip': '10.129.221.19', 'prefix-length': 24}], 'dhcp': False, '_dns': {'server': ['10.150.5.100', '10.229.0.60'], 'search': [], '_priority': 0}, '_routes': [{'table-id': 329647082, 'destination': '0.0.0.0/0', 'next-hop-address': '10.129.221.1', 'next-hop-interface': 'ovirtmgmt'}, {'table-id': 329647082, 'destination': '10.129.221.0/24', 'next-hop-address': '10.129.221.19', 'next-hop-interface': 'ovirtmgmt'}, {'table-id': 254, 'destination': '0.0.0.0/0', 'next-hop-address': '10.129.221.1', 'next-hop-interface': 'ovirtmgmt'}], '_route_rules': [{'ip-from': '', 'ip-to': '10.129.221.0/24', 'priority': 3200, 'route-table': 329647082}, {'ip-from': '10.129.221.0/24', 'ip-to': '', 'priority': 3200, 'route-table': 329647082}]}, 'ipv6': {'enabled': False, '_routes':
[], '_route_rules': []}, 'mac-address': 'E4:3D:1A:82:9F:C0', '_brport_options': {'name': 'ovirtmgmt', 'vlan': {'mode': 'access', 'tag': 1222}}, '_controller': 'vdsmbr_6SMdIi3B', '_controller_type': 'ovs-bridge'}
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,336::plugin::172::root::(apply_changes) Nispor: desired network state {'name': 'vdsmbr_6SMdIi3B', 'state': 'up', 'type': 'ovs-bridge', 'bridge': {'port': [{'name': 'bond0'}, {'name': 'ovirtmgmt', 'vlan': {'mode': 'access', 'tag': 1222}}]}, 'ipv6': {'enabled': False}}
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,340::context::148::root::(register_async) Async action: Update profile uuid:d8c57758-f784-44f4-a33a-c050ec50b9b9 iface:bond0 type:bond started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,340::context::148::root::(register_async) Async action: Add profile: 623b6249-7cfa-4813-9ef6-4870ec6f3a79, iface:bond0, type:ovs-port started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,340::context::148::root::(register_async) Async action: Add profile: ed8f5cae-5400-42fd-a72e-645e1fa61a39, iface:ovirtmgmt, type:ovs-interface started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,341::context::148::root::(register_async) Async action: Add profile: 3572b137-2091-4825-b418-4d6966430cc1, iface:ovirtmgmt, type:ovs-port started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,341::context::148::root::(register_async) Async action: Add profile: bd45447d-f241-4d14-bf5b-28c3966c011d, iface:vdsmbr_6SMdIi3B, type:ovs-bridge started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,343::context::157::root::(finish_async) Async action: Update profile uuid:d8c57758-f784-44f4-a33a-c050ec50b9b9 iface:bond0 type:bond finished
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,349::context::157::root::(finish_async) Async action: Add profile: 623b6249-7cfa-4813-9ef6-4870ec6f3a79, iface:bond0, type:ovs-port finished
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,350::context::157::root::(finish_async) Async action: Add profile: ed8f5cae-5400-42fd-a72e-645e1fa61a39, iface:ovirtmgmt, type:ovs-interface finished
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,350::context::157::root::(finish_async) Async action: Add profile: 3572b137-2091-4825-b418-4d6966430cc1, iface:ovirtmgmt, type:ovs-port finished
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,350::context::157::root::(finish_async) Async action: Add profile: bd45447d-f241-4d14-bf5b-28c3966c011d, iface:vdsmbr_6SMdIi3B, type:ovs-bridge finished
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,350::context::148::root::(register_async) Async action: Activate profile uuid:bd45447d-f241-4d14-bf5b-28c3966c011d iface:vdsmbr_6SMdIi3B type: ovs-bridge started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,352::active_connection::201::root::(_activate_profile_callback) Connection activation initiated: iface=vdsmbr_6SMdIi3B type=ovs-bridge con-state=<enum NM_ACTIVE_CONNECTION_STATE_ACTIVATING of type NM.ActiveConnectionState>
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,355::active_connection::339::root::(_activation_progress_check) Connection activation succeeded: iface=vdsmbr_6SMdIi3B, type=ovs-bridge, con_state=<enum NM_ACTIVE_CONNECTION_STATE_ACTIVATING of type NM.ActiveConnectionState>, dev_state=<enum NM_DEVICE_STATE_IP_CONFIG of type NM.DeviceState>, state_flags=<flags NM_ACTIVATION_STATE_FLAG_IS_MASTER | NM_ACTIVATION_STATE_FLAG_LAYER2_READY of type NM.ActivationStateFlags>
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,355::context::157::root::(finish_async) Async action: Activate profile uuid:bd45447d-f241-4d14-bf5b-28c3966c011d iface:vdsmbr_6SMdIi3B type: ovs-bridge finished
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,355::context::148::root::(register_async) Async action: Reapply device config: bond0 bond d8c57758-f784-44f4-a33a-c050ec50b9b9 started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,358::device::83::root::(_reapply_callback) Device reapply failed on bond0 bond: error=nm-device-error-quark: Can't reapply changes to '802-3-ethernet.cloned-mac-address' setting (3), Fallback to device activation
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,358::context::148::root::(register_async) Async action: Activate profile uuid:d8c57758-f784-44f4-a33a-c050ec50b9b9 iface:bond0 type: bond started
MainProcess|jsonrpc/3::DEBUG::2022-05-26 14:32:25,360::active_connection::201::root::(_activate_profile_callback) Connection activation initiated: iface=bond0 type=bond con-state=<enum NM_ACTIVE_CONNECTION_STATE_ACTIVATING of type NM.ActiveConnectionState>
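The only hard error in the snippet above is the reapply failure of '802-3-ethernet.cloned-mac-address' on bond0, after which nmstate falls back to a full re-activation of the bond. Whether that is what breaks the host-deploy here is only a guess, but the property can at least be inspected and cleared with nmcli before retrying (the connection name bond0 is an assumption):

# show the cloned MAC currently stored on the bond profile
nmcli -g 802-3-ethernet.cloned-mac-address connection show bond0
# clearing it (or setting it explicitly to the bond's real MAC) might avoid
# the reapply conflict seen in the log
nmcli connection modify bond0 802-3-ethernet.cloned-mac-address ""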
Regards,
Ravi
Single-machine hosted-engine routing is not working
by Paul-Erik Törrönen
Hello,
I have oVirt 4.4 (latest that can be installed on RockyLinux 8.5)
running on a laptop with a self-hosted engine.
The setup was working fine after installation, but once I rebooted
(after having shut down all the VMs including the hosted engine), I can
no longer reach the oVirt console from any other computer on the same
subnet. The hosted engine does respond to ping from the host machine.
Logging onto the hosted engine from the serial console, I can only ping
the host machine. Any other address on the subnet is unreachable.
This seems to be some internal oVirt routing issue between the host and
the virtual machine, since stopping the firewall service makes no
difference, neither on the host nor on the hosted engine.
The host address is 192.168.42.2 and the hosted engine is 192.168.42.250.
broker.log says:
engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM
is up on this host with healthy engine
cpu_load_no_engine::142::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
System load total=0.0250, engine=0.0028, non-engine=0.0222
network::88::network.Network::(action) Successfully verified network status
mem_free::51::mem_free.MemFree::(action) memFree: 26884
mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt
in up state
engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM
is up on this host with healthy engine
agent.log says:
states::406::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine vm running on localhost
hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 3400)
ovn-controller.log says:
reconnect|INFO|ssl:192.168.42.250:6642: connected
ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt:
connecting to switch
rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt:
connecting...
rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
ovs-vswitchd.log says:
connmgr|INFO|br-int: added service controller
"punix:/var/run/openvswitch/br-int.mgmt"
bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.8
memory|INFO|68900 kB peak resident set size after 10.0 seconds
memory|INFO|handlers:5 ofconns:2 ports:1 revalidators:3 rules:9
connmgr|INFO|br-int<->unix#0: 6 flow_mods 10 s ago (5 adds, 1 deletes)
The only actual error on the host is in ovsdb-server.log:
jsonrpc|WARN|unix#13: receive error: Connection reset by peer
reconnect|WARN|unix#13: connection dropped (Connection reset by peer)
What else should I look at in order to figure out why the host no longer
routes packets correctly from and to the hosted engine?
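Since the hosted engine sits on the ovirtmgmt bridge, traffic towards the rest of the subnet is plain layer-2 forwarding rather than routing, so one thing worth checking is whether the frames ever leave the bridge. A rough sketch (the uplink NIC name is whichever interface is enslaved to ovirtmgmt on this host):

# confirm both the physical uplink and the engine's vnet tap are bridge ports
ip link show master ovirtmgmt
bridge link show
# watch whether ARP/ICMP from the engine actually reaches the physical NIC
tcpdump -ni ovirtmgmt arp or icmp
tcpdump -ni <uplink-nic> arp or icmp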
Poltsi