Host deploy failure: Configure OVN for oVirt

Good day,

I am running into a problem during host deploy via the oVirt engine GUI since upgrading to ovirt-engine-4.5.5-1.el8. The "Configure OVN for oVirt" task seems to fail when trying to run the vdsm-tool ovn-config command. Host deploy used to work fine when the engine was on version 4.5.4.

Can anyone guide me on the right path to get past this issue? It does not seem to be a new problem: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/IDLGSBQFX35EHHG...

Log extract:

2024-02-01 14:48:19 SAST - TASK [ovirt-provider-ovn-driver : Configure OVN for oVirt] *********************
. . .
"stdout" : "fatal: [mob-r1-l-ovirt-aa-1-23.x.fnb.co.za]: FAILED! => {\"changed\": true, \"cmd\": [\"vdsm-tool\", \"ovn-config\", \"192.168.2.100\", \"host23.mydomain.com\"], \"delta\": \"0:00:00.538143\", \"end\": \"2024-02-01 14:48:20.596823\", \"msg\": \"non-zero return code\", \"rc\": 1, \"start\": \"2024-02-01 14:48:20.058680\", \"stderr\": \"Traceback (most recent call last):\\n File \\\"/usr/lib/python3.6/site-packages/vdsm/tool/ovn_config.py\\\", line 117, in get_network\\n return networks[net_name]\\nKeyError: 'host23.mydomain.com'\\n\\nDuring handling of the above exception, another exception occurred:\\n\\nTraceback (most recent call last):\\n File \\\"/usr/bin/vdsm-tool\\\", line 195, in main\\n return tool_command[cmd][\\\"command\\\"](*args)\\n File \\\"/usr/lib/python3.6/site-packages/vdsm/tool/ovn_config.py\\\", line 63, in ovn_config\\n ip_address = get_ip_addr(get_network(network_caps(), net_name))\\n File \\\"/usr/lib/python3.6/site-packages/vdsm/tool/ovn_config.py\\\", line 119, in get_network\\n raise NetworkNotFoundError(net_name)\\nvdsm.tool.ovn_config.NetworkNotFoundError: host23.mydomain.com\", \"stderr_lines\": [\"Traceback (most recent call last):\", \" File \\\"/usr/lib/python3.6/site-packages/vdsm/tool/ovn_config.py\\\", line 117, in get_network\", \" return networks[net_name]\", \"KeyError: 'host23.mydomain.com'\", \"\", \"During handling of the above exception, another exception occurred:\", \"\", \"Traceback (most recent call last):\", \" File \\\"/usr/bin/vdsm-tool\\\", line 195, in main\", \" return tool_command[cmd][\\\"command\\\"](*args)\", \" File \\\"/usr/lib/python3.6/site-packages/vdsm/tool/ovn_config.py\\\", line 63, in ovn_config\", \" ip_address = get_ip_addr(get_network(network_caps(), net_name))\", \" File \\\"/usr/lib/python3.6/site-packages/vdsm/tool/ovn_config.py\\\", line 119, in get_network\", \" raise NetworkNotFoundError(net_name)\", \"vdsm.tool.ovn_config.NetworkNotFoundError: host23.mydomain.com\"], \"stdout\": \"\", \"stdout_lines\": []}",

Thanks in advance!!
Stephan

Turns out the "when" conditions on the blocks in /usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml were wrong. I changed them to match ovirt-engine 4.5.4, and host deploy now works on ovirt-engine 4.5.5.

. . .
  when:
    - cluster_switch == "ovs" or (ovn_central is defined and ovn_central | ipaddr)
. . .
  when:
    - ovn_central is defined
    - ovn_central | ipaddr

I had this same issue on oVirt Node 4.5.5; however, I did not see the same code in /usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml on the hosted engine.

On my version 4.5.5, I have two blocks: the first installs OVS and ensures Open vSwitch is started; the second installs the ovirt-provider-ovn-driver and configures OVN (as well as some other steps).

For the first block, my when statement shows as:

  when:
    - cluster_switch == "ovs" or (ovn_central is defined)

For the second block, it shows:

  when:
    - ovn_central is defined

In Ansible, inside a when: statement, multiple lines beginning with "-" are equivalent to AND conditions. For example:

  when:
    - this == true
    - that == true

This would be equivalent to when: (this == true) and (that == true).

I didn't want to toy with the control logic, but I realized that this was a non-issue. The error occurs in the "Configure OVN for oVirt" step, which in my configure.yml is near the end of the second block. The when statements are working fine; otherwise it wouldn't be executing those steps.

I dug in further, and the issue comes about when the installer attempts to run:

  vdsm-tool ovn-config <IP-Central> <FQDN>
I tried this on my own system:

[root@b-drone11 ~]# vdsm-tool ovn-config 10.99.8.31 b-drone11.arcc.uwyo.edu
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 117, in get_network
    return networks[net_name]
KeyError: 'b-drone11.arcc.uwyo.edu'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 195, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 63, in ovn_config
    ip_address = get_ip_addr(get_network(network_caps(), net_name))
  File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 119, in get_network
    raise NetworkNotFoundError(net_name)
vdsm.tool.ovn_config.NetworkNotFoundError: b-drone11.arcc.uwyo.edu

It's the same error as in the host-deploy logs. If you dig in a bit more, you'll find that in the ovn_config.py script referred to by the above output, there's a function get_network() that is throwing the error:

def get_network(net_caps, net_name):
    networks = net_caps['networks']
    try:
        return networks[net_name]
    except KeyError:
        raise NetworkNotFoundError(net_name)

Digging in EVEN further, if you look at where the function is called and how the "net_name" variable comes in, you'll find that it's only run when a FQDN is given as an argument to vdsm-tool ovn-config instead of an IP:

if is_ipaddress(args[2]):
    ip_address = args[2]
else:
    net_name = args[2]
    ip_address = get_ip_addr(get_network(network_caps(), net_name))
    if not ip_address:
        raise IpAddressNotFoundError(net_name)

Now, this is as far as I got. As for WHY the get_network() function isn't working, I haven't looked further into the oVirt code and can't say. But it appears that somehow this function fails when attempting to resolve FQDNs. Which brings me to the WORKAROUND!
Since the error lies in translating a FQDN to an IP, if you instead provide an IP address in the first place, it completely bypasses the buggy get_network() function and lets you add a host. So, when you run the host deploy, if you add the host using its IP address instead of its FQDN, it goes through fine. I've tested this on my cluster and it worked beautifully. The only caveat is that you can't add the host with the FQDN, but for now our cluster is up and working.
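To make the code path above concrete, here is a minimal, self-contained sketch of the branch quoted from ovn_config.py. It is not the vdsm source: `resolve_tunnel_ip`, the sample `caps` dictionary, the network name `ovirtmgmt`, its address, and the `'addr'` key are all assumptions for illustration, and the real `network_caps()` call is replaced by a plain dictionary.

```python
import ipaddress


class NetworkNotFoundError(Exception):
    """Raised when the given name is not a key in the networks dict."""


def is_ipaddress(value):
    # Stand-in for the helper used in ovn_config.py (assumption).
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def get_network(net_caps, net_name):
    # Same shape as the function quoted above: a plain dictionary lookup.
    networks = net_caps['networks']
    try:
        return networks[net_name]
    except KeyError:
        raise NetworkNotFoundError(net_name)


def resolve_tunnel_ip(arg, net_caps):
    # Hypothetical wrapper around the if/else branch quoted above.
    if is_ipaddress(arg):
        return arg
    return get_network(net_caps, arg)['addr']


# Hypothetical capabilities: one vdsm network with one address.
caps = {'networks': {'ovirtmgmt': {'addr': '10.99.8.11'}}}

print(resolve_tunnel_ip('10.99.8.11', caps))  # an IP is used as-is
print(resolve_tunnel_ip('ovirtmgmt', caps))   # a network name is looked up
try:
    resolve_tunnel_ip('b-drone11.arcc.uwyo.edu', caps)
except NetworkNotFoundError as exc:
    print('NetworkNotFoundError:', exc)
```

In this sketch, any string that is neither an IP address nor a key of the networks dictionary fails in the same way as the traceback above. The step that fails is a dictionary access on the name that was passed in.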

On Tue, Apr 23, 2024 at 6:57 AM Levi Wilbert <stop.play.rwd@gmail.com> wrote:
> I had this same issue on oVirt Node 4.5.5, however, I did not see the same code in /usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml on the hosted engine.
> On my version 4.5.5, I have two blocks: one installs ovs and ensures Open vSwitch is started, the second block installs the ovirt-provider-ovn-driver and configures OVN (as well as some other steps).
Hi, I would like to clarify what is happening.
> For the first block, my when statement shows as:
>   when:
>     - cluster_switch == "ovs" or (ovn_central is defined)
> For the second block, it shows:
>   when:
>     - ovn_central is defined
> In Ansible, inside a when: statement, multiple lines beginning with "-" are equivalent to AND conditions. For example:
>   when:
>     - this == true
>     - that == true
> This would be equivalent to when: (this == true) and (that == true).
This condition is actually the problem. If you take a look at the previous one, the key part is "ovn_central | ipaddr": it expects a valid IP address, otherwise the condition will be false. However, when the condition is only "ovn_central is defined", it will also be true for an empty string.
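A rough Python sketch of the difference between the two conditions, assuming the `ipaddr` filter acts as a validity check that is falsy for anything that is not an IP address. The function names are made up for illustration, the real filter accepts more input forms than this stand-in, and this is not how Ansible evaluates `when:` internally.

```python
import ipaddress


def is_valid_ip(value):
    # Approximates `ovn_central | ipaddr` used as a boolean test (assumption).
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def with_ipaddr_check(host_vars):
    # when:
    #   - ovn_central is defined
    #   - ovn_central | ipaddr
    return 'ovn_central' in host_vars and is_valid_ip(host_vars['ovn_central'])


def defined_only(host_vars):
    # when:
    #   - ovn_central is defined
    return 'ovn_central' in host_vars


# Three cases: variable missing, empty string, valid IP address.
for host_vars in ({}, {'ovn_central': ''}, {'ovn_central': '192.168.2.100'}):
    print(host_vars, with_ipaddr_check(host_vars), defined_only(host_vars))
```

For the empty-string case the first check is false and the second is true, so only the second variant would let the block run with an unusable value.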
> I didn't want to toy with the control logic, but I realized that this was a non-issue. The error in this occurs in the Configuring OVN step, which in my configure.yml is near the end of the second block. The when statements are working fine, otherwise it wouldn't be executing those steps.
> I dug in further, and the issue comes about when the installer attempts to run: vdsm-tool ovn-config <IP-Central> <FQDN>
> I tried this on my own system:
> [root@b-drone11 ~]# vdsm-tool ovn-config 10.99.8.31 b-drone11.arcc.uwyo.edu
> Traceback (most recent call last):
>   File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 117, in get_network
>     return networks[net_name]
> KeyError: 'b-drone11.arcc.uwyo.edu'
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/usr/bin/vdsm-tool", line 195, in main
>     return tool_command[cmd]["command"](*args)
>   File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 63, in ovn_config
>     ip_address = get_ip_addr(get_network(network_caps(), net_name))
>   File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 119, in get_network
>     raise NetworkNotFoundError(net_name)
> vdsm.tool.ovn_config.NetworkNotFoundError: b-drone11.arcc.uwyo.edu
>
> It's the same error as in the host-deploy logs. If you dig in a bit more, you'll find in the ovn_config.py script referred to by the above output, there's a function get_network() that is throwing the error:
> def get_network(net_caps, net_name):
>     networks = net_caps['networks']
>     try:
>         return networks[net_name]
>     except KeyError:
>         raise NetworkNotFoundError(net_name)
> Digging in EVEN further, if you look at where the function is called and how the "net_name" variable comes in, you'll find that it's only run when a FQDN is given as an argument to vdsm-tool ovn-config instead of an IP:
> if is_ipaddress(args[2]):
>     ip_address = args[2]
> else:
>     net_name = args[2]
>     ip_address = get_ip_addr(get_network(network_caps(), net_name))
>     if not ip_address:
>         raise IpAddressNotFoundError(net_name)
If you look just above this block in the source, you can see the comment below. It states that the second argument is an IP or a network name, and the FQDN comes only after that. So this is tied to the Ansible condition: we are getting the second parameter as an empty string.

"""
ovn-config IP-central [tunneling-IP|tunneling-network] host-fqdn
Configures the ovn-controller on the host.

Parameters:
IP-central - the IP of the engine (the host where OVN central is located)
tunneling-IP - the local IP which is to be used for OVN tunneling
tunneling-network - the vdsm network meant to be used for OVN tunneling
host-fqdn - FQDN that will be set as system-id for OvS (optional)
"""
> Now, this is as far I got. As far as WHY the get_network() function isn't working, I haven't looked further into the ovirt code and can't say. But it appears somehow this function fails when attempting to resolve FQDN's. Which brings me to the WORKAROUND!
So get_network() isn't really buggy in this sense: it expects a network name, not an FQDN.
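To illustrate how the FQDN ends up in the tunneling slot, here is a small sketch that maps the positional arguments onto the names from the usage text above. `parse_ovn_config_args` is a hypothetical helper, not part of vdsm-tool. It only mirrors the documented order: IP-central, then tunneling-IP or tunneling-network, then the optional host-fqdn.

```python
def parse_ovn_config_args(argv):
    # argv mirrors the "cmd" list from the host-deploy log, minus "vdsm-tool".
    # Hypothetical helper: positions follow the usage text quoted above.
    if len(argv) < 3:
        raise ValueError('expected: ovn-config IP-central tunneling [host-fqdn]')
    parsed = {'central': argv[1], 'tunneling': argv[2], 'host_fqdn': None}
    if len(argv) > 3:
        parsed['host_fqdn'] = argv[3]
    return parsed


# The command from the failing deploy: two arguments after the sub-command.
failing = parse_ovn_config_args(
    ['ovn-config', '192.168.2.100', 'host23.mydomain.com'])
print(failing)
```

With only two arguments after the sub-command, the host name lands in the `tunneling` position and `host_fqdn` stays unset, which matches the lookup failure in the log.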
> Since the error lies in translating a FQDN to an IP, if you instead provide an IP address in the first place, it completely bypasses the buggy get_network() function, and lets you add a host.
This is actually not a workaround; it is how the command is supposed to be called in the first place.
> So, when you run the host deploy, if you add the host using its IP address vs. its FQDN, it goes through fine, and I've tested this on my cluster and it worked beautifully.
> The only caveat is you can't add with the FQDN, but for now, our cluster is up and working.
That said, the problem is somewhere in the engine: in how it propagates "ovn_central" and why it ends up being an empty string.

Hopefully this helps.

Best regards,
Ales

--
Ales Musil
Senior Software Engineer - OVN Core
Red Hat EMEA <https://www.redhat.com>
amusil@redhat.com <https://red.ht/sig>

Hi! I have the same problem after I upgraded my hosts from Rocky 8 to 9. Before that, everything worked normally, but now I see that OVN is broken. Have you already found where the issue is?

We use ovirt-engine 4.5.5-1.el8, and for us the following did the trick:

1. First, decide whether you need the OVN/OVS functionality. This is a cluster-level decision!
   o If you need it, you must have a functional (you can test it) external network provider configured (Administration -> Providers -> External Network Provider).
      - As far as I know, a default “ovirt-provider-ovn” is already present in each installation.
   o Then, in the cluster options, choose the “Default Network Provider”. If you don’t need OVS/OVN, select “No Default Provider”; otherwise select the one you need.
   o If you change anything here, oVirt wants you to reinstall all the hosts in this cluster!
2. Then, replace the Ansible file on your ovirt-engine:
   o /usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml
   o The new version can be found here: https://github.com/oVirt/ovirt-engine/blob/master/packaging/ansible-runner-s...
     As far as I can tell, only the last line (the when statement) changes.
   Probably only people who don’t need OVN/OVS (setting: “No Default Provider”) had this problem.
3. Reinstall the hosts.

As always, know what you are doing before you do anything.
participants (5)
- Ales Musil
- Alexandr Mikhailov
- andrea.cavegn@netfon.ch
- Levi Wilbert
- stephan.badenhorst@fnb.co.za