On Tue, Apr 23, 2024 at 6:57 AM Levi Wilbert <stop.play.rwd(a)gmail.com> wrote:
I had this same issue on oVirt Node 4.5.5; however, I did not see the same
code in
/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml
on the hosted engine.
On my version 4.5.5, I have two blocks: one installs ovs and ensures Open
vSwitch is started, the second block installs the ovirt-provider-ovn-driver
and configures OVN (as well as some other steps).
Hi,
I would like to clarify what is happening.
For the first block, my when statement shows as:
when:
- cluster_switch == "ovs" or (ovn_central is defined)
For the second block, it shows:
when:
- ovn_central is defined
In Ansible, inside a when: statement, multiple lines beginning with "-"
are equivalent to AND conditions. For example:
when:
- this == true
- that == true
This would be equivalent to when: (this == true) and (that == true).
This condition is actually the problem. If you take a look at the previous
one, the key thing is "ovn_central | ipaddr": it expects a valid IP
address, otherwise the condition will be false. When the condition is only
"ovn_central is defined", however, it will also be true for an empty string.
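To make the difference concrete, here is a rough Python analogue of the two conditions. It is only a sketch: is_defined(), looks_like_ip() and when() are hypothetical helpers that imitate the Ansible behaviour described above (they are not Ansible code), and the standard ipaddress module stands in for the ipaddr filter.

```python
import ipaddress


def is_defined(variables, name):
    """Rough analogue of Ansible's `name is defined` test."""
    return name in variables


def looks_like_ip(value):
    """Rough analogue of a truthy `value | ipaddr` result."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def when(*conditions):
    """A `when:` list is true only if every entry is true (AND)."""
    return all(conditions)


# Assumption for illustration: the engine passed ovn_central as "".
host_vars = {"ovn_central": ""}

# Condition that only checks that the variable exists.
loose = when(is_defined(host_vars, "ovn_central"))

# Condition that additionally requires a valid IP address.
strict = when(
    is_defined(host_vars, "ovn_central"),
    looks_like_ip(host_vars["ovn_central"]),
)

print(loose, strict)  # True False
```

With an empty string the "is defined" check passes, so the block runs, while the check that also requires a valid IP would have skipped it.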
I didn't want to toy with the control logic, but I realized that this was
a non-issue. The error occurs in the Configuring OVN step, which in my
configure.yml is near the end of the second block. The when statements
are working fine, otherwise it wouldn't be executing those steps.
I dug in further, and the issue comes about when the installer attempts to
run:
vdsm-tool ovn-config <IP-Central> <FQDN>
I tried this on my own system:
[root@b-drone11 ~]# vdsm-tool ovn-config 10.99.8.31 b-drone11.arcc.uwyo.edu
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 117, in get_network
    return networks[net_name]
KeyError: 'b-drone11.arcc.uwyo.edu'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/vdsm-tool", line 195, in main
    return tool_command[cmd]["command"](*args)
  File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 63, in ovn_config
    ip_address = get_ip_addr(get_network(network_caps(), net_name))
  File "/usr/lib/python3.9/site-packages/vdsm/tool/ovn_config.py", line 119, in get_network
    raise NetworkNotFoundError(net_name)
vdsm.tool.ovn_config.NetworkNotFoundError: b-drone11.arcc.uwyo.edu
It's the same error as in the host-deploy logs. If you dig in a bit more,
you'll find that in the ovn_config.py script referred to by the above
output, there's a function get_network() that is throwing the error:
def get_network(net_caps, net_name):
    networks = net_caps['networks']
    try:
        return networks[net_name]
    except KeyError:
        raise NetworkNotFoundError(net_name)
Digging in EVEN further, if you look at where the function is called and
how the "net_name" variable comes in, you'll find that it's only run when
an FQDN is given as an argument to vdsm-tool ovn-config instead of an IP:
if is_ipaddress(args[2]):
    ip_address = args[2]
else:
    net_name = args[2]
    ip_address = get_ip_addr(get_network(network_caps(), net_name))
    if not ip_address:
        raise IpAddressNotFoundError(net_name)
If you look just above this block in ovn_config.py, you can see the
docstring below. It states that the second argument is an IP or a network
name, and that the FQDN comes only after that. That ties back to the
Ansible condition: we are getting the second parameter as an empty string.
"""
ovn-config IP-central [tunneling-IP|tunneling-network] host-fqdn
Configures the ovn-controller on the host.

Parameters:
IP-central - the IP of the engine (the host where OVN central is located)
tunneling-IP - the local IP which is to be used for OVN tunneling
tunneling-network - the vdsm network meant to be used for OVN tunneling
host-fqdn - FQDN that will be set as system-id for OvS (optional)
"""
Now, this is as far as I got. As for WHY the get_network() function isn't
working, I haven't looked further into the oVirt code and can't say. But it
appears that this function somehow fails when attempting to resolve FQDNs.
Which brings me to the WORKAROUND!
So get_network() isn't really buggy in this sense; it expects a network
name, not an FQDN.
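A small self-contained sketch of that dispatch may help. It mirrors the branch quoted from ovn_config.py but is not the vdsm code: resolve_tunnel_ip() is a hypothetical helper, and the "ovirtmgmt" network with address 10.99.8.41 is made-up data standing in for what network_caps() would return on the host.

```python
import ipaddress


class NetworkNotFoundError(Exception):
    """Stand-in for the vdsm exception of the same name."""


def is_ipaddress(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def resolve_tunnel_ip(second_arg, networks):
    """An IP is used as-is; anything else is treated as a network name."""
    if is_ipaddress(second_arg):
        return second_arg
    try:
        return networks[second_arg]
    except KeyError:
        raise NetworkNotFoundError(second_arg)


# Hypothetical network table: network name -> local IP.
networks = {"ovirtmgmt": "10.99.8.41"}

# A tunneling IP or a known network name both resolve.
print(resolve_tunnel_ip("10.99.8.41", networks))
print(resolve_tunnel_ip("ovirtmgmt", networks))

# An FQDN in the second position is looked up as a network name and fails.
try:
    resolve_tunnel_ip("b-drone11.arcc.uwyo.edu", networks)
except NetworkNotFoundError as err:
    print("NetworkNotFoundError:", err)
```

The FQDN is never resolved through DNS in this path; it is only ever used as a dictionary key, which is why the lookup raises.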
Since the error lies in translating an FQDN to an IP, if you instead
provide an IP address in the first place, it completely bypasses the buggy
get_network() function, and lets you add a host.
This is actually not a workaround, but proper initialization of how it is
supposed to be done.
So, when you run the host deploy, if you add the host using its IP
address vs. its FQDN, it goes through fine, and I've tested this on my
cluster and it worked beautifully.
The only caveat is you can't add with the FQDN, but for now, our cluster
is up and working.
With that being said, the problem is somewhere in the engine, in how it
propagates "ovn_central" and why that ends up being an empty string.
Hopefully this helps.
Best regards,
Ales
--
Ales Musil
Senior Software Engineer - OVN Core
Red Hat EMEA <https://www.redhat.com>
amusil(a)redhat.com
<https://red.ht/sig>