Network Interface Already In Use - Self-Hosted Install

Hi Guys & Girls,

<begin_rant> OK, so I am really, *really* starting to get fed up with this. I know this is probably my fault, but even if it is, the oVirt documentation isn't helping in any way (being... "less than clear"). What I would really like, instead of having to rely on the "black box" that is Ansible, is a simple set of clear-cut, step-by-step instructions, so that we actually *know* what is going on when attempting a Self-Hosted install. After all, oVirt's "competition" doesn't make things so difficult... <end_rant>

Now that I've got that off my chest: I'm trying to do a straightforward Self-Hosted install. I've followed the instructions in the oVirt doco pretty much to the letter, and I'm still having problems.

My (pre-install) set-up:

- A freshly installed server (oVirt_Node_1) running Rocky Linux 8.6 with 3 NICs: NIC_1, NIC_2, & NIC_3.
- There are three VLANs: VLAN_A (172.16.1.0/24), VLAN_B (172.16.2.0/24), & VLAN_C (172.16.3.0/24).
- NIC_1 & NIC_2 are formed into a bond (bond_1).
- bond_1 is an 802.3ad bond.
- bond_1 has 2 sub-interfaces: bond_1.a & bond_1.b.
- Interface bond_1.a is in VLAN_A.
- Interface bond_1.b is in VLAN_B.
- NIC_3 is sitting in VLAN_C.
- VLAN_A is the everyday "working" VLAN where the rest of the servers all sit (i.e. DNS servers, local repository server, etc, etc, etc), and where the oVirt Engine (OVE) will sit.
- VLAN_B is for data throughput to and from the Ceph iSCSI Gateways in our Ceph Storage Cluster. This is a dedicated, isolated VLAN with no gateway (i.e. only the oVirt hosting nodes and the Ceph iSCSI Gateways are on this VLAN).
- VLAN_C is for OOB management traffic. This is a dedicated, isolated VLAN with no gateway.

Everything is working. Everything can ping properly back and forth within the individual VLANs, and VLAN_A can ping out to the Internet via its gateway (172.16.1.1). Because we don't require iSCSI connectivity for the OVE (it's on a working local Gluster TSP volume), the iSCSI hasn't *yet* been implemented.
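For readers wanting to reproduce a layout like the one described above, here is a minimal sketch of how such a bond/VLAN setup might be built with nmcli on a RHEL-derivative host. The VLAN IDs (10 and 20), the host addresses (.10), and the connection names are assumptions for illustration only; none of them are given in the original post.

~~~
# --- Hypothetical sketch: substitute your own VLAN IDs, addresses, and ifnames ---

# 802.3ad (LACP) bond over the two NICs:
nmcli con add type bond con-name bond_1 ifname bond_1 bond.options "mode=802.3ad"
nmcli con add type ethernet con-name bond_1-port1 ifname NIC_1 master bond_1
nmcli con add type ethernet con-name bond_1-port2 ifname NIC_2 master bond_1

# VLAN sub-interfaces on the bond (IDs 10 and 20 are assumed):
nmcli con add type vlan con-name bond_1.a ifname bond_1.a dev bond_1 id 10 \
    ipv4.method manual ipv4.addresses 172.16.1.10/24 ipv4.gateway 172.16.1.1
nmcli con add type vlan con-name bond_1.b ifname bond_1.b dev bond_1 id 20 \
    ipv4.method manual ipv4.addresses 172.16.2.10/24

# OOB management NIC (no gateway on this isolated VLAN):
nmcli con add type ethernet con-name NIC_3 ifname NIC_3 \
    ipv4.method manual ipv4.addresses 172.16.3.10/24
~~~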
=> {"changed": false, "cmd": ["virsh", "net-start", "default"], "delta": "0:00:00.031972", "end": "2022-10-04 16:41:38.603454", "msg": "non-zero return code", "rc": 1, "start": "2022-10-04 16:41:38.571482", "stderr": "error: Failed to start network default\nerror: internal error: Network is already in use by interface bond_1.a", "stderr_lines": ["error: Failed to start network default", "error: internal error: Network is already in use by interface bond_1.a"], "stdout": "", "stdout_lines": []} [ ERROR ] Failed to execute stage 'Closing up': Failed getting local_vm_dir ~~~ The relevant lines from the log file (at least I think these are the relevant lines): ~~~ 2022-10-04 16:41:35,712+1100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:115 TASK [ovirt.ovirt.hosted_engine_setup : Update libvirt default network configuration, undefine] 2022-10-04 16:41:37,017+1100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 {'changed': False, 'stdout': '', 'stderr': "error: failed to get network 'default'\nerror: Network not found: no network with matching name 'default'", 'rc': 1, 'cmd': ['virsh', 'net-undefine', 'default'], 'start': '2022-10-04 16:41:35.806251', 'end': '2022-10-04 16:41:36.839780', 'delta': '0:00:01.033529', 'msg': 'non-zero return code', 'invocation': {'module_args': {'_raw_params': 'virsh net-undefine default', '_uses_shell': False, 'warn': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': [], 'stderr_lines': ["error: failed to get network 'default'", "error: Network not found: no network with matching name 'default'"], '_ansible_no_log': False} 2022-10-04 16:41:37,118+1100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 ignored: [localhost]: FAILED! 
=> {"changed": false, "cmd": ["virsh", "net-undefine", "default"], "delta": "0:00:01.033529", "end": "2022-10-04 16:41:36.839780", "msg": "non-zero return code", "rc": 1, "start": "2022-10-04 16:41:35.806251", "stderr": "error: failed to get network 'default'\nerror: Network not found: no network with matching name 'default'", "stderr_lines": ["error: failed to get network 'default'", "error: Network not found: no network with matching name 'default'"], "stdout": "", "stdout_lines": []} 2022-10-04 16:41:37,219+1100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:115 TASK [ovirt.ovirt.hosted_engine_setup : Update libvirt default network configuration, define] 2022-10-04 16:41:38,421+1100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:115 ok: [localhost] 2022-10-04 16:41:38,522+1100 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:115 TASK [ovirt.ovirt.hosted_engine_setup : Activate default libvirt network] 2022-10-04 16:41:38,823+1100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 {'changed': False, 'stdout': '', 'stderr': 'error: Failed to start network default\nerror: internal error: Network is already in use by interface bond_1.a', 'rc': 1, 'cmd': ['virsh', 'net-start', 'default'], 'start': '2022-10-04 16:41:38.571482', 'end': '2022-10-04 16:41:38.603454', 'delta': '0:00:00.031972', 'msg': 'non-zero return code', 'invocation': {'module_args': {'_raw_params': 'virsh net-start default', '_uses_shell': False, 'warn': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': [], 'stderr_lines': ['error: Failed to start network default', 'error: internal error: Network is already in use by interface bond_1.a'], '_ansible_no_log': False} 2022-10-04 16:41:38,924+1100 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:113 fatal: [localhost]: FAILED! 
=> {"changed": false, "cmd": ["virsh", "net-start", "default"], "delta": "0:00:00.031972", "end": "2022-10-04 16:41:38.603454", "msg": "non-zero return code", "rc": 1, "start": "2022-10-04 16:41:38.571482", "stderr": "error: Failed to start network default\nerror: internal error: Network is already in use by interface bond_1.a", "stderr_lines": ["error: Failed to start network default", "error: internal error: Network is already in use by interface bond_1.a"], "stdout": "", "stdout_lines": []} 2022-10-04 16:41:39,125+1100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 PLAY RECAP [localhost] : ok: 106 changed: 32 unreachable: 0 skipped: 61 failed: 1 2022-10-04 16:41:39,226+1100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:226 ansible-playbook rc: 2 2022-10-04 16:41:39,226+1100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:233 ansible-playbook stdout: 2022-10-04 16:41:39,226+1100 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils.run:236 ansible-playbook stderr: 2022-10-04 16:41:39,226+1100 DEBUG otopi.plugins.gr_he_ansiblesetup.core.misc misc._closeup:475 {'otopi_host_net': {'ansible_facts': {'otopi_host_net': ['ens0p1', 'bond_1.a', 'bond_1.b']}, '_ansible_no_log': False, 'changed': False}, 'ansible-playbook_rc': 2} 2022-10-04 16:41:39,226+1100 DEBUG otopi.context context._executeMethod:145 method exception Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/otopi/context.py", line 132, in _executeMethod method['method']() File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-ansiblesetup/core/misc.py", line 485, in _closeup raise RuntimeError(_('Failed getting local_vm_dir')) RuntimeError: Failed getting local_vm_dir 2022-10-04 16:41:39,227+1100 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed getting local_vm_dir 2022-10-04 16:41:39,228+1100 DEBUG otopi.context context.dumpEnvironment:765 ENVIRONMENT DUMP - BEGIN 2022-10-04 16:41:39,228+1100 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/error=bool:'True' 2022-10-04 16:41:39,228+1100 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/exceptionInfo=list:'[(<class 'RuntimeError'>, RuntimeError('Failed getting local_vm_dir',), <traceback object at 0x7f5210013088>)]' 2022-10-04 16:41:39,228+1100 DEBUG otopi.context context.dumpEnvironment:779 ENVIRONMENT DUMP - END ~~~ So, would someone please help me in getting this sorted - I mean, how are we supposed to do this install if the interface we need to connect to the box in the first place can't be used because it's "already in use"? Cheers Dulux-Oz

Hi Guys,

I'm giving this a bump because I really do need help in getting this resolved - please.

Cheers

Dulux-Oz

Hi All,

OK, so after much reading of logs, Ansible files, blog posts, documentation, and much gnashing of teeth, glasses of bourbon, language to make a sailor blush, tears, blood, sweat, and various versions of "DOH!", I finally worked out what was wrong - what *I* did wrong - and so I'm putting it down here so that the next person who comes along with the same (or a similar) issue doesn't have to go through what I went through. I'm also including a couple of suggestions for the devs/doco writers which (I believe) would have stopped me from making my mistake in the first place.

When I did my install I used the command:

~~~
hosted-engine --deploy --4 --ansible-extra-vars=he_ipv4_subnet_prefix=172.16.1
~~~

I did this because we're running an IPv4 network and because the oVirt Engine needs to be on the 172.16.1.0/24 network - that's what I thought the "he_ipv4_subnet_prefix" option did, and I was trying to let the deployment script know this in advance instead of having it discover this itself.

Now that I've gone back over *all* the doco, I realise that the "he_ipv4_subnet_prefix" option is *not* used for this purpose, but is instead used for the *temporary* IP subnet of the deployment engine when the default subnet of 192.168.222.0/24 is not available. Because I was specifying the 172.16.1.0/24 network (which is already in use), the deployment failed when it attempted to create that network as a temporary network for the initial deployment.

So yes, as I said, my fault - no question about that at all.

Some suggestions:

Although it is stated in the documentation - Installing oVirt As A Self-Hosted Engine Using The Command Line, section 2.3.2 (https://www.ovirt.org/documentation/installing_ovirt_as_a_self-hosted_engine...) - (I believe) it is not very clear what is happening here, so a "Note:" or some sort of statement explicitly stating what this option is used for might be in order. For example, here is the note I made for our team in our internal documentation:

~~~
**Note:** he_ipv4_subnet_prefix=x.x.x: This is a temporary network prefix used when
192.168.222.0/24 (the default) is not available - this is ***NOT*** the final
working subnet of the oVirt Engine.
~~~

I also believe - quite strongly, in fact - that having the entire deployment hidden behind the "black box" that is the Ansible deployment - while making things easy by automating the deployment - makes troubleshooting more difficult. I believe that if there was a definite step-by-step list of what goes on behind the scenes - perhaps as an appendix to the documentation - then the mistake I made would have been a lot harder to make - i.e. with such a list it would have been much less likely that I'd make the assumption I made.

I'm thinking something along the lines of (and I am aware that what follows is not correct):

~~~
1. Collect info - this is stored in "/path/file" temporarily.
2. Install Deployment VM.
3. Deployment VM creates internal bridge - this uses 192.168.222.0/24 by default
   but can be overridden by "he_ipv4_subnet_prefix".
4. Deployment Engine creates oVirt Engine.
etc, etc, etc
~~~

Anyway, that's my feedback / suggestions / mea culpa / whatever. :-)

Cheers

Dulux-Oz
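For anyone retrying after the same mistake, a corrected invocation leaves the temporary subnet at its default, or points the option at a genuinely unused prefix. A minimal sketch follows; the 192.168.224 prefix is an arbitrary example of an unused /24, not a value from the thread, and running `ovirt-hosted-engine-cleanup` first to clear the failed attempt is a common step - review what it removes before running it on your host:

~~~
# Clean up the remnants of the failed deployment attempt (check suitability
# for your environment first):
ovirt-hosted-engine-cleanup

# Redeploy, letting the deployment use the default 192.168.222.0/24 temporary subnet:
hosted-engine --deploy --4

# Or, if 192.168.222.0/24 clashes with your network, pick an UNUSED prefix
# (192.168.224 is just an example):
hosted-engine --deploy --4 --ansible-extra-vars=he_ipv4_subnet_prefix=192.168.224
~~~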

Hi,

On Tue, Oct 11, 2022 at 9:10 AM Matthew J Black <matthew@peregrineit.net> wrote:
> Hi All,
>
> OK, so after much reading of logs, Ansible files, blog posts, documentation, and much gnashing of teeth, glasses of bourbon, language to make a sailor blush, tears, blood, sweat, and various versions of "DOH!", I finally worked out what was wrong - what *I* did wrong - and so I'm putting it down here so that the next person who comes along with the same (or a similar) issue doesn't have to go through what I went through. I'm also including a couple of suggestions for the devs/doco writers which (I believe) would have stopped me from making my mistake in the first place.
Much appreciated!
> When I did my install I used the command:
>
> ~~~
> hosted-engine --deploy --4 --ansible-extra-vars=he_ipv4_subnet_prefix=172.16.1
> ~~~
>
> I did this because we're running an IPv4 network and because the oVirt Engine needs to be on the 172.16.1.0/24 network - that's what I thought the "he_ipv4_subnet_prefix" option did, and I was trying to let the deployment script know this in advance instead of having it discover this itself.
>
> Now that I've gone back over *all* the doco, I realise that the "he_ipv4_subnet_prefix" option is *not* used for this purpose, but is instead used for the *temporary* IP subnet of the deployment engine when the default subnet of 192.168.222.0/24 is not available.
>
> Because I was specifying the 172.16.1.0/24 network (which is already in use), the deployment failed when it attempted to create that network as a temporary network for the initial deployment.
>
> So yes, as I said, my fault - no question about that at all.
>
> Some suggestions:
>
> Although it is stated in the documentation - Installing oVirt As A Self-Hosted Engine Using The Command Line, section 2.3.2 (https://www.ovirt.org/documentation/installing_ovirt_as_a_self-hosted_engine...) - (I believe) it is not very clear what is happening here, so a "Note:" or some sort of statement explicitly stating what this option is used for might be in order. For example, here is the note I made for our team in our internal documentation:
>
> ~~~
> **Note:** he_ipv4_subnet_prefix=x.x.x: This is a temporary network prefix used when
> 192.168.222.0/24 (the default) is not available - this is ***NOT*** the final
> working subnet of the oVirt Engine.
> ~~~
I've now read the subsection you linked to above, and IMO the context is well-presented - if you read the entirety of 2.3.2 (6 lines, in my browser), it should be clear. But of course - patches are welcome! This page has, like most others on the website, an "Edit this page" link at the bottom.
> I also believe - quite strongly, in fact - that having the entire deployment hidden behind the "black box" that is the Ansible deployment - while making things easy by automating the deployment - makes troubleshooting more difficult. I believe that if there was a definite step-by-step list of what goes on behind the scenes - perhaps as an appendix to the documentation - then the mistake I made would have been a lot harder to make - i.e. with such a list it would have been much less likely that I'd make the assumption I made.
>
> I'm thinking something along the lines of (and I am aware that what follows is not correct):
>
> ~~~
> 1. Collect info - this is stored in "/path/file" temporarily.
> 2. Install Deployment VM.
> 3. Deployment VM creates internal bridge - this uses 192.168.222.0/24 by default
>    but can be overridden by "he_ipv4_subnet_prefix".
> 4. Deployment Engine creates oVirt Engine.
> etc, etc, etc
> ~~~
Makes sense, but I do not think doing this well - and, beyond that, maintaining it well over time and across versions - is going to happen.

We have a very nice presentation from a few years ago, still relevant even if not up-to-date, which might help you get the big picture. Searching Google for "ovirt hosted-engine deep dive" finds it, for me: https://www.ovirt.org/media/Hosted-Engine-4.3-deep-dive.pdf

BTW, in the long-distant past, hosted-engine deployment was much more manual (the script guided you through stuff, but you did a lot more by hand - including installing the OS and engine on the VM, configuring stuff, etc.), and the move to what we have now (called "node zero" or "node 0" in some places, including the above PDF) was definitely a huge improvement.
> Anyway, that's my feedback / suggestions / mea culpa / whatever. :-)
Thanks!

Best regards,
--
Didi