I was able to fix the connectivity issues between all 3 hosts.
It turned out that I hadn't completely deleted the old VLAN settings from the host. I
re-ran "nmcli connection delete" on the old VLAN connection. After that, I had to edit a
network-scripts file and fix the bridge configuration so that it used ifcfg-ovirtmgmt.
Once I had done all of that, the problematic host was accessible again. All 3 Gluster
peers are now able to see each other and communicate over the management network.
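For anyone following along, the cleanup amounted to roughly the following. This is a sketch, not my exact configuration: the connection name "vlan1", the NIC name "eno1", and the IP address are all illustrative assumptions, and the ifcfg files are written to a scratch directory so the example is self-contained.

```shell
# On the real host, the stale profile was removed with something like:
#
#   nmcli connection delete vlan1      # profile name is an assumption
#
# and then the VLAN interface had to be enslaved to the ovirtmgmt bridge.
# A minimal ifcfg pair (hypothetical names/addresses):
tmp=$(mktemp -d)

cat > "$tmp/ifcfg-eno1.10" <<'EOF'
DEVICE=eno1.10
VLAN=yes
ONBOOT=yes
BRIDGE=ovirtmgmt
EOF

cat > "$tmp/ifcfg-ovirtmgmt" <<'EOF'
DEVICE=ovirtmgmt
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.1.0.11
PREFIX=24
EOF

# The key relationship: the VLAN interface carries no IP of its own and
# points at the bridge, which owns the management IP.
grep '^BRIDGE=' "$tmp/ifcfg-eno1.10"   # -> BRIDGE=ovirtmgmt
```

The important part was that the VLAN interface's BRIDGE= line pointed at ovirtmgmt; while that was wrong, the host stayed unreachable even though the VLAN itself was correct.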
From the command line, I was then able to successfully run "hosted-engine
--connect-storage" without errors, and after that, "hosted-engine --vm-start".
Unfortunately, the engine itself is still unstable, and when I access the web UI / oVirt
Manager, it shows that all 3 hosts are inaccessible and down.
I don't understand how the web UI is operational at all if the engine thinks that all
3 hosts are inaccessible. What's going on there?
Although the initial problem was my own doing (I changed the management VLAN), I'm
deeply concerned with how unstable everything became, and has remained, ever since
I lost connectivity to that one host. I thought the point of all of this was that things
would (and should) continue to work if one of the hosts went away.
Anyway, at this point, all 3 hosts are able to communicate with each other over the
management network, but the engine still thinks that all 3 hosts are down, and is unable
to manage anything.
Any suggestions on how to proceed would be much appreciated.
Sent with ProtonMail Secure Email.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, April 7, 2021 8:28 PM, David White <dmwhite823(a)protonmail.com> wrote:
I still haven't been able to resurrect the 1st host, so I've
spent some time trying to get the hosted engine stable. I would welcome input on how to
fix the problematic host so that it can be accessible again.
As per my original email, this all started when I tried to change the
management VLAN. I honestly cannot remember what I did (if anything) to the actual hosts
when this all started, but my troubleshooting steps today have been to fiddle with
the VLAN settings and /etc/sysconfig/network-scripts/ files on the problematic host to
switch from the original VLAN (1) to the new VLAN (10).
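Concretely, the change I've been attempting looks something like the following. The NIC name "eno1" is an assumption for illustration; the point is that both the ifcfg file name and its DEVICE= line encode the VLAN ID after the dot, so both have to change together.

```shell
tmp=$(mktemp -d)

# Old profile for VLAN 1 (hypothetical NIC name "eno1"):
cat > "$tmp/ifcfg-eno1.1" <<'EOF'
DEVICE=eno1.1
VLAN=yes
ONBOOT=yes
BRIDGE=ovirtmgmt
EOF

# Switching to VLAN 10 means renaming the file *and* updating DEVICE=,
# since the suffix after the dot is the VLAN ID:
sed 's/^DEVICE=eno1\.1$/DEVICE=eno1.10/' "$tmp/ifcfg-eno1.1" > "$tmp/ifcfg-eno1.10"
rm "$tmp/ifcfg-eno1.1"

grep '^DEVICE=' "$tmp/ifcfg-eno1.10"   # -> DEVICE=eno1.10
```

After a change like this, the host needs a network restart (or reboot, as I did) for the new VLAN interface to come up.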
In the meantime, I'm troubleshooting why the hosted engine isn't
really working, since the other two hosts are operational.
The hosted engine is "running" -- I can access and navigate
around the oVirt Manager.
However, it appears that all of the storage domains are down, and all of the hosts are
"NonOperational". I was, however, able to put two of the hosts into Maintenance
Mode, including the problematic 1st host.
This is what I see on the 2nd host:
[root@cha2-storage network-scripts]# gluster peer status
Number of Peers: 2

Hostname: cha1-storage.mgt.example.com
Uuid: 348de1f3-5efe-4e0c-b58e-9cf48071e8e1
State: Peer in Cluster (Disconnected)

Hostname: cha3-storage.mgt.example.com
Uuid: 0563c3e8-237d-4409-a09a-ec51719b0da6
State: Peer in Cluster (Connected)
[root@cha2-storage network-scripts]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure
that ovirt-ha-agent is running and the storage server is reachable.
[root@cha2-storage network-scripts]# hosted-engine --connect-storage
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/connect_storage_server.py", line 30, in <module>
    timeout=ohostedcons.Const.STORAGE_SERVER_TIMEOUT,
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 312, in connect_storage_server
    sserver.connect_storage_server(timeout=timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 394, in connect_storage_server
    'Connection to storage server failed'
RuntimeError: Connection to storage server failed
The ovirt-ha-agent service seems to be continuously trying to start, but failing:
[root@cha2-storage network-scripts]# systemctl status -l ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Wed 2021-04-07 20:24:46 EDT; 60ms ago
  Process: 124306 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 124306 (code=exited, status=157)
Some recent entries from /var/log/ovirt-hosted-engine-ha/agent.log:
MainThread::ERROR::2021-04-07 20:22:59,115::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2021-04-07 20:22:59,115::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2021-04-07 20:23:09,717::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.6 started
MainThread::INFO::2021-04-07 20:23:09,742::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-04-07 20:23:09,837::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-04-07 20:23:09,838::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-04-07 20:23:09,839::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
MainThread::ERROR::2021-04-07 20:23:09,842::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
MainThread::ERROR::2021-04-07 20:23:09,842::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2021-04-07 20:23:09,842::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
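The bottom of that traceback (unixrpc.py failing with [Errno 2]) suggests the agent cannot open the broker's unix socket, which would mean ovirt-ha-broker itself isn't up. A minimal check is sketched below; the socket path is an assumption on my part and should be verified against the actual install.

```shell
# Assumed default socket path for ovirt-ha-broker (verify on your host):
sock=/run/ovirt-hosted-engine-ha/broker.socket

if [ -S "$sock" ]; then
    msg="broker socket present"
else
    # This matches the FileNotFoundError in agent.log. On the host I would
    # next try: systemctl status ovirt-ha-broker, and if it's down,
    # restart ovirt-ha-broker before ovirt-ha-agent.
    msg="broker socket missing"
fi
echo "$msg"
```

If the broker is also crash-looping, its own log (/var/log/ovirt-hosted-engine-ha/broker.log) would be the next place I'd look.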
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, April 7, 2021 5:36 PM, David White via Users <users(a)ovirt.org> wrote:
> I'm working on setting up my environment prior to production, and have run into an issue.
>
> I got most things configured, but due to a limitation on one of my switches, I decided
> to change the management VLAN that the hosts communicate on. Over the course of changing
> that VLAN, I wound up resetting my router to default settings.
>
> I have the router operational again, and I also have 1 of my switches operational.
> Now, I'm trying to bring the oVirt cluster back online.
> This is oVirt 4.5 running on RHEL 8.3.
>
> The old VLAN is 1, and the new VLAN is 10.
>
> Currently, hosts 2 & 3 are accessible over the new VLAN, and can ping each other.
> I'm able to ssh to both hosts, and when I run "gluster peer status", I see that they
> are connected to each other.
>
> However, host 1 is not accessible from anything. I can't ping it, and it cannot get out.
>
> As part of my troubleshooting, I've done the following:
> From the host console, I ran `nmcli connection delete` to delete the old VLAN (VLAN 1).
> I moved the /etc/sysconfig/network-scripts/interface.1 file to interface.10, edited the
> file so that the VLAN and device settings are set to 10 instead of 1, and rebooted the host.
>
> The engine seems to be running, but I don't understand why.
> From each of the hosts that are working (host 2 and host 3), I ran
> "hosted-engine --check-liveliness" and both hosts indicate that the engine is NOT running.
>
> Yet the engine loads in a web browser, and I'm able to log into /ovirt-engine/webadmin/.
> The engine thinks that all 3 hosts are nonresponsive. See screenshot below:
>
> [Screenshot from 2021-04-07 17-33-48.png]
>
> What I'm really looking for help with is getting the first host back online.
> Once it is healthy and gluster is healthy, I feel confident I can get the engine
> operational again.
>
> What else should I look for on this host?