This is resolved, and my environment is 100% stable now.
Or was, until I then used the engine to "upgrade" one of the hosts, at which
point I started having problems again after the reboot, because the old vlan came back.
I'll finish getting things stabilized today, and hopefully won't run into this
again.
I've been turning things on and off quite a bit, because they aren't in a proper
data center (yet) and are just sitting here in my home office.
So I'm sure shutting them down and turning them back on fairly often hasn't helped
the situation.
I initially had a few issues going on:
1. I of course first broke things when I tried to change the management vlan
2. Aside from my notes below and the troubleshooting steps I went through yesterday, I
had forgotten that connectivity to the DNS server hadn't been restored. Once I got DNS
operational, the engine was able to see two of the hosts, and finally started showing some
green.
3. I then ran `hosted-engine --vm-stop` to shut down the engine, and then started it
again... and voilà. The last remaining problematic host came online, and a few
minutes later, the disks, volumes, and datacenter came online.
4. I think part of my problem has been this switch. I purchased a Netgear GS324T for my
frontend traffic, but I've also needed to put my backend traffic onto some temporary
ports on that switch until I can get a controller VM set up to manage my other switch,
a Ubiquiti US-XG-16, which will carry my permanent backend traffic. The Netgear hasn't
been nearly as simple to configure as I had hoped, and its vlan behavior has been
inconsistent - sometimes I have vlan settings in place and things work; sometimes they
don't. It has also occasionally re-assigned some of the vlans after reboots, which has
been frustrating. I'm close to being completely done configuring the infrastructure, but
I'm also getting increasingly tempted to go find a different switch.
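For anyone following along, the engine restart in step 3 above is just the stock
hosted-engine CLI. A minimal sketch (the sleep interval is my own guess at a safe
settling time, not anything oVirt prescribes):

```shell
# Restart the hosted engine VM from an HA host. Assumes the hosted-engine
# CLI from the ovirt-hosted-engine-setup package is installed.
restart_engine() {
    hosted-engine --vm-stop      # ask the HA stack to stop the engine VM
    sleep 60                     # give the agent time to settle (arbitrary wait)
    hosted-engine --vm-start     # start the engine VM again
    hosted-engine --vm-status    # then confirm engine/host health
}

# Only attempt this on a host that actually has the CLI.
if command -v hosted-engine >/dev/null 2>&1; then
    restart_engine
fi
```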
Lessons learned:
1. Always make sure DNS is functional.
   1. I was really hoping that I could run DNS as a VM (or multiple VMs) inside the
      cluster.
   2. That said, if the cluster and the engine won't even start correctly without it,
      then I may need to run DNS externally. I'm open to feedback on this.
      1. I have 1 extra U of space reserved at the datacenter, and I do have a 4th spare
         server that I haven't decided what to do with yet. It has way more CPU and RAM
         than would be necessary to run an internal DNS server... but perhaps I have no
         choice. Thoughts?
3. Make sure your vlan settings are correct before you start deploying the hosted engine
and configuring oVirt.
4. If possible, don't turn your servers off and on constantly. :) I realize this is a
given; I just don't have much choice in the matter right now, since these servers are
sitting in my home office rather than a datacenter.
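On the DNS lesson, here's the kind of pre-flight check I now run before touching the
hosted engine. `getent` goes through the normal NSS resolver path, so it catches both a
broken DNS server and a broken /etc/resolv.conf. The commented-out FQDNs are
placeholders, not real names from my setup:

```shell
# Verify that each name resolves before deploying/starting the hosted engine.
check_dns() {
    rc=0
    for name in "$@"; do
        if getent hosts "$name" >/dev/null; then
            echo "OK   $name"
        else
            echo "FAIL $name"
            rc=1
        fi
    done
    return $rc
}

check_dns localhost
# Example with real FQDNs (placeholders; substitute your own):
# check_dns engine.mgt.example.com cha1-storage.mgt.example.com
```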
Sent with ProtonMail Secure Email.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, April 9, 2021 5:55 AM, David White via Users <users(a)ovirt.org> wrote:
I was able to fix the connectivity issues between all 3 hosts.
It turned out that I hadn't completely deleted the old vlan settings from the host. I
re-ran "nmcli connection delete" on the old vlan connection. After that, I had to edit a
network-scripts file to fix the bridge configuration in ifcfg-ovirtmgmt.
After I did all that, the problematic host was accessible again. All 3 Gluster peers are
now able to see each other and communicate over the management network.
From the command line, I was then able to successfully run
"hosted-engine --connect-storage" without errors. I was also able to then run
"hosted-engine --vm-start".
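For reference, the cleanup above boils down to something like this sketch. The
connection name "eno1.1" is a placeholder (I don't have the exact name handy); list
yours with `nmcli connection show`:

```shell
# Remove a stale VLAN connection so NetworkManager can't bring it back on
# reboot, then re-read the hand-edited ifcfg files.
cleanup_old_vlan() {
    old_conn="$1"                      # e.g. "eno1.1" - placeholder name
    nmcli connection delete "$old_conn"
    # After editing ifcfg-ovirtmgmt by hand, make NetworkManager re-read it:
    nmcli connection reload
}

# Reconnect hosted-engine storage, then start the engine VM.
recover_engine() {
    hosted-engine --connect-storage && hosted-engine --vm-start
}

# Usage (on the affected host):
#   cleanup_old_vlan eno1.1
#   recover_engine
```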
Unfortunately, the engine itself is still unstable, and when I access the web UI / oVirt
Manager, it shows that all 3 hosts are inaccessible and down.
I don't understand how the web UI is operational at all if the
engine thinks that all 3 hosts are inaccessible. What's going on there?
Although the initial problem was my own doing (I changed the management vlan), I'm
deeply concerned by how unstable everything became - and has continued to be - ever
since I lost connectivity to the 1 host. I thought the point of all of this was that
things would (should) continue to work if 1 of the hosts went away.
Anyway, at this point, all 3 hosts are able to communicate with each other over the
management network, but the engine still thinks that all 3 hosts are down, and is unable
to manage anything.
Any suggestions on how to proceed would be much appreciated.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, April 7, 2021 8:28 PM, David White <dmwhite823(a)protonmail.com> wrote:
> I still haven't been able to resurrect the 1st host, so I've spent some time trying to get the hosted engine stable. I would welcome input on how to fix the problematic host so that it can be accessible again.
>
> As per my original email, this all started when I tried to change the management vlan. I honestly cannot remember what I did (if anything) to the actual hosts when this all started, but my troubleshooting steps today have been to try to fiddle with the vlan settings and /etc/sysconfig/network-scripts/ files on the problematic host to switch from the original vlan (1) to the new vlan (10).
>
> Until then, I'm troubleshooting why the hosted engine isn't really working, since the other two hosts are operational.
>
> The hosted engine is "running" -- I can access and navigate around the oVirt Manager.
> However, it appears that all of the storage domains are down, and all of the hosts are "NonOperational". I was, however, able to put two of the hosts into Maintenance Mode, including the problematic 1st host.
>
> This is what I see on the 2nd host:
>
> [root@cha2-storage network-scripts]# gluster peer status
> Number of Peers: 2
>
> Hostname: cha1-storage.mgt.example.com
> Uuid: 348de1f3-5efe-4e0c-b58e-9cf48071e8e1
> State: Peer in Cluster (Disconnected)
>
> Hostname: cha3-storage.mgt.example.com
> Uuid: 0563c3e8-237d-4409-a09a-ec51719b0da6
> State: Peer in Cluster (Connected)
>
> [root@cha2-storage network-scripts]# hosted-engine --vm-status
> The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
>
> [root@cha2-storage network-scripts]# hosted-engine --connect-storage
> Traceback (most recent call last):
>   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
>     "__main__", mod_spec)
>   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/connect_storage_server.py", line 30, in <module>
>     timeout=ohostedcons.Const.STORAGE_SERVER_TIMEOUT,
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 312, in connect_storage_server
>     sserver.connect_storage_server(timeout=timeout)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 394, in connect_storage_server
>     'Connection to storage server failed'
> RuntimeError: Connection to storage server failed
>
> The ovirt-ha-agent service seems to be continuously trying to load / activate, but failing:
> [root@cha2-storage network-scripts]# systemctl status -l ovirt-ha-agent
> ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
>    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
>    Active: activating (auto-restart) (Result: exit-code) since Wed 2021-04-07 20:24:46 EDT; 60ms ago
>   Process: 124306 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
>  Main PID: 124306 (code=exited, status=157)
>
> Some recent entries in /var/log/ovirt-hosted-engine-ha/agent.log:
> MainThread::ERROR::2021-04-07 20:22:59,115::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> MainThread::INFO::2021-04-07 20:22:59,115::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
> MainThread::INFO::2021-04-07 20:23:09,717::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.6 started
> MainThread::INFO::2021-04-07 20:23:09,742::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
> MainThread::INFO::2021-04-07 20:23:09,837::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
> MainThread::INFO::2021-04-07 20:23:09,838::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
> MainThread::ERROR::2021-04-07 20:23:09,839::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
> MainThread::ERROR::2021-04-07 20:23:09,842::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
>     response = self._proxy.start_monitor(type, options)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
>     return self.__send(self.__name, args)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
>     verbose=self.__verbose
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
>     return self.single_request(host, handler, request_body, verbose)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
>     http_conn = self.send_request(host, handler, request_body, verbose)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
>     self.send_content(connection, request_body)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
>     connection.endheaders(request_body)
>   File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
>     self._send_output(message_body, encode_chunked=encode_chunked)
>   File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
>     self.send(msg)
>   File "/usr/lib64/python3.6/http/client.py", line 974, in send
>     self.connect()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
>     self.sock.connect(base64.b16decode(self.host))
> FileNotFoundError: [Errno 2] No such file or directory
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>     return action(he)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>     return he.start_monitoring()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
>     self._initialize_broker()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
>     m.get('options', {}))
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
>     ).format(t=type, o=options, e=e)
> ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
>
> MainThread::ERROR::2021-04-07 20:23:09,842::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> MainThread::INFO::2021-04-07 20:23:09,842::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, April 7, 2021 5:36 PM, David White via Users <users(a)ovirt.org> wrote:
>
> > I'm working on setting up my environment prior to production, and have run into an issue.
> >
> > I got most things configured, but due to a limitation on one of my switches, I decided to change the management vlan that the hosts communicate on. Over the course of changing that vlan, I wound up resetting my router to default settings.
> >
> > I have the router operational again, and I also have 1 of my switches operational.
> > Now, I'm trying to bring the oVirt cluster back online.
> > This is oVirt 4.5 running on RHEL 8.3.
> >
> > The old vlan is 1, and the new vlan is 10.
> >
> > Currently, hosts 2 & 3 are accessible over the new vlan, and can ping each other.
> > I'm able to ssh to both hosts, and when I run "gluster peer status", I see that they are connected to each other.
> >
> > However, host 1 is not accessible from anything. I can't ping it, and it cannot get out.
> >
> > As part of my troubleshooting, I've done the following:
> > From the host console, I ran `nmcli connection delete` to delete the old vlan (VLAN 1).
> > I moved the /etc/sysconfig/network-scripts/interface.1 file to interface.10, and edited the file accordingly to make sure the vlan and device settings are set to 10 instead of 1, and I rebooted the host.
> >
> > The engine seems to be running, but I don't understand why.
> > From each of the hosts that are working (host 2 and host 3), I ran "hosted-engine --check-liveliness" and both hosts indicate that the engine is NOT running.
> >
> > Yet the engine loads in a web browser, and I'm able to log into /ovirt-engine/webadmin/.
> > The engine thinks that all 3 hosts are nonresponsive. See screenshot below:
> >
> > [Screenshot from 2021-04-07 17-33-48.png]
> >
> > What I'm really looking for help with is to get the first host back online.
> > Once it is healthy and gluster is healthy, I feel confident I can get the engine operational again.
> >
> > What else should I look for on this host?
> >