Greetings,
We have recently installed oVirt as a hosted engine with high availability on six nodes over NFS storage (no Gluster), with power management through the on-board IPMI devices, and the setup was successful. All the nodes (Supermicro) are identical in every respect, so there are no hardware differences, and no modifications were made to the servers' hardware. The hosted engine was also deployed on a second host, so only two of the six hosts were configured to run the HE VM.
The network interface on each node is a bond of two physical fiber-optic NICs in LACP mode with a VLAN on top, serving as the sole network interface for the node. No separate VM or storage networks were needed, as the host OS, the hosted-engine VM, and the storage are required to be on the same network and VLAN.
We started by testing the high availability of the hosted-engine VM (as it was deployed on two of the six nodes) by rebooting or powering off one of those hosts, and the VM would migrate successfully to the second HE node. The main goal of our experiments is to test the robustness of the setup, since the cluster is required to remain functional even when up to two hosts are brought down (whether due to a network or power issue). However, when we reboot or power off one of the hosts, the HE VM goes down and takes the entire cluster with it, to the point where we cannot even access the web portal. Once the host is rebooted, the HE VM and the cluster become functional again. Sometimes the HE VM stays down for a set amount of time (5 to 6 minutes) and then comes back up, and sometimes it stays down until the problematic host is back up. This behavior affects other VMs as well, not only the HE.
We suspected an issue with the NFS storage; however, during oVirt operation it is mounted properly under /rhev/data-center/mnt/<nfs:directory>, while the expected behavior is for the cluster to stay operational and for any other VMs to be migrated to other hosts. During one of the tests we mounted the NFS storage on a different directory and there was no problem: we could run commands such as ls without any issues, write a text file at the directory's root, and modify it normally.
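For completeness, this is roughly the kind of manual read/write sanity check we performed, written up as a small Python sketch; the mount point below is a placeholder, not our actual path:

import os
import tempfile

MOUNT = "/mnt/nfs-test"  # placeholder: wherever the NFS export was mounted for the test

def check_read_write(mount):
    # Equivalent of our manual `ls` on the directory root.
    print(len(os.listdir(mount)), "entries in", mount)
    # Write a small text file, read it back, then remove it.
    fd, path = tempfile.mkstemp(dir=mount, prefix="rw-check-", suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("hello\n")
        with open(path) as f:
            return f.read() == "hello\n"
    finally:
        os.unlink(path)

print("read/write OK" if check_read_write(MOUNT) else "read-back mismatch")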
We suspected a couple of things. The first is that the HE is unable to fence the problematic host (the one we took down); however, power management is set up properly. The second is that the remaining cluster hosts (after one of them is taken down) are unable to acquire the storage lease, which is strange since the host in question is down and non-operational, so no locks should be in place. The reason behind this suspicion is the following two errors that we receive frequently in the engine's /var/log/ovirt-engine/engine.log file whenever one or more hosts go down:
1- "EVENT_ID: VM_DOWN_ERROR(119), VM HostedEngine is down with error. Exit message:
resource busy: Failed to acquire lock: Lease is held by another host."
2- "[<id>] Command 'GetVmLeaseInfoVDSCommand(
VmLeaseVDSParameters:{expectedEngineErrors='[NoSuchVmLeaseOnDomain]',
storagePoolId='<pool-id>', ignoreFailoverLimit='false',
leaseId='<lease-id>', storageDomainId='<domain-id>'})'
execution failed: IRSGenericException: IRSErrorException: No such lease:
'lease=<lease-id>'"
A third warning appears in the /var/log/vdsm/vdsm.log file:
3- "WARN (check/loop) [storage.check] Checker
'/rhev/data-center/mnt/<nfs-domain:/directory>/<id>/dom_md/metadata'
is blocked for 310.00 seconds (check:265)"
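That checker message is vdsm's periodic read of the storage domain metadata file taking too long. A simplified Python sketch of the same idea is below: it just times a small read of the metadata file. The path is a placeholder, and unlike vdsm's real checker it does not use O_DIRECT through ioprocess, so take it only as an approximation:

import time

# Placeholder: substitute the real storage domain metadata path.
METADATA = "/rhev/data-center/mnt/<nfs-domain>/<id>/dom_md/metadata"

def timed_read(path, size=4096):
    # Read one small block and report how long the read took.
    start = time.monotonic()
    with open(path, "rb") as f:
        data = f.read(size)
    return len(data), time.monotonic() - start

n, dt = timed_read(METADATA)
print("read %d bytes in %.3f seconds" % (n, dt))
# vdsm emits the "is blocked for N seconds" warning when its own checker
# read does not complete for that long (310 s in the log above), which
# points at the NFS server or the network path rather than the mount point.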
All the tests are done without putting the nodes into maintenance mode, as we are simulating an emergency situation. No HE configuration values were modified via the engine-config command; the default values are used.
Is this normal behavior? Are we missing something? Do we need to tweak a certain configuration using the engine-config command to get better behavior (e.g., a shorter down period)?
Best regards
You must not have an even number of HE-capable nodes. What you're running into is a classic split-brain scenario: with only two nodes allowed to run the HE, and one of those nodes down, the surviving host does not have quorum, so it does not know it can safely power off the other machine (because, from its own viewpoint, this surviving node may have somehow become isolated from the network while the other host is happily alive, running the engine, and controlling everything).
In clustering, you _never_ want two, or four, or six of something. One, three, five, etc., because it must be impossible to have a "tie" when decisions are being made about which hosts need to be fenced.
The split-brain explanation makes a lot of sense, so I changed the number of hosts on which the HE VM is deployed from 6 to 5, and also set an Affinity Group with soft VM-to-Hosts enforcement for the other VMs (since the HE is managed separately) across those 5 hosts, with a matching Affinity Label. I then ran a test in which I applied pending upgrades from the Administration Portal on a host, choosing the option to reboot it afterwards: the host is moved to maintenance, the updates are installed, and the host is rebooted. I still get the same issue, where the entire cluster, with every single VM including the HE, goes down. Is there an additional configuration that needs to be set or tweaked for the cluster not to fail?
Thank you and best regards
Hi,
I am not sure how split brain works in oVirt. In other cluster solutions you can have an even number of nodes, such as 4, 6, 8, etc. All nodes write to a quorum disk. Say we have four nodes and one node goes down; the other three will vote:
node 1 says: I see nodes 2 and 3 but not 4
node 2 says: I see nodes 1 and 3 but not 4
node 3 says: I see nodes 1 and 2 but not 4
So all three vote that node 4 must leave the cluster, and the split brain is resolved (see the sketch below).
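To make the majority-vote idea concrete, here is a toy Python sketch of that decision. It is purely illustrative of the generic quorum-disk concept described above and is not how oVirt itself decides fencing (oVirt relies on sanlock storage leases and engine-driven power fencing):

# Toy majority-vote fencing decision; not oVirt's actual algorithm.
nodes = {1, 2, 3, 4}

# What each surviving node reports it can still see (the example above).
visibility = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2},
}

def fence_candidates(all_nodes, visibility):
    # A node is fenced when a strict majority of the whole cluster cannot see it.
    majority = len(all_nodes) // 2 + 1
    voters = set(visibility)
    candidates = set()
    for suspect in all_nodes - voters:
        votes_against = sum(1 for seen in visibility.values() if suspect not in seen)
        if votes_against >= majority:
            candidates.add(suspect)
    return candidates

print(fence_candidates(nodes, visibility))  # -> {4}

Note that with a 2-2 split in the same four-node cluster, neither half reaches the 3-vote majority, which is exactly the "tie" situation the earlier reply warned about; a quorum disk or an odd node count is what breaks such ties.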
So, for the sake of figuring out what's going on, I un-deployed the hosted engine from 3 of the 6 hosts and then removed those 3 from the cluster. I left only 2 VMs up (the HE and one other), and all 3 remaining hosts are capable of running the hosted engine with an active score of 3400, so no split-brain scenario could happen.
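For reference, this is roughly how one can check those scores from an HE-capable host by parsing the output of `hosted-engine --vm-status --json`. The exact JSON field names used below ("hostname", "score", "engine-status") are assumptions from memory and may differ between versions:

import json
import subprocess

# Query the hosted-engine HA status as JSON (run on an HE-capable host).
out = subprocess.check_output(["hosted-engine", "--vm-status", "--json"], text=True)
status = json.loads(out)

for key, entry in status.items():
    # Skip top-level flags such as global maintenance; keep per-host entries.
    if not isinstance(entry, dict) or "score" not in entry:
        continue
    print(entry.get("hostname", key), "score:", entry["score"],
          "engine-status:", entry.get("engine-status"))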
I simply unplugged the network cable from one of the hosts, one that also happened not to be hosting any VMs, to simulate an unexpected shutdown/failure, and within around 30 seconds the 2 VMs (including the HE) were down and stayed down until the failed host was brought back up.
What I'm trying to say here is that HE and any other highly available VM fail if a
single host (just one) fails. Is there anything I could debug or look into that I
might've missed when setting up oVirt? Is there a misconfiguration somewhere?
Thank you and best regards
That is wild, and definitely not in line with the behavior we saw when a motherboard failed in one of our small 3-node oVirt 4.5 clusters. The bad host hard-froze; shortly afterwards, one of the surviving nodes fenced it via IPMI, and all the VMs that had been running on it were restarted on the surviving hosts.
It sounds like your experience is that a single node fails (even one with no VMs on it at the time), and the surviving hosts then shut down (or perhaps kill) the VMs they have running. Is that accurate?
How is this going? The thread has been inactive for a while.