Hyperconverged engine high availability?
by David White
I just finished deploying oVirt 4.4.5 onto a 3-node hyperconverged cluster running Red Hat Enterprise Linux 8.3.
Over the course of the setup, I noticed that I had to set up the storage for the engine separately from the gluster bricks.
It looks like the engine was installed onto /rhev/data-center/ on the first host, whereas the gluster bricks for all 3 hosts are on /gluster_bricks/.
I fear that I may already know the answer to this, but:
Is it possible to make the engine highly available?
Also, thinking hypothetically here: what would happen to the VMs that are physically on the first server if that server crashed? The engine is what handles high availability, correct? So if a VM was running on the first host, there would be nothing to automatically "move" it to one of the remaining healthy hosts.
Or am I misunderstanding something here?
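If it helps frame the question, this is the kind of check I've been running on the first host (assuming the hosted-engine CLI is the right place to look):

    # shows the engine VM state and an HA score for each deployed host
    hosted-engine --vm-status
    # the HA services that would restart the engine on another host
    systemctl status ovirt-ha-agent ovirt-ha-broker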
Deployment issues
by Valerio Luccio
Hello all,
Last September I deployed oVirt on a CentOS 8 server, with storage on
our gluster (replica 3). I then added some VMs, etc. A few days ago I
managed to screw everything up and, after banging my head against it for
a couple of days, decided to start from scratch.
I made a copy of all the data under the storage to a safe space, then ran
ovirt-hosted-engine-cleanup, deleted everything under the storage and
tried to create a new hosted engine (I tried both from the cockpit and
from command line). Everything seems to work fine (I can ssh to the
engine) until it tries to save the engine to storage and it fails with
the error:
FAILED! => {"changed": false, "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[Error creating a storage domain's metadata]\". HTTP response code is 400."}
I don't get any more details.
I'm using exactly the same parameters I used before. I have no problems
reaching the gluster storage and the process does create the top-level
directory and <top-level>/dom_md/ids with the correct ownership. I
looked at the glusterfs log files, including the
rhev-data-center-mnt-glusterSD-<host:_root>.log file, but I don't spot
any specific error.
What am I doing wrong? Is there something else I need to clean up
before trying a new deployment? Should I just try to delete all of the
oVirt configuration files? Which ones?
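For reference, this is roughly the check I run against the volume before each retry (the volume name "engine" and the mount point are just examples; <gluster-host> stands in for one of our gluster servers):

    # mount the engine gluster volume somewhere temporary and confirm nothing is
    # left over from the previous deployment
    mount -t glusterfs <gluster-host>:/engine /mnt/engine-check
    ls -la /mnt/engine-check
    # if old data is still there, clear it before retrying (destructive!)
    rm -rf /mnt/engine-check/*
    umount /mnt/engine-check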
Thanks,
--
As a result of Coronavirus-related precautions, NYU and the Center for
Brain Imaging operations will be managed remotely until further notice.
All telephone calls and e-mail correspondence are being monitored
remotely during our normal business hours of 9am-5pm, Monday through
Friday.
For MRI scanner-related emergency, please contact: Keith Sanzenbach at
keith.sanzenbach(a)nyu.edu and/or Pablo Velasco at pablo.velasco(a)nyu.edu
For computer/hardware/software emergency, please contact: Valerio Luccio
at valerio.luccio(a)nyu.edu
For TMS/EEG-related emergency, please contact: Chrysa Papadaniil at
chrysa(a)nyu.edu
For CBI-related administrative emergency, please contact: Jennifer
Mangan at jennifer.mangan(a)nyu.edu
Valerio Luccio (212) 998-8736
Center for Brain Imaging 4 Washington Place, Room 158
New York University New York, NY 10003
"In an open world, who needs windows or gates ?"
Re: Power failure makes cluster and hosted engine unusable
by Thomas Hoberg
Yup, that's a bug in the ansible code; I've come across it on hosts that had 512GB of RAM.
I quite simply deleted the checks from the ansible code and re-ran the wizard.
I can't read YAML or Python or whatever it is that Ansible uses, but my impression is that values are 'cast' or converted into an INT data type in these checks, which overflows at that point. I wound up commenting out the entire set of checks to get past this, because I could see no easy way to fix it. I just checked that the commands used to retrieve the memory size returned the proper number of kilobytes and then rolled my eyes at what seemed like a type-cast operation.
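That sanity check, for reference, was nothing fancier than this, run directly on the host (an approximation of what the setup retrieves, as far as I could tell):

    # available memory as the host reports it, in kB and in MB
    grep MemAvailable /proc/meminfo
    free -m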
I never went deeper, because at that point I have a hard time keeping at bay the beast inside me that sees how Ansible can bring a Xeon Scalable Gold system to what seems slower than a 6502 executing BASIC. The hosted-engine setup takes what seems like an hour, no matter how fast or slow your hardware is.
I was so fed up with the speed of Ansible and the quality of oVirt QA that I couldn't bring myself to open a ticket; I hope you're better motivated.
BTW, things aren't much better at the low end either. While Ansible doesn't seem that much slower on an Atom farm I also operate, the hosted-engine setup does fail on them, so I replug their SSDs into an i7 for that part and then replug them back into the Atoms afterwards. Once up and running, oVirt is just fine on Atoms (mine have 32GB of RAM each).
I am almost ready to donate my Atoms to the project, because I keep thinking that oVirt's major chance would be as edge HCI, but they are asking for 128GB RAM minimum...
BTW, I was running oVirt 4.3 on CentOS 7 when I hit that error. No idea if it's still the same with 4.4/CentOS 8, as my 'perhaps-next-but-more-likely-never' generation test farm runs on NUCs with 64GB.
Network issue
by Valerio Luccio
I'm slowly getting things back up, but I ran into a perplexing network
problem.
I created a new vNIC profile in the engine web UI and attached it to the
hosted engine to test it, which in retrospect was not a good idea. I
realized that I needed to change something in the parameters of the
vNIC, but it won't let me do that because it's attached to the running
engine, and it will not let me remove the NIC from the engine because it's
running. If I shut down the engine, how can I then change its
configuration? It seems like a Catch-22.
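To be concrete, this is what I understand "shutting down the engine" to involve on a hosted-engine host (a sketch that illustrates the catch rather than resolving it):

    # stop the HA agents from restarting the engine, then shut the engine VM down
    hosted-engine --set-maintenance --mode=global
    hosted-engine --vm-shutdown
    # ...but with the engine VM down there is no web UI left to edit the vNIC with
    hosted-engine --vm-start
    hosted-engine --set-maintenance --mode=none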
Thanks,
Import VMs from disk
by valerio.luccio@nyu.edu
As I've described before, I had issues with my deployment and had to recreate the engine.
I never had a chance to export my VMs, but I did back up the files. Now I would like to import the old VMs into the new engine. Is there a way to 'slurp' in the old images? I see different ways to import from an export domain, from VMware, etc., but I don't see a way to import directly from disk. Do I need to convert them? To what, and how?
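For context, the backups are plain image files copied off the old storage, so locally the most I can do is inspect them, something like this (paths are placeholders):

    # check what format a saved image file is in
    qemu-img info /backup/vms/<image-file>
    # convert it if another format is needed before uploading (illustrative only)
    qemu-img convert -f raw -O qcow2 /backup/vms/<image-file> /tmp/<image-file>.qcow2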
Thanks,
Re: Power failure makes cluster and hosted engine unusable
by Vincent Royer
Seann,
If this happens again, try doing nothing (seriously). Each time I've had a
power failure, the engine takes a really long time to come back up. I
don't know if it's by design or what. Host logs are flooded with errors,
everything seemingly storage related. However, my Gluster setup is on fast
SSDs and is back up and running pretty much straight away. It takes maybe 5
minutes for the nodes to re-join and the volumes to show 'UP' with no
heals. However, it still takes the hosted engine a good hour or two to
simmer down and finally start up.
Sometimes I try to help by restarting ha-broker and ha-agent, or plunking
in other random commands from the mess of documentation, but it seems to
sort itself out on its own time, regardless of my tinkering.
I wish I could get more insight into the process, but definitely, doing
nothing and waiting has been the most successful troubleshooting step I
have taken.
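For what it's worth, my "helping" usually amounts to no more than this (and I honestly can't say it speeds anything up):

    # check where the hosted engine thinks it is
    hosted-engine --vm-status
    systemctl status ovirt-ha-agent ovirt-ha-broker
    # the occasional nudge
    systemctl restart ovirt-ha-broker ovirt-ha-agent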
Cheers!
On Mon, Mar 29, 2021 at 11:32 AM Seann G. Clark via Users <users(a)ovirt.org>
wrote:
> All,
>
>
>
> After a power failure, and generator failure I lost my cluster, and the
> Hosted engine refused to restart after power was restored. I would expect,
> once storage comes up that the hosted engine comes back online without too
> much of a fight. In practice because the SPM went down as well, there is no
> (clearly documented) way to clear any of the stale locks, and no way to
> recover both the hosted engine and the cluster.
>
>
>
> I have spent the last 12 hours trying to get a functional hosted-engine
> back online, on a new node and each attempt hits a new error, from the
> installer not understanding that 16384mb of dedicated VM memory out of
> 192GB free on the host is indeed bigger than 4096MB, to ansible dying on
> an error like this “Error while executing action: Cannot add Storage
> Connection. Storage connection already exists.”
>
> The memory error referenced above shows up as:
>
> [ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg":
> "Available memory ( {'failed': False, 'changed': False, 'ansible_facts':
> {u'max_mem': u'180746'}}MB ) is less then the minimal requirement (4096MB).
> Be aware that 512MB is reserved for the host and cannot be allocated to the
> engine VM."}
>
> That is what I typically get when I try the steps outlined in the KB
> “CHAPTER 7. RECOVERING A SELF-HOSTED ENGINE FROM AN EXISTING BACKUP” from
> the RH Customer portal. I have tried this numerous ways, and the cluster
> still remains in a bad state, with the hosted engine being 100% inoperable.
>
>
>
> What I do have are the two host that are part of the cluster and can host
> the engine, and backups of the original hosted engine, both disk and
> engine-backup generated. I am not sure what I can do next, to recover this
> cluster, any suggestions would be apricated.
>
>
>
> Regards,
>
> Seann
>
>
>
>
Power failure makes cluster and hosted engine unusable
by Seann G. Clark
All,
After a power failure and a generator failure, I lost my cluster, and the hosted engine refused to restart after power was restored. I would expect that, once storage comes up, the hosted engine comes back online without too much of a fight. In practice, because the SPM went down as well, there is no (clearly documented) way to clear any of the stale locks, and no way to recover both the hosted engine and the cluster.
I have spent the last 12 hours trying to get a functional hosted engine back online on a new node, and each attempt hits a new error: from the installer not understanding that 16384 MB of dedicated VM memory out of 192 GB free on the host is indeed bigger than 4096 MB, to Ansible dying on an error like this: "Error while executing action: Cannot add Storage Connection. Storage connection already exists."
The memory error referenced above shows up as:
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Available memory ( {'failed': False, 'changed': False, 'ansible_facts': {u'max_mem': u'180746'}}MB ) is less then the minimal requirement (4096MB). Be aware that 512MB is reserved for the host and cannot be allocated to the engine VM."}
That is what I typically get when I try the steps outlined in the KB "CHAPTER 7. RECOVERING A SELF-HOSTED ENGINE FROM AN EXISTING BACKUP" from the RH Customer portal. I have tried this numerous ways, and the cluster remains in a bad state, with the hosted engine being 100% inoperable.
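Concretely, the restore attempts boil down to the command that chapter documents, roughly (the backup path is a placeholder for whatever engine-backup produced):

    hosted-engine --deploy --restore-from-file=/path/to/engine-backup.tar.gz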
What I do have are the two hosts that are part of the cluster and can host the engine, and backups of the original hosted engine, both the disk and an engine-backup archive. I am not sure what I can do next to recover this cluster; any suggestions would be appreciated.
Regards,
Seann
VLAN Trunking to Cluster Nodes
by Jason Alexander Hazen Valliant-Saunders
Good Day Ovirt Users;
I am running the following setup:
[three screenshots omitted]
I need to set up the bond1 connector in the data center with 9 VLANs; they
exist on the oVirt node already and the bond is up:
[screenshot omitted]
(bond0 is the 10Gb Chelsio storage network)
The idea is to pass VLANs through on bond1 (bond1.1, bond1.2, bond1.3, ...,
bond1.n, where n = the VLAN id).
However, I'm unsure how to go about setting up the provider in the
data center so that bond1 is used to trunk VLANs into the DC.
From what I have read, the VLANs need to exist on the cluster node (they
do), but the ovirt-engine must then have some kind of provider configured
(just like it does for ovirtmgmt); then I can map the vNIC profiles to that
new provider as per their VLAN assignment.
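For reference, this is how the tagged sub-interfaces already show up on the node itself (bond1.1 as an example; the rest follow the same pattern):

    ip -d link show bond1
    ip -d link show bond1.1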
Regards,
Hazen
Cinderlib problem after upgrade from 4.3.10 to 4.4.5
by Marc-Christian Schröer
Hello all,
first of all, thank you very much for this stable virtualization environment. It has been a pillar of our company's business for more than 5 years now, and since migrating from version 3 to 4 it has been rock solid. Anyway, yesterday I ran into a problem I cannot fix on my own:
After a lot of consideration and hesitation, since this is a production environment, I followed the upgrade guide (https://www.ovirt.org/documentation/upgrade_guide/), configured a vanilla CentOS 8 server as the new controller, decommissioned the old 4.3 controller and fired up the new one. It worked like a charm until I tried to migrate VMs, start new ones or even create new disks. We use Ceph as managed storage, providing an SSD-only and an HDD-only pool. The UI simply told me that there was an error.
I started investigating the issue and found corresponding log entries in ovirt-engine.log:
2021-03-22 10:36:37,247+01 ERROR [org.ovirt.engine.core.common.utils.cinderlib.CinderlibExecutor] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [67bf193c] cinderlib execution failed:
But that was all the engine had to say about the issue. There was no stack trace or additional information. There is no logfile in /var/log/ovirt-engine/cinderlib/; the directory is simply empty, while on the old controller it was frequently filled with annoying "already mounted" messages.
Can anyone help me with that issue? I searched the web for a solution or someone else with the same problem, but came up empty. Is there a way to turn up the log level for cinderlib? Are there any dependencies I have to install besides the ovirt packages? Any help is very much appreciated!
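For completeness, the kind of dependency check I can run on the new engine host looks like this (the package names are my best guess and may differ between releases):

    rpm -q python3-cinderlib python3-os-brick ceph-common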
Kind regards and stay healthy,
Marc
--
________________________________________________________________________
Dipl.-Inform. Marc-Christian Schröer schroeer(a)ingenit.com
Geschäftsführer / CEO
----------------------------------------------------------------------
ingenit GmbH & Co. KG Tel. +49 (0)231 58 698-120
Emil-Figge-Strasse 76-80 Fax. +49 (0)231 58 698-121
D-44227 Dortmund www.ingenit.com
Registergericht: Amtsgericht Dortmund, HRA 13 914
Gesellschafter : Thomas Klute, Marc-Christian Schröer
________________________________________________________________________