Re: Power failure makes cluster and hosted engine unusable
by Thomas Hoberg
Yup, that's a bug in the Ansible code, which I've come across on hosts that had 512 GB of RAM.
I quite simply deleted the checks from the Ansible code and re-ran the wizard.
I can't read YAML or Python or whatever it is that Ansible uses, but my impression is that in these checks things are 'cast' or converted into an INT data type that overflows at that point. I wound up commenting out the entire set of checks to get past this, because I could see no easy way to fix it. I just checked that the commands used to retrieve the memory size returned the proper number of kilobytes, and then rolled my eyes at what seemed like a type-cast operation.
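If you want to chase it down yourself, something like this should surface the check (the role path is my guess from my install; it has moved around between releases):
# Find the memory validation inside the hosted-engine setup role
grep -rn "max_mem" /usr/share/ansible/roles/ovirt.hosted_engine_setup/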
I never dug deeper, because at that point I have a hard time keeping at bay the beast inside me that sees how Ansible can bring a Xeon Scalable Gold system to what seems slower than a 6502 executing BASIC. The hosted-engine setup takes what feels like an hour, no matter how fast or slow your hardware.
I was so fed up with the speed of Ansible and the quality of oVirt QA that I couldn't bring myself to open a ticket; I hope you're better motivated.
BTW, things aren't much better at the low end either. While Ansible doesn't seem that much slower on an Atom farm I also operate, the hosted-engine setup does fail on them, so I replug their SSDs into an i7 for that part and then replug the SSDs into the Atoms afterwards. Once up and running, oVirt is just fine on Atoms (mine have 32 GB of RAM each).
I am almost ready to donate my Atoms to the project, because I keep thinking that oVirt's best chance is as edge HCI, but they are asking for a 128 GB RAM minimum...
BTW, I was running oVirt 4.3 on CentOS 7 when I hit that error. No idea if it's still the same with 4.4/CentOS 8, as my 'perhaps-next-but-more-likely-never' generation test farm runs on NUCs with 64 GB.
3 years, 8 months
Network issue
by Valerio Luccio
I'm slowly getting things back up, but I ran into a perplexing network
problem.
I created a new vNIC profile in the engine web UI and attached it to the
hosted engine to test it, which in retrospect was not a good idea. I
realized that I needed to change something in the parameters of the
vNIC, but it won't let me because it's attached to the running
engine; and it won't let me remove the NIC from the engine because it's
running. If I shut down the engine, how can I then change its
configuration? Seems like a Catch-22 situation.
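For what it's worth, the cleanest shutdown path I can see (standard hosted-engine tooling, if I understand it right) still leaves me without a UI to make the change:
# Stop the HA agents from automatically restarting the engine VM
hosted-engine --set-maintenance --mode=global
# Shut down the engine VM
hosted-engine --vm-shutdown
# ...but with the engine down the web UI is gone too, hence the Catch-22.
# Afterwards:
hosted-engine --vm-start
hosted-engine --set-maintenance --mode=none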
Thanks,
--
As a result of Coronavirus-related precautions, NYU and the Center for
Brain Imaging operations will be managed remotely until further notice.
All telephone calls and e-mail correspondence are being monitored
remotely during our normal business hours of 9am-5pm, Monday through
Friday.
For MRI scanner-related emergency, please contact: Keith Sanzenbach at
keith.sanzenbach(a)nyu.edu and/or Pablo Velasco at pablo.velasco(a)nyu.edu
For computer/hardware/software emergency, please contact: Valerio Luccio
at valerio.luccio(a)nyu.edu
For TMS/EEG-related emergency, please contact: Chrysa Papadaniil at
chrysa(a)nyu.edu
For CBI-related administrative emergency, please contact: Jennifer
Mangan at jennifer.mangan(a)nyu.edu
Valerio Luccio (212) 998-8736
Center for Brain Imaging 4 Washington Place, Room 158
New York University New York, NY 10003
"In an open world, who needs windows or gates ?"
3 years, 8 months
Import VMs from disk
by valerio.luccio@nyu.edu
As I've described before, I had issues with my deployment and had to recreate the engine.
I never had a chance to export my VMs, but I did back up the files. Now I would like to import the old VMs into the new engine. Is there a way to 'slurp' in the old images? I see different ways to import from an export domain, from VMware, etc., but I don't see a way to import directly from disk. Do I need to convert them? To what, and how?
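In case it helps, this is all I've done with the files so far (paths are mine; just confirming that qemu-img recognizes the images):
# Inspect one of the backed-up disk images
qemu-img info /backup/vms/vm01/disk01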
Thanks,
3 years, 8 months
Re: Power failure makes cluster and hosted engine unusable
by Vincent Royer
Seann,
If this happens again, try doing nothing (seriously). Each time I've had a
power failure, the engine takes a really long time to come back up. I
don't know if it's by design or what. Host logs are flooded with errors,
seemingly all storage related. However, my Gluster setup is on fast
SSDs and is back up and running pretty much straight away. It takes maybe 5
minutes for the nodes to re-join and the volumes to show 'UP' with no
heals pending. However, it still takes the hosted engine a good hour or two to
simmer down and finally start up.
Sometimes I try to help by restarting ha-broker and ha-agent, or plunking
in other random commands from the mess of documentation, but it seems to
sort itself out on its own time, regardless of my tinkering.
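For the record, my 'random commands' mostly boil down to these (standard systemd and hosted-engine tooling):
# Restart the hosted-engine HA services on a host
systemctl restart ovirt-ha-broker ovirt-ha-agent
# Watch the engine VM state and HA score while it sorts itself out
hosted-engine --vm-status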
I wish I could get more insight into the process, but definitely, doing
nothing and waiting has been the most successful troubleshooting step I
have taken.
Cheers!
On Mon, Mar 29, 2021 at 11:32 AM Seann G. Clark via Users <users(a)ovirt.org>
wrote:
> All,
>
>
>
> After a power failure and a generator failure, I lost my cluster, and the
> hosted engine refused to restart after power was restored. I would expect
> that, once storage comes up, the hosted engine comes back online without
> too much of a fight. In practice, because the SPM went down as well, there is no
> (clearly documented) way to clear any of the stale locks, and no way to
> recover both the hosted engine and the cluster.
>
>
>
> I have spent the last 12 hours trying to get a functional hosted engine
> back online on a new node, and each attempt hits a new error, from the
> installer not understanding that 16384 MB of dedicated VM memory out of
> 192 GB free on the host is indeed bigger than 4096 MB, to Ansible dying on
> an error like this: “Error while executing action: Cannot add Storage
> Connection. Storage connection already exists.”
>
> The memory error referenced above shows up as:
>
> [ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg":
> "Available memory ( {'failed': False, 'changed': False, 'ansible_facts':
> {u'max_mem': u'180746'}}MB ) is less then the minimal requirement (4096MB).
> Be aware that 512MB is reserved for the host and cannot be allocated to the
> engine VM."}
>
> That is what I typically get when I try the steps outlined in the KB
> “CHAPTER 7. RECOVERING A SELF-HOSTED ENGINE FROM AN EXISTING BACKUP” from
> the RH Customer portal. I have tried this numerous ways, and the cluster
> still remains in a bad state, with the hosted engine being 100% inoperable.
>
>
>
> What I do have are the two hosts that are part of the cluster and can host
> the engine, and backups of the original hosted engine, both the disk and an
> engine-backup dump. I am not sure what I can do next to recover this
> cluster; any suggestions would be appreciated.
>
>
>
> Regards,
>
> Seann
>
>
>
>
> _______________________________________________
> Users mailing list -- users(a)ovirt.org
> To unsubscribe send an email to users-leave(a)ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/JLDIFTKYDPQ...
>
3 years, 9 months
Power failure makes cluster and hosted engine unusable
by Seann G. Clark
All,
After a power failure and a generator failure, I lost my cluster, and the hosted engine refused to restart after power was restored. I would expect that, once storage comes up, the hosted engine comes back online without too much of a fight. In practice, because the SPM went down as well, there is no (clearly documented) way to clear any of the stale locks, and no way to recover both the hosted engine and the cluster.
I have spent the last 12 hours trying to get a functional hosted engine back online on a new node, and each attempt hits a new error, from the installer not understanding that 16384 MB of dedicated VM memory out of 192 GB free on the host is indeed bigger than 4096 MB, to Ansible dying on an error like this: "Error while executing action: Cannot add Storage Connection. Storage connection already exists."
The memory error referenced above shows up as:
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Available memory ( {'failed': False, 'changed': False, 'ansible_facts': {u'max_mem': u'180746'}}MB ) is less then the minimal requirement (4096MB). Be aware that 512MB is reserved for the host and cannot be allocated to the engine VM."}
That is what I typically get when I try the steps outlined in the KB "CHAPTER 7. RECOVERING A SELF-HOSTED ENGINE FROM AN EXISTING BACKUP" from the RH Customer portal. I have tried this numerous ways, and the cluster still remains in a bad state, with the hosted engine being 100% inoperable.
What I do have are the two hosts that are part of the cluster and can host the engine, and backups of the original hosted engine, both the disk and an engine-backup dump. I am not sure what I can do next to recover this cluster; any suggestions would be appreciated.
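For reference, my restore attempts have been variations on the documented flow (the backup path here is a placeholder):
# Redeploy the hosted engine on a fresh host while restoring the engine backup
hosted-engine --deploy --restore-from-file=/root/engine-backup.bck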
Regards,
Seann
3 years, 9 months
VLAN Trunking to Cluster Nodes
by Jason Alexander Hazen Valliant-Saunders
Good Day Ovirt Users;
I am running the following setup (screenshots from the original mail are omitted here).
I need to set up the bond1 connector in the datacenter with 9 VLANs. They already exist on the oVirt node and the bond is up (screenshot omitted; bond0 is the 10 Gb Chelsio storage network).
The idea is to pass VLANs through on bond1 (bond1.1, bond1.2, bond1.3, ... bond1.n), where n = VLAN ID.
However, I'm unsure how to go about setting up the provider in the
datacenter so that bond1 is used to trunk VLANs into the DC.
From what I have read, the VLANs need to exist on the cluster node (they
do), but the ovirt-engine must then have some kind of provider configured
(just like it does for ovirtmgmt); then I can map the vNIC profiles to that
new provider as per their VLAN assignment.
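In case it helps anyone answer, this is how the VLAN sub-interfaces look on the node today (VLAN ID 10 is just an example; standard iproute2):
# Show one of the existing VLAN sub-interfaces riding on the bond
ip -d link show bond1.10
# How one would be created by hand, outside of oVirt (for comparison only)
ip link add link bond1 name bond1.10 type vlan id 10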
Regards,
Hazen
3 years, 9 months
Cinderlib problem after upgrade from 4.3.10 to 4.4.5
by Marc-Christian Schröer
Hello all,
first of all, thank you very much for this stable virtualization environment. It has been a pillar of our company's business for more than 5 years now, and after migrating from version 3 to 4 it has been rock solid. Anyway, yesterday I ran into a problem I cannot fix on my own:
After a lot of consideration and hesitation, since this is a production environment, I followed the upgrade guide (https://www.ovirt.org/documentation/upgrade_guide/), configured a vanilla CentOS 8 server as controller, decommissioned the old 4.3 controller and fired up the new one. It worked like a charm until I tried to migrate VMs, start new ones or even create new disks. We use Ceph as managed storage, providing an SSD-only and an HDD-only pool. The UI simply told me that there was an error.
I started investigating the issue and found corresponding log entries in ovirt-engine.log:
2021-03-22 10:36:37,247+01 ERROR [org.ovirt.engine.core.common.utils.cinderlib.CinderlibExecutor] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-24) [67bf193c] cinderlib execution failed:
But that was all the engine had to say about the issue; there was no stack trace or additional information. There is no logfile in /var/log/ovirt-engine/cinderlib/, the directory is simply empty, while on the old controller it was frequently filled with annoying "already mounted" messages.
Can anyone help me with that issue? I searched the web for a solution or someone else with the same problem, but came up empty. Is there a way to turn up the log level for cinderlib? Are there any dependencies I have to install besides the ovirt packages? Any help is very much appreciated!
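In case it matters, this is how I checked the storage stack on the new engine host (package names are my assumption, carried over from what Ceph-backed managed block storage needed on 4.3):
# Verify the cinderlib stack and the Ceph client bits are installed
rpm -q python3-cinderlib python3-os-brick ceph-common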
Kind regards and stay healthy,
Marc
--
________________________________________________________________________
Dipl.-Inform. Marc-Christian Schröer schroeer(a)ingenit.com
Geschäftsführer / CEO
----------------------------------------------------------------------
ingenit GmbH & Co. KG Tel. +49 (0)231 58 698-120
Emil-Figge-Strasse 76-80 Fax. +49 (0)231 58 698-121
D-44227 Dortmund www.ingenit.com
Registergericht: Amtsgericht Dortmund, HRA 13 914
Gesellschafter : Thomas Klute, Marc-Christian Schröer
________________________________________________________________________
3 years, 9 months
Unable to add hosts to cluster & possible related routing issue
by David White
Hi all,
I used the oVirt installer via cockpit to set up a hyperconverged cluster on 3 physical hosts running RHEL 8.3. I used the following two resources to guide my work:
- https://www.ovirt.org/documentation/gluster-hyperconverged/chap-Deploying...
- https://blogs.ovirt.org/2018/02/up-and-running-with-ovirt-4-2-and-gluster...
As I mentioned in a previous email a week ago, it appears as if the wizard successfully setup gluster on all 3 of the servers, and the wizard also successfully setup the ovirt engine on the first host.
However, the oVirt engine only recognizes the first host, and only the first host's physical resources are available to the cluster.
So I have Gluster 8 installed on the 3 hosts (RHEL 8), and oVirt 4.4.5 installed on the first host, along with the ovirt engine VM.
Running `gluster peer status` from the first node, I can confirm that the other two physical hosts are healthy:
(example.com is replacing actual domain below)
[root@cha1-storage ~]# gluster peer status
Number of Peers: 2

Hostname: cha2-storage.mgt.example.com
Uuid: 240a7ab1-ab52-4e5b-98ed-d978f848835e
State: Peer in Cluster (Connected)

Hostname: cha3-storage.mgt.example.com
Uuid: 0563c3e8-237d-4409-a09a-ec51719b0da6
State: Peer in Cluster (Connected)
I am now trying to get the other two hosts added to the Engine. I navigate to Compute -> Hosts, click New, fill in the details (hostname, root password, etc.), and begin the installation on the additional hosts.
It keeps failing.
Checking the error logs in /var/log/ovirt-engine/host-deploy on the Engine VM, I see the following near the bottom:
"msg" : "Failed to download metadata for repo 'ovirt-4.4-epel': Cannot prepare internal mirrorlist: Curl error (7): Couldn't connect to server for https://mirrors.fedoraproject.org/metalink?repo=epel-8&arch=x86_64&infra=... [Failed to connect to mirrors.fedoraproject.org port 443: No route to host]",
I can confirm that the Engine and each of the hosts are able to get to mirrors.fedoraproject.org:
I've run the following on both the Engine, as well as each of the hosts (the first host where everything is installed, as well as the 2nd host where I'm trying to get it installed):
[root@ovirt-engine1 host-deploy]# curl -i https://mirrors.fedoraproject.org
HTTP/2 302
Note, that this traffic is going out the management network.
That may be an important distinction -- keep reading.
So this does lead me to another issue that may or may not be related.
I've discovered that, from the RHEL 8 host's perspective, the public facing network is unable to get out to the internet.
Management (Gluster & the Engine VM) is on: 10.1.0.0/24
The hosts are able to ping each other and communicate with each other.
Each host is able to ping 8.8.8.8 whenever the frontend network interface is disabled (see below).
The frontend network is on: 192.168.0.0/24
The hosts are able to ping each other on this network,
but this network isn't able to get out to the internet (yet).
Obviously I need to fix the routing and figure out why the 192.168.0.0/24 network is unable to reach the internet.
But shouldn't all the traffic to install the ovirt functionality onto the 2nd host go out the management network?
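For reference, this is what I've been using to check which way traffic actually leaves the hosts (standard iproute2; 8.8.8.8 is just an example destination):
# Show the full routing table
ip route show
# Ask the kernel which route and source address it would pick for a destination
ip route get 8.8.8.8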
So to summarize, I have a few questions:
- Why didn't the wizard properly install the ovirt functionality onto all 3 hosts to begin with when I did the initial installation?
- From the physical host's perspective, what should be the default route? The internal management, or the front-end network?
- Is the following statement accurate?
- The ovirt-engine's default route should be management -- it doesn't even have a front-end IP address.
- Why would the host installation fail and claim that it cannot get to mirrors.fedoraproject.org, when clearly it can?
- Again, the ovirt-engine VM is able to curl that URL and gets a valid http response.
- Any other tips or suggestions on how to troubleshoot this?
3 years, 9 months
hosted engine vm don't boot after minor update
by a.cor@outlook.com
Hello!
I have RHV 4.4.2 with NFS storage for the hosted engine storage domain.
I updated to 4.4.4 following the guide (https://www.ovirt.org/documentation/upgrade_guide/#Updating_a_self-hosted...) and rebooted the hosted engine VM. After this the VM does not boot.
Connected to the console, I see the following error:
BdsDxe: failed to load Boot0001 "UEFI Misc Device"...
After "Press any key" I got BIOS screen, go to Device Manager -> Drive Health Manager I see nothing.
I suspect that something happened to the VM config.
I checked /etc/ovirt-hosted-engine/hosted-engine.conf:
everything looks correct (I checked the UUID of the storage domain, etc.).
I also checked /var/run/ovirt-hosted-engine-ha/vm.conf.
After decoding the base64 I found something like this:
<source file="/rhev/data-center/0000-0000-0000-0000-0000/6916940d-b21b-4739-9d6b-4f7ff9e72a9e/images/9a17d31d-30f8-480d-a106-2fa7f0d48c40/a817c10b-0366-4c05-b512-e6beb8308b5d
The whole path is correct except for the zeros after data-center/.
I changed it to the real path, encoded it back to base64, and tried to start the VM with hosted-engine --vm-start --vm-conf=PATH_TO_NEW_FILE.
Nothing; I see a similar error.
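For completeness, this is how I verified that the image path actually resolves on the host (the UUIDs are the ones from the config above; the NFS mount directory name differs per setup, hence the glob):
# Check that the hosted-engine disk image is visible under the NFS mount
ls -l /rhev/data-center/mnt/*/6916940d-b21b-4739-9d6b-4f7ff9e72a9e/images/9a17d31d-30f8-480d-a106-2fa7f0d48c40/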
Could someone help me in this case?
3 years, 9 months