4.4 HCI Install Failure - Missing /etc/pki/CA/cacert.pem
by Stephen Panicho
Hi all! I'm using Cockpit to perform an HCI install, and it fails at the
hosted engine deploy. Libvirtd can't restart because of a missing
/etc/pki/CA/cacert.pem file.
The log (tasks seemingly from
/usr/share/ansible/roles/ovirt.hosted_engine_setup/tasks/initial_clean.yml):
[ INFO ] TASK [ovirt.hosted_engine_setup : Stop libvirt service]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Drop vdsm config statements]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Restore initial abrt config
files]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Restart abrtd service]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Drop libvirt sasl2 configuration
by vdsm]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Stop and disable services]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Restore initial libvirt default
network configuration]
[ INFO ] changed: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Start libvirt]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Unable
to start service libvirtd: Job for libvirtd.service failed because the
control process exited with error code.\nSee \"systemctl status
libvirtd.service\" and \"journalctl -xe\" for details.\n"}
journalctl -u libvirtd:
May 22 04:33:25 node1 libvirtd[26392]: libvirt version: 5.6.0, package:
10.el8 (CBS <cbs(a)centos.org>, 2020-02-27-01:09:46, )
May 22 04:33:25 node1 libvirtd[26392]: hostname: node1
May 22 04:33:25 node1 libvirtd[26392]: Cannot read CA certificate
'/etc/pki/CA/cacert.pem': No such file or directory
May 22 04:33:25 node1 systemd[1]: libvirtd.service: Main process exited,
code=exited, status=6/NOTCONFIGURED
May 22 04:33:25 node1 systemd[1]: libvirtd.service: Failed with result
'exit-code'.
May 22 04:33:25 node1 systemd[1]: Failed to start Virtualization daemon.
From a fresh CentOS 8.1 minimal install, I've installed the following:
- The 4.4 repo
- cockpit
- ovirt-cockpit-dashboard
- vdsm-gluster (providing glusterfs-server and allowing the Gluster Wizard
to complete)
- gluster-ansible-roles (only on the bootstrap host)
I'm not exactly sure what that initial bit of the playbook does. Comparing
the bootstrap node with another host that has yet to be touched,
/etc/libvirt/libvirtd.conf and /etc/sysconfig/libvirtd are identical on the
two hosts. Yet the bootstrap host can no longer start libvirtd while the
untouched host can. Neither host has the /etc/pki/CA/cacert.pem file.
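For reference, here is roughly how I've been comparing the two hosts and
what I plan to check next (just a sketch; node1/node2 and the paths are my
setup, and the listen_tls angle is only a guess based on the journal error
above):

# compare the effective libvirt config on the bootstrap vs. untouched host
diff <(ssh node1 'grep -v "^#" /etc/libvirt/libvirtd.conf | grep -v "^$"') \
     <(ssh node2 'grep -v "^#" /etc/libvirt/libvirtd.conf | grep -v "^$"')
# check whether TLS listening (which needs the missing CA file) is enabled
# anywhere, including systemd drop-ins
grep -rn listen_tls /etc/libvirt /etc/sysconfig/libvirtd
systemctl cat libvirtd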
Please let me know if I can provide any more information. Thanks!
4 years, 3 months
oVirt trashes the Docker network during installation
by thomas@hoberg.net
I want to run containers and VMs side by side and not necessarily nested. The main reason for that is GPUs, Voltas mostly, used for CUDA machine learning, not for VDI, which is what most VM orchestrators like oVirt or vSphere seem to focus on. And CUDA drivers are notorious for refusing to work under KVM unless you pay $esla.
oVirt is more of a side show in my environment, used to run some smaller functional VMs alongside bigger containers, but also in order to consolidate and re-distribute the local compute node storage as a Gluster storage pool: Kibbutz storage and compute, if you want, very much how I understand the HCI philosophy behind oVirt.
The full integration of containers and VMs is still very much on the roadmap I believe, but I was surprised to see that even co-existence seems to be a problem currently.
So I set up a 3-node HCI on CentOS7 (GPU-less and older) hosts and then added additional (beefier GPGPU) CentOS7 hosts, which have been running CUDA workloads on the latest Docker-CE v19-something.
The installation works fine and I can migrate VMs to these extra hosts etc., but to my dismay the Docker containers on these hosts lose access to the local network, that is, the entire subnet the host is in. For some strange reason I can still ping Internet hosts, perhaps even everything behind the host's gateway, but local connections are blocked.
It would seem that the ovirtmgmt network that the oVirt installation puts in breaks the docker0 bridge that Docker put there first.
I'd consider that a bug, but I'd like to gather some feedback first, if anyone else has run into this problem.
I've repeated this several times in completely distinct environments with the same results:
Simply add a host with a working Docker-CE install to an existing DC/cluster as an oVirt host, then check from a busybox container whether you can still ping anything on that subnet, including the Docker host itself (run the same ping just before you actually add the host, for comparison); a sketch of the repro follows below.
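A minimal version of the repro and of the state I compare before and after
(a sketch; 10.0.0.10 is a placeholder for any host on the local subnet):

# run before and again after adding the host to the oVirt cluster
docker run --rm busybox ping -c 3 10.0.0.10
# state worth capturing in both runs
ip -d link show type bridge    # docker0 vs. the new ovirtmgmt bridge
ip route
iptables -S FORWARD            # Docker's rules vs. what vdsm/firewalld adds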
No, I didn't try this with podman yet, because that's a separate challenge with CUDA: I would love to know if that is already part of QA for oVirt.
4 years, 3 months
Non-storage nodes erroneously included in quorum calculations for HCI?
by thomas@hoberg.net
For my home-lab I operate a 3 node HCI cluster on 100% passive Atoms, mostly to run light infrastructure services such as LDAP and NextCloud.
I then add workstations or even laptops as pure compute hosts to the cluster for bigger but temporary things that might actually run a different OS most of the time or just be shut off. From oVirt's point of view these hosts are simply put into maintenance and then shut down until needed again. No fencing or power management, all manual.
All nodes, even the HCI ones, run CentOS7 with more of a workstation configuration, so updates pile up pretty quickly.
After I recently upgraded one of these extra compute nodes, I found my three node HCI cluster not just faltering, but indeed very hard to reactivate at all.
The faltering is a distinct issue: I have the impression that reboots of oVirt nodes cause broadcast storms on my rather simplistic 10Gbit L2 switch, which a normal CentOS instance (or any other OS) doesn't, but that's for another post.
Now what struck me was that the gluster daemons on the three HCI nodes kept complaining about a lack of quorum long after the network was back to normal, even though all three of them were up, saw each other perfectly in "gluster volume status all", and had no pending healing issues at all.
Glusterd would complain on all three nodes that there was no quorum for the bricks and would stop them.
That went away as soon as I started one additional compute node: a node that was a gluster peer (because an oVirt host added to an HCI cluster always gets put into the gluster trusted pool, even if it contributes no storage) but held no bricks. Immediately the gluster daemons on the three nodes with contributing bricks reported the quorum as met and launched the volumes (and thus all the rest of oVirt), even though in terms of *storage bricks* nothing had changed.
I am afraid that downing the extra compute-only oVirt node will bring the HCI down again: clearly not the type of redundancy it's designed to deliver.
Evidently such compute-only hosts (which are also gluster peers) get included in some quorum deliberations even though they hold not a single brick, neither storage nor arbiter.
To me that seems like a bug, if that is indeed what happens: this is where I need your advice and suggestions.
AFAIK HCI is a late addition to oVirt/RHEV, as storage and compute were originally designed to be completely distinct. In fact there are still remnants of documentation which seem to prohibit using a node for both compute and storage... which is exactly what HCI is all about.
And I have seen compute nodes with "matching" storage (parts of a distinct HCI setup that was taken down but still had all the storage and gluster elements operable) being happily absorbed into an HCI cluster, with all the gluster storage appearing in the GUI etc., without any manual creation or inclusion of bricks: fully automatic (and undocumented)!
In that case it makes sense to widen the scope of the quorum calculation when the additional nodes are hyperconverged elements with contributing bricks. It also seems the only way to turn a 3-node HCI into a 6- or 9-node one.
But if you really just want to add compute nodes without bricks, those shouldn't get "quorum votes" when they have no storage playing a role in the redundancy.
I can easily imagine the missing "if then else" in the code here, but I was actually very surprised to see those failure and success messages coming from glusterd itself, which to my understanding is pretty unrelated to oVirt on top. Not from the management engine (which wasn't running anyway), and not from VDSM.
Re-creating the scenario is rather scary, even if I have gone through it three times already just trying to bring my HCI back up. And the logs are so verbose, all over the place, that I'd like some advice on which ones I should post.
But simply speaking: gluster peers should get no quorum voting rights on volumes unless they contribute bricks. That rule seems broken.
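For anyone who wants to check my reasoning, this is the state I've been
looking at (a sketch; "engine" stands in for each of my volume names, and
the last command is only what I suspect would work around it, not a
recommendation):

# server quorum is evaluated against peers in the trusted pool, not bricks
gluster volume get engine cluster.server-quorum-type
gluster volume get all cluster.server-quorum-ratio
gluster volume get engine cluster.quorum-type   # client-side quorum, for comparison
# possible (risky) workaround: take the volume out of server quorum entirely
gluster volume set engine cluster.server-quorum-type none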
Those in the know, please let me know if I am on a wild goose chase or if there is a real issue here that deserves a bug report.
4 years, 3 months
Shutdown procedure for single host HCI Gluster
by Gianluca Cecchi
Hello,
I'm testing the single node HCI with ovirt-node-ng 4.3.9 iso.
Very nice and many improvements over the last time I tried it. Good!
I have a doubt related to the shutdown procedure for the server.
Here below are my steps:
- Shutdown all VMs (except engine)
- Put into maintenance data and vmstore domains
- Enable Global HA Maintenance
- Shutdown engine
- Shutdown hypervisor
It seems that the last step never completes, and I had to forcibly power off
the hypervisor.
Here is a screenshot of the unmount of /gluster_bricks/engine failing
indefinitely:
https://drive.google.com/file/d/1ee0HG21XmYVA0t7LYo5hcFx1iLxZdZ-E/view?us...
What would be the right steps to take before the final shutdown of the hypervisor?
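In case it helps frame an answer, this is the extra sequence I'm considering
before the final poweroff (only a sketch; the service and volume names are
from my single-node 4.3 setup, and I haven't verified that this is the
sanctioned order):

systemctl stop ovirt-ha-agent ovirt-ha-broker   # stop HA monitoring first
systemctl stop vdsmd supervdsmd                 # so nothing re-activates storage
systemctl stop sanlock                          # release leases on the engine domain
gluster volume stop engine                      # answer the confirmation prompt;
                                                # repeat for data and vmstore
umount /gluster_bricks/engine                   # should no longer hang
poweroff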
Thanks,
Gianluca
4 years, 4 months
Upgrade ovirt from 3.4 to 4.3
by lu.alfonsi@almaviva.it
Good morning,
I have a difficult environment with 20 hypervisors based on oVirt 3.4.3-1 and I would like to reach version 4.3. What are the best steps to achieve this objective?
Thanks in advance
Luigi
4 years, 5 months
PKIX path error
by Stack Korora
Greetings,
I have a running oVirt install that's been working for almost 2 years.
I'm building a _completely_ new install. I mention it because it is
useful for me to compare configurations when I run into issues like this
one.
Right now there are three physical hosts:
1x management where I run the engine and db
2x hypervisor nodes.
I had it up, installed, and running smoothly this morning on 4.3.9.4-1.el7
on Scientific Linux 7.8 (fully patched).
I copied over our 3rd party certs from the running system and restarted
httpd. Perfect. SSL is running!
/etc/pki/ovirt-engine/apache-ca.pem
/etc/pki/ovirt-engine/certs/apache.cer
/etc/pki/ovirt-engine/keys/apache.key.nopass
Next I used ovirt-engine-extension-aaa-ldap-setup to point to our ldap
server. I did the login and search test and both passed on the command
line! Hooray!
Then I went to the web interface...
sun.security.validator.ValidatorException: PKIX path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to
find valid certification path to requested target
I'm digging through logs and I don't see anything close to this error
except nearly the identical message in engine.log.
ERROR [org.ovirt.engine.core.aaa.servlet.SslPostLoginServlet] (default
task-2) [] server_error: sun.security.validator.ValidatorException: PKIX
path building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to
find valid certification path to requested target
I can't log in via the web at all; I only get that message (so I can't
even test the local admin). The aaa ldap configuration it generated is
darn near identical to the old one (just a name change). The certs are the
same. Even when I look in the keystore, the sha1 hashes are the same
between the two environments!
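For what it's worth, this is roughly how I did that keystore comparison on
both engines (a sketch; the truststore path and default password are what I
believe ships with 4.3, so correct me if that assumption is wrong):

keytool -list -v -keystore /etc/pki/ovirt-engine/.truststore -storepass mypass \
    | grep -E 'Alias name|SHA1'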
After over an hour poking at this, I'm completely stumped.
Can someone please give me a pointer on what I should try next?
Thanks!
~Stack~
4 years, 5 months
First ovirt 4.4 installation failing
by wart@caltech.edu
I'm having some trouble setting up my first oVirt system. I have the CentOS 8 installation on the bare metal (ovirt1.ldas.ligo-la.caltech.edu), the ovirt4.4 packages installed, and then try running 'hosted-engine --deploy' to set up my engine (ovirt-engine1.ldas.ligo-la.caltech.edu). For this initial deployment, I accept almost all of the defaults (other than local network-specific settings). However, the hosted-engine deployment fails with:
[ INFO ] TASK [ovirt.hosted_engine_setup : Obtain SSO token using username/password credentials]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.hosted_engine_setup : Wait for the host to be up]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": false, "ovirt_hosts": []}
[...cleanup...]
[ INFO ] TASK [ovirt.hosted_engine_setup : Notify the user about a failure]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
However, when I run 'virsh list', I can still see a HostedEngine1 vm running.
In virt-hosted-engine-setup-20200522153439-e7iw3k.log I see the error:
2020-05-25 11:57:03,897-0500 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 {'changed': False, 'ovirt_hosts': [], 'invocation': {'module_args': {'pattern': 'name=ovirt1.ldas.ligo-la.caltech.edu', 'fetch_nested': False, 'nested_attributes': [], 'all_content': False, 'cluster_version': None}}, '_ansible_no_log': False, 'attempts': 120}
2020-05-25 11:57:03,998-0500 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:107 fatal: [localhost]: FAILED! => {"attempts": 120, "changed": false, "ovirt_hosts": []}
In ovirt-hosted-engine-setup-ansible-bootstrap_local_vm-20200525112504-y2mmzu.log I see the following ansible errors:
2020-05-25 11:36:22,300-0500 DEBUG ansible on_any args localhostTASK: ovirt.hosted_engine_setup : Always revoke the SSO token kwargs
2020-05-25 11:36:23,766-0500 ERROR ansible failed {
"ansible_host": "localhost",
"ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
"ansible_result": {
"_ansible_no_log": false,
"changed": false,
"invocation": {
"module_args": {
"ca_file": null,
"compress": true,
"headers": null,
"hostname": null,
"insecure": null,
"kerberos": false,
"ovirt_auth": {
"ansible_facts": {
"ovirt_auth": {
"ca_file": null,
"compress": true,
"headers": null,
"insecure": true,
"kerberos": false,
"timeout": 0,
"token": "tF4ZMU0Q23zS13W2vzyhkswGMB4XAXZCFiPg9IVvbJXkPq9MFmne40wvCKaQOJO_TkYOpfxe78r9HHJcSrUWCQ",
"url": "https://ovirt-engine1.ldas.ligo-la.caltech.edu/ovirt-engine/api"
}
},
"attempts": 1,
"changed": false,
"failed": false
},
"password": null,
"state": "absent",
"timeout": 0,
"token": null,
"url": null,
"username": null
}
},
"msg": "You must specify either 'url' or 'hostname'."
},
"ansible_task": "Always revoke the SSO token",
"ansible_type": "task",
"status": "FAILED",
"task_duration": 2
}
2020-05-25 11:36:23,767-0500 DEBUG ansible on_any args <ansible.executor.task_result.TaskResult object at 0x7f15adaffa58> kwargs ignore_errors:True
Then further down:
2020-05-25 11:57:05,063-0500 DEBUG var changed: host "localhost" var "ansible_failed_result" type "<class 'dict'>" value: "{
"_ansible_no_log": false,
"_ansible_parsed": true,
"attempts": 120,
"changed": false,
"failed": true,
"invocation": {
"module_args": {
"all_content": false,
"cluster_version": null,
"fetch_nested": false,
"nested_attributes": [],
"pattern": "name=ovirt1.ldas.ligo-la.caltech.edu"
}
},
"ovirt_hosts": []
}"
2020-05-25 11:57:05,063-0500 ERROR ansible failed {
"ansible_host": "localhost",
"ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
"ansible_result": {
"_ansible_no_log": false,
"attempts": 120,
"changed": false,
"invocation": {
"module_args": {
"all_content": false,
"cluster_version": null,
"fetch_nested": false,
"nested_attributes": [],
"pattern": "name=ovirt1.ldas.ligo-la.caltech.edu"
}
},
"ovirt_hosts": []
},
"ansible_task": "Wait for the host to be up",
"ansible_type": "task",
"status": "FAILED",
"task_duration": 1235
}
2020-05-25 11:57:05,063-0500 DEBUG ansible on_any args <ansible.executor.task_result.TaskResult object at 0x7f15ad92dcc0> kwargs ignore_errors:None
Not being very familiar with ansible, I'm not sure where to look next for the root cause of the problem.
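In case it helps, this is how I plan to query the engine directly for the
host's state, using the same API endpoint and search pattern the playbook
uses (a sketch; the admin password is a placeholder and json.tool is just
to pretty-print):

curl -sk -u admin@internal:MyEnginePassword \
    -H 'Accept: application/json' \
    'https://ovirt-engine1.ldas.ligo-la.caltech.edu/ovirt-engine/api/hosts?search=name%3Dovirt1.ldas.ligo-la.caltech.edu' \
    | python3 -m json.tool

On the host side I'll also be watching /var/log/vdsm/vdsm.log and
'journalctl -u vdsmd' while the deploy waits, since the empty ovirt_hosts
list suggests the host never finished registering with the engine.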
--Michael Thomas
4 years, 5 months
Q: Which types of tests and tools are used?
by Juergen Novak
Hi,
can anybody help me find some information about the types of tests used in
the project and the tools used?
Particularly interesting would be the tools and tests used for the Python
code, but any information about the Java side would also be appreciated.
I already scanned the documentation, but I mainly found information about
mocking tools.
Thank you!
/juergen
4 years, 5 months
Ovirt 4.4 Migration assistance needed.
by Strahil Nikolov
Hello All,
I would like to ask for some assistance with planning the upgrade to 4.4.
I have issues with OVN (it doesn't work at all), so I would like to start fresh with the HE.
The plan so far (downtime is not an issue):
1. Reinstall the nodes one by one and rejoin them to the Gluster TSP
2. Wipe the HostedEngine's gluster volume
3. Deploy a fresh hosted engine
4. Import the storage domains (gluster) back to the engine and import the VMs
Do you see any issues with the plan?
Any problems expected if the VMs have snapshots? What about the storage domain version?
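For the storage domain version question, this is the check I intend to run
on one of the current hosts before wiping anything (a sketch; the mount
path is just the usual glusterSD layout on my nodes):

grep -H '^VERSION' /rhev/data-center/mnt/glusterSD/*/*/dom_md/metadata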
Thanks in Advance.
Best Regards,
Strahil Nikolov
4 years, 5 months
Tasks stuck waiting on one another after failed storage migration (yet not visible on SPM)
by David Sekne
Hello,
I'm running oVirt version 4.3.9.4-1.el7.
After a failed live storage migration a VM got stuck with a snapshot.
Checking the engine logs I can see that the snapshot removal task is
waiting for the Merge to complete, and vice versa.
2020-05-26 18:34:04,826+02 INFO
[org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback]
(EE-ManagedThreadFactory-engineScheduled-Thread-70)
[90f428b0-9c4e-4ac0-8de6-1103fc13da9e] Command
'RemoveSnapshotSingleDiskLive' (id: '60ce36c1-bf74-40a9-9fb0-7fcf7eb95f40')
waiting on child command id: 'f7d1de7b-9e87-47ba-9ba0-ee04301ba3b1'
type:'Merge' to complete
2020-05-26 18:34:04,827+02 INFO
[org.ovirt.engine.core.bll.MergeCommandCallback]
(EE-ManagedThreadFactory-engineScheduled-Thread-70)
[90f428b0-9c4e-4ac0-8de6-1103fc13da9e] Waiting on merge command to complete
(jobId = f694590a-1577-4dce-bf0c-3a8d74adf341)
2020-05-26 18:34:04,845+02 INFO
[org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
(EE-ManagedThreadFactory-engineScheduled-Thread-70)
[90f428b0-9c4e-4ac0-8de6-1103fc13da9e] Command 'RemoveSnapshot' (id:
'47c9a847-5b4b-4256-9264-a760acde8275') waiting on child command id:
'60ce36c1-bf74-40a9-9fb0-7fcf7eb95f40' type:'RemoveSnapshotSingleDiskLive'
to complete
2020-05-26 18:34:14,277+02 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring]
(EE-ManagedThreadFactory-engineScheduled-Thread-96) [] VM Job
[f694590a-1577-4dce-bf0c-3a8d74adf341]: In progress (no change)
I cannot see any running tasks on the SPM (vdsm-client Host
getAllTasksInfo). I also cannot find the task ID in any of the other nodes'
logs.
I already tried restarting the Engine (didn't help).
To start with, I'm puzzled as to where this task is actually queued.
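Here is where I'm planning to look next, directly in the engine database,
since the job only seems to exist on the engine side (a sketch; the
scl/psql invocation is what works on my 4.3 engine, and the table names
are my understanding of the schema, so please correct me if they differ):

su - postgres -c 'scl enable rh-postgresql10 -- psql engine' <<'SQL'
-- the RemoveSnapshot / RemoveSnapshotSingleDiskLive / Merge commands from the log
SELECT command_id, command_type, status, created_at
  FROM command_entities
 WHERE command_id IN ('47c9a847-5b4b-4256-9264-a760acde8275',
                      '60ce36c1-bf74-40a9-9fb0-7fcf7eb95f40',
                      'f7d1de7b-9e87-47ba-9ba0-ee04301ba3b1');
-- the block job the Merge callback keeps polling
SELECT * FROM vm_jobs
 WHERE vm_job_id = 'f694590a-1577-4dce-bf0c-3a8d74adf341';
SQL

If it turns out the commands are only stuck engine-side, the helpers under
/usr/share/ovirt-engine/setup/dbutils/ (taskcleaner.sh, unlock_entity.sh)
look like the documented next step, but I'd rather hear from someone who
has done this before I touch the database.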
Any ideas on how I could resolve this?
Thank you.
Regards,
David
4 years, 5 months