hosted-engine --vm-status shows a ghost node that is no longer in the cluster: how to remove it?
by Diego Ercolani
engine 4.5.2.4
The issue is that in my cluster when I use the:
[root@ovirt-node3 ~]# hosted-engine --vm-status
--== Host ovirt-node3.ovirt (id: 1) status ==--
Host ID : 1
Host timestamp : 1633143
Score : 3400
Engine status : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname : ovirt-node3.ovirt
Local maintenance : False
stopped : False
crc32 : 1cbfcd19
conf_on_shared_storage : True
local_conf_timestamp : 1633143
Status up-to-date : True
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=1633143 (Wed Aug 31 14:37:53 2022)
host-id=1
score=3400
vm_conf_refresh_time=1633143 (Wed Aug 31 14:37:53 2022)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
--== Host ovirt-node1.ovirt (id: 2) status ==--
Host ID : 2
Host timestamp : 373629
Score : 0
Engine status : unknown stale-data
Hostname : ovirt-node1.ovirt
Local maintenance : True
stopped : False
crc32 : 12a6eb81
conf_on_shared_storage : True
local_conf_timestamp : 373630
Status up-to-date : False
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=373629 (Tue Jun 14 16:48:50 2022)
host-id=2
score=0
vm_conf_refresh_time=373630 (Tue Jun 14 16:48:50 2022)
conf_on_shared_storage=True
maintenance=True
state=LocalMaintenance
stopped=False
--== Host ovirt-node2.ovirt (id: 3) status ==--
Host ID : 3
Host timestamp : 434247
Score : 3400
Engine status : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname : ovirt-node2.ovirt
Local maintenance : False
stopped : False
crc32 : badb3751
conf_on_shared_storage : True
local_conf_timestamp : 434247
Status up-to-date : True
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=434247 (Wed Aug 31 14:37:45 2022)
host-id=3
score=3400
vm_conf_refresh_time=434247 (Wed Aug 31 14:37:45 2022)
conf_on_shared_storage=True
maintenance=False
state=EngineDown
stopped=False
--== Host ovirt-node4.ovirt (id: 4) status ==--
Host ID : 4
Host timestamp : 1646655
Score : 3400
Engine status : {"vm": "up", "health": "good", "detail": "Up"}
Hostname : ovirt-node4.ovirt
Local maintenance : False
stopped : False
crc32 : 1a16027e
conf_on_shared_storage : True
local_conf_timestamp : 1646655
Status up-to-date : True
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=1646655 (Wed Aug 31 14:37:43 2022)
host-id=4
score=3400
vm_conf_refresh_time=1646655 (Wed Aug 31 14:37:43 2022)
conf_on_shared_storage=True
maintenance=False
state=EngineUp
stopped=False
The problem is that ovirt-node1.ovirt is no longer in the cluster; in the host list presented by the UI there is correctly no ovirt-node1, and ovirt-node1 appears only in this command-line output.
I did a full-text search in the engine DB, but node1 doesn't appear anywhere, and a grep across the filesystem doesn't find anything either.
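From what I can tell, the stale entry should live in the hosted-engine HA metadata on the shared storage rather than in the engine DB, which would explain why the searches come up empty. A hedged sketch of one possible cleanup, assuming host id 2 (ovirt-node1) is the ghost entry and that host really will never come back; run from one of the surviving hosted-engine hosts:
# optional precaution: hold HA state transitions while touching the metadata
hosted-engine --set-maintenance --mode=global
# wipe the metadata slot of the removed host (id 2 in this example)
hosted-engine --clean-metadata --host-id=2 --force-clean
# resume normal HA operation
hosted-engine --set-maintenance --mode=none
Is this the right way to get rid of the ghost entry, or is there something cleaner?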
2 years, 2 months
How many oVirt clusters, hosts and VMs do you have running?
by jlhm@usa.net
Hi, just trying to understand how our oVirt deployment compares to others.
- 75 clusters (and 75 data centers, as we map them 1 to 1 due to security requirements) spanning 3+ physical data centers
- 328 oVirt servers
- 1,900+ VMs running
The majority are still on 4.3 (CentOS 7), but our engines (4 of them) run Red Hat 8/oVirt 4.4. We are working to upgrade all hypervisors to Red Hat 8 (or Rocky 8)/oVirt 4.4.
When that is done we will start upgrading to the latest oVirt version, but it takes time due to the size of our environment, and we move slowly to ensure stability.
2 years, 2 months
Ubuntu NFS
by thilburn@generalpacific.com
Hello,
I was having trouble getting an Ubuntu 22.04 NFS share working, and after searching for hours I was able to figure out what was needed. Below is what I found, in case anyone else runs into this.
My error was:
engine.log:
"...Unexpected return value: Status [code=701, message=Could not initialize cluster lock: ()]"
On the host, supervdsm.log:
- open error -13 EACCES: no permission to open /ThePath/ids
- check that daemon user sanlock *** group sanlock *** has access to disk or file.
The fix was commenting out manage-gids=y in /etc/nfs.conf (Ubuntu ships it set to y); with the line commented out, mountd falls back to the upstream default, which is no.
It looks like in the past the fix was to change the RPCMOUNTDOPTS="--manage-gids" line in /etc/default/nfs-kernel-server, which I didn't need to change.
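For clarity, this is the whole change, assuming the stock Ubuntu 22.04 /etc/nfs.conf layout where manage-gids sits in the [mountd] section:
# /etc/nfs.conf
[mountd]
# manage-gids=y
As far as I can tell, with the line commented out mountd keeps the group list sent by the client instead of re-resolving the uid's groups on the server, so sanlock's access through its supplementary group works again. Afterwards restart the NFS server (systemctl restart nfs-kernel-server) and re-activate the storage domain.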
2 years, 2 months
How to kill a backup operation
by Diego Ercolani
Hello, I saw there are other threads asking how to delete disk snapshots left over from backup operations.
We definitely need a tool to kill pending backup operations and unlock locked snapshots.
I think this is very frustrating: oVirt is a good piece of software, but it's very immature in a dirty asynchronous world.
We need a unified toolbox for manual cleanup and database housekeeping.
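The only partial workaround I know of for the locked-snapshot half of this is the engine's unlock_entity.sh helper (a sketch only; it clears the lock in the engine DB but does not touch the underlying image chain, and the snapshot UUID below is a placeholder):
# list entities the engine currently considers locked
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -q -t all -u engine
# unlock a specific snapshot
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot -u engine <snapshot_uuid>
As far as I know there is still nothing to stop a pending backup operation itself.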
2 years, 2 months
Interested in contributing with Spanish translation.
by Luis Pereida
Hello,
My name is Luis Pereida, I am Mexican, from Guadalajara and I am currently
an application security specialist.
Some time ago I came across the oVirt project, and it helped me a lot in many
situations where virtualization was the perfect option.
For a while now I have been thinking about how to contribute to the
project, and talking with some friends, they said they would like to have
documentation in Spanish. Although we can get by in English, the context or
expressions are often hard to understand.
I would like to help with that. How can I do it? I see that it is necessary
to use a Zanata account. How can I get an account?
Regards and thanks for being so supportive of the community.
2 years, 2 months
Q: Engine 4.4.10 and New Host on CentOS Stream 9
by Andrei Verovski
Hi,
I have engine version 4.4.10.7-1.el8. Is it possible to set up a new host
on CentOS Stream 9, or do I need to upgrade the engine to version 4.5
first (which is not possible right now because of some quite old nodes)?
Thanks
Andrei
2 years, 2 months
Moving SelfHosted Engine
by murat.celebi@dbpro.com.tr
Hello
We have a production oVirt installation with 1 data center and 2 clusters.
The Default cluster has 2 hosts in it: kvm01 and kvm02.
The PrimarySite cluster has 2 hosts in it: kvm03 and kvm04.
Our self-hosted engine is running on kvm01. We need to migrate the self-hosted engine to kvm03 or kvm04, since kvm01 and kvm02 are going to be retired.
Does anyone have an idea how this can be accomplished?
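One path I am considering (just a sketch, assuming a backup-and-redeploy approach rather than a live migration of the HE VM across clusters; the file names below are placeholders):
# on the current engine VM: take a full engine backup
engine-backup --mode=backup --file=engine.bak --log=engine-backup.log
# then redeploy the hosted engine on kvm03, restoring from that backup
hosted-engine --deploy --restore-from-file=engine.bak
Is this restore-based redeploy the right way to move the self-hosted engine to the new hosts, or is there a simpler route?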
2 years, 2 months
Failed to delete snapshot (again)
by Giulio Casella
Hi folks,
For some months I have had an issue with snapshot removal (I have the Storware
vProtect backup system, which makes heavy use of snapshots).
After some time spent on a bugzilla
(https://bugzilla.redhat.com/show_bug.cgi?id=1948599) we discovered that
my issue does not depend on that bug :-(
So they pointed me here again.
Briefly: sometimes snapshot removal fails, leaving the snapshot in an illegal
state. Trying to remove it again (via the oVirt UI) keeps failing and doesn't
help. The only way to rebuild a consistent situation is to live migrate the
affected disk to another storage domain; after moving the disk, the snapshot
is no longer marked illegal and I can remove it. You can imagine this is a
bit tricky, especially for large disks.
In my logs I can find:
2022-08-29 09:17:11,890+02 ERROR
[org.ovirt.engine.core.bll.MergeStatusCommand]
(EE-ManagedExecutorService-commandCoordinator-Thread-1)
[0eced56f-689d-422b-b15c-20b824377b08] Failed to live merge. Top volume
f8f84b1c-53ab-4c99-a01d-743ed3d7859b is still in qemu chain
[0ea89fbc-d39a-48ff-aa2b-0381d79d7714,
55bb387f-01a6-41b6-b585-4bcaf2ea5e32, f8f84b1c-53ab-4c99-a01d-743ed3d7859b]
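(A hedged way to double-check on the hypervisor whether the top volume really is still part of the active qemu chain; the VM name and the image path below are placeholders:
# list the disks of the running VM and their current source paths
virsh -r domblklist <vm_name>
# walk the backing chain of the active volume
qemu-img info --backing-chain /rhev/data-center/mnt/<storage>/<sd_uuid>/images/<img_uuid>/<active_volume>
If the f8f84b1c-53ab-4c99-a01d-743ed3d7859b volume still shows up there, the block commit never completed on the qemu side, which would match the engine error above.)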
My setup is ovirt-engine-4.5.2.4-1.el8.noarch, with hypervisors based on
oVirt Node 4.5.2 (vdsm-4.50.2.2-1.el8).
Thank you in advance.
Regards,
gc
2 years, 2 months
Ovirt 4.4.7, can't renew certificate of ovirt engine (certificates expired)
by vk@itiviti.com
Hi Team,
I'm looking for your help since I didn't find any clear documentation. Is there anywhere on the oVirt website clear documentation on how to renew the engine certificates located in /etc/pki/ovirt-engine/certs/?
We have an engine GUI not working, showing error message "PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed".
After checking, all the cert in /etc/pki/ovirt-engine/certs/ are expired.
I didn't find clear documentation on the oVirt website, or even on the Red Hat website (it was always about hosts, not the engine).
Anyway, I've read that the renewal process can be done via "engine-setup --offline", but when I try it, it generates this error:
--== PKI CONFIGURATION ==--
[ ERROR ] Failed to execute stage 'Environment customization': Unable to load certificate. See https://cryptography.io/en/latest/faq/#why-can-t-i-import-my-pem-file for more details.
and in log file:
File "/usr/lib64/python3.6/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 1371, in load_pem_x509_certificate
"Unable to load certificate. See https://cryptography.io/en/la"
ValueError: Unable to load certificate. See https://cryptography.io/en/latest/faq/#why-can-t-i-import-my-pem-file for more details.
2022-08-29 19:16:29,502+0200 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Environment customization': Unable to load certificate. See https://cryptography.io/en/latest/faq/#why-can-t-i-import-my-pem-file for more details.
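(A hedged aside: the cryptography error above usually means one of the PEM files can't even be parsed, not merely that it has expired. A quick loop like the one below, with the *.cer glob being an assumption about the file extensions in that directory, shows which files parse and when they expire:
for f in /etc/pki/ovirt-engine/certs/*.cer; do
    echo "== $f"
    openssl x509 -in "$f" -noout -subject -enddate || echo "   cannot parse $f"
done
)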
I've also tried the manual procedure (using /usr/share/ovirt-engine/bin/pki-enroll-pkcs12.sh) mentioned in https://users.ovirt.narkive.com/4ugjgicE/ovirt-regenerating-new-ssl-certi... (message from Alon Bar-Lev), but the 4th command always says I entered a wrong password, even though it is correct.
We are blocked here and can't use our oVirt cluster, so this is pretty urgent for us.
Thanks a lot in advance
2 years, 2 months
unable to bring up gluster bricks after 4.5 upgrade
by Jayme
Hello All,
I've been struggling with a few issues upgrading my 3-node HCI cluster from
4.4 to 4.5.
At present the self hosted engine VM is properly running oVirt 4.5 on
CentOS 8x stream.
I set the first host node in maintenance and installed the new node-ng image. I
ran into an issue with rescue mode on boot, which appears to have been related
to the LVM devices bug. I was able to work past that and get the node to boot.
The node running the 4.5.2 image boots properly and the gluster/LVM mounts etc.
all look good. I am able to activate the host and run VMs on it;
however, the oVirt CLI is showing that all bricks on the host are DOWN.
I was unable to get the bricks back up even after doing a force start of
the volumes.
Here is the glusterd log from the host in question when I try a force start
on the engine volume (the other volumes are similar):
==> glusterd.log <==
The message "I [MSGID: 106568] [glusterd-svc-mgmt.c:266:glusterd_svc_stop]
0-management: bitd service is stopped" repeated 2 times between [2022-08-29
18:09:56.027147 +0000] and [2022-08-29 18:10:34.694144 +0000]
[2022-08-29 18:10:34.695348 +0000] I [MSGID: 106618]
[glusterd-svc-helper.c:909:glusterd_attach_svc] 0-glusterd: adding svc
glustershd (volume=engine) to existing process with pid 2473
[2022-08-29 18:10:34.695669 +0000] I [MSGID: 106131]
[glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already
stopped
[2022-08-29 18:10:34.695691 +0000] I [MSGID: 106568]
[glusterd-svc-mgmt.c:266:glusterd_svc_stop] 0-management: scrub service is
stopped
[2022-08-29 18:10:34.695832 +0000] I [MSGID: 106617]
[glusterd-svc-helper.c:698:glusterd_svc_attach_cbk] 0-management: svc
glustershd of volume engine attached successfully to pid 2473
[2022-08-29 18:10:34.703718 +0000] E [MSGID: 106115]
[glusterd-mgmt.c:119:gd_mgmt_v3_collate_errors] 0-management: Post commit
failed on gluster2.xxxxx. Please check log file for details.
[2022-08-29 18:10:34.703774 +0000] E [MSGID: 106115]
[glusterd-mgmt.c:119:gd_mgmt_v3_collate_errors] 0-management: Post commit
failed on gluster1.xxxxx. Please check log file for details.
[2022-08-29 18:10:34.703797 +0000] E [MSGID: 106664]
[glusterd-mgmt.c:1969:glusterd_mgmt_v3_post_commit] 0-management: Post
commit failed on peers
[2022-08-29 18:10:34.703800 +0000] E [MSGID: 106664]
[glusterd-mgmt.c:2664:glusterd_mgmt_v3_initiate_all_phases] 0-management:
Post commit Op Failed
If I run the start command manually on the host CLI:
gluster volume start engine force
volume start: engine: failed: Post commit failed on gluster1.xxxx. Please
check log file for details.
Post commit failed on gluster2.xxxx. Please check log file for details.
I feel like this may be an issue with the difference in major GlusterFS
versions between the nodes, but I am unsure. The other nodes are running
ovirt-node-ng-4.4.6.3.
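(A hedged way to check whether a GlusterFS version or op-version mismatch is really in play, using stock gluster CLI commands run on the upgraded node and on one of the 4.4.6 nodes:
gluster --version
gluster volume get all cluster.op-version
gluster volume get all cluster.max-op-version
gluster peer status
A big gap between installed versions, or the cluster stuck on an old cluster.op-version, would at least support the mixed-major-version suspicion.)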
At this point I am afraid to bring down any other node to attempt upgrading
it without the bricks in UP status on the first host. I do not want to lose
quorum and potentially disrupt running VMs.
Any idea why I can't seem to start the volumes on the upgraded host?
Thanks!
2 years, 2 months