VDSM Issue after Upgrade of Node in HCI
by Abe E
So I have my 2nd node in my cluster that showed an upgrade option in OVIRT.
I put it in maint mode and ran the upgrade, it went through it but at one point it lost its internet connection or connection within the gluster, it didnt get to the reboot process and simply lost its connection to the engine from there.
I can see the gluster is still running and was able to keep all 3 glusters syncing but it seems the VDSM may be the culprit here.
ovirt-ha-agent wont start and the hosted-engine --connect-storage returns:
Traceback (most recent call last):
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/connect_storage_server.py", line 30, in <module>
timeout=ohostedcons.Const.STORAGE_SERVER_TIMEOUT,
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 312, in connect_storage_server
sserver.connect_storage_server(timeout=timeout)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 411, in connect_storage_server
timeout=timeout,
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 474, in connect_vdsm_json_rpc
__vdsm_json_rpc_connect(logger, timeout)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 415, in __vdsm_json_rpc_connect
timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds
VDSM just keeps loop restart and failing, vdsm-tool configure --force throws this :
[root@ovirt-2 ~]# vdsm-tool configure --force
Checking configuration status...
sanlock is configured for vdsm
abrt is already configured for vdsm
Current revision of multipath.conf detected, preserving
lvm is configured for vdsm
Managed volume database is already configured
libvirt is already configured for vdsm
SUCCESS: ssl configured to true. No conflicts
Running configure...
libsepol.context_from_record: type insights_client_var_lib_t is not defined
libsepol.context_from_record: could not create context structure
libsepol.context_from_string: could not create context structure
libsepol.sepol_context_to_sid: could not convert system_u:object_r:insights_client_var_lib_t:s0 to sid
invalid context system_u:object_r:insights_client_var_lib_t:s0
libsemanage.semanage_validate_and_compile_fcontexts: setfiles returned error code 255.
Traceback (most recent call last):
File "/usr/bin/vdsm-tool", line 209, in main
return tool_command[cmd]["command"](*args)
File "/usr/lib/python3.6/site-packages/vdsm/tool/__init__.py", line 40, in wrapper
func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/vdsm/tool/configurator.py", line 145, in configure
_configure(c)
File "/usr/lib/python3.6/site-packages/vdsm/tool/configurator.py", line 92, in _configure
getattr(module, 'configure', lambda: None)()
File "/usr/lib/python3.6/site-packages/vdsm/tool/configurators/sebool.py", line 88, in configure
_setup_booleans(True)
File "/usr/lib/python3.6/site-packages/vdsm/tool/configurators/sebool.py", line 60, in _setup_booleans
sebool_obj.finish()
File "/usr/lib/python3.6/site-packages/seobject.py", line 340, in finish
self.commit()
File "/usr/lib/python3.6/site-packages/seobject.py", line 330, in commit
rc = semanage_commit(self.sh)
OSError: [Errno 0] Error
Anyone have ideas where I could recover this, I am not sure if something corrupted on update or on a reboot -- I would prefer updating notes from the CLI next time but unfortunately I have not looked that far, it would have helped me see what failed and where much easier.
2 years, 8 months
Cloning VM selecting part of the disks
by Gianluca Cecchi
Hello,
in recent versions of oVirt (eg my last 4.4.10) there is the feature to
make a clone of a running VM.
This operation passes through a temporary VM snapshot (then automatically
deleted) and cloning of this snapshot.
Sometimes there is a need to clone a VM but only a subset of its disks is
required (eg in my case I want to retain boot disk, 20Gb, and dedicated sw
disk, 20Gb, but not data disk, that usually is big... in my case 200Gb).
In this scenario I have to go the old path where I explicitly create a
snapshot of the VM, where I can select a subset of the disks, then I clone
the snapshot and last I delete the snapshot.
Do you think it is interesting to have the option of selecting disks when
you clone a running VM and go automatic..?
If I want to open a bugzilla as RFE, what components and options I have to
select?
Thanks,
Gianluca
2 years, 8 months
Engine across Clusters
by Abe E
Has anyone setup hype converged gluster (3Nodes) and then added more after while maintaining access to the engine?
An oversight on my end was 2 fold, Engine gluster being on engine nodes and new nodes requiring their own cluster due to different CPU type.
So basically I am trying to see if I can setup a new cluster for my other nodes that require it while trying to give them ability to run the engine and ofcourse because they arent part of the engine cluster, we all know how that goes. Has anyone dealt with this or worked around it, any advices?
2 years, 8 months
Gluster issue with brick going down
by Chris Adams
I have a hyper-converged cluster running oVirt 4.4.10 and Gluster 8.6.
Periodically, one brick of one volume will drop out, but it's seemingly
random as to which volume and brick is affected. All I see in the brick
log is:
[2022-03-19 13:27:36.360727] W [MSGID: 113075] [posix-helpers.c:2135:posix_fs_health_check] 0-vmstore-posix: aio_read_cmp_buf() on /gluster_bricks/vmstore/vmstore/.glusterfs/health_check returned ret is -1 error is Structure needs cleaning
[2022-03-19 13:27:36.361160] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vmstore-posix: health-check failed, going down
[2022-03-19 13:27:36.361395] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vmstore-posix: still alive! -> SIGTERM
Searching around, I see references to similar issues, but no real
solutions. I see a suggestion that changing the health-check-interval
from 10 to 30 seconds helps, but it looks like 30 seconds is the default
with this version of Gluster (and I don't see it explicitly set for any
of my volumes).
While "Structure needs cleaning" appears to be an XFS filesystem error,
I don't see any XFS errors from the kernel.
This is a low I/O cluster - the storage network is on two 10 gig
switches with a two-port LAG to each server, but typically is only
seeing a few tens of megabits per second.
--
Chris Adams <cma(a)cmadams.net>
2 years, 8 months
Unable to deploy ovirt 4.4 on alma 8.5
by Richa Gupta
Hi Team,
While installing ovirt 4.4 on alma Linux 8.5 we are facing following issue:
[ INFO ] TASK [ovirt.ovirt.engine_setup : Install oVirt Engine package]
[ ERROR ] fatal: [localhost -> 192.168.222.56]: FAILED! => {"changed": false, "msg": "Failed to download metadata for repo 'ovirt-4.4': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried", "rc": 1, "results": []}
Using repo as following:
[ovirt-4.4]
name=Latest oVirt 4.4 Release
#baseurl=https://resources.ovirt.org/pub/ovirt-4.4/rpm/el$releasever/
mirrorlist=https://mirrorlist.ovirt.org/mirrorlist-ovirt-4.4-el$releasever
enabled=1
countme=1
fastestmirror=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-ovirt-4.4
Can someone please help in resolving this issue?
2 years, 8 months
all hosts lost scenario - restore engine from backup file - LOCKED HOST 4.3 version
by goosesk blabla
Hi,
I have problem with LOCKED host after restore of engine.
i am trying many disaster scenarios and one from them is loosing of all hosts with self hosted engine included with backup file of engine only.
When all hosts and engine VM were destroyed at the same time, i installed new self hosted engine on hardware from one lost host. Then tried restore engine
engine-backup --mode=restore --scope=all --file=backup/file/ovirt-engine-backup --log=backup/log/ovirt-engine-backup.log --no-restore-permissions --provision-db --provision-dwh-db –provision-reports-db
After this, lost engine was restored successfully. I had offline Datacentrer, dead hosts and VM. I added new hosts, which were able to connect to domain storage automatically and I was able to start VM.
New host cannot has the same IP as already dead host or hardware UID. This can be solved by setup old dead host to maintenance mode and then delete host. Then the same hardware and IP can be reused.
But, host where old engine was running is LOCKED. You cannot migrate engine VM, cannot start hosted engine from new engine GUI, cannot setup host to maintenance mode and delete.. This is problem when your hardware for hosts is limited.
I would like to ask how to solve this situation, Is there any way how to “reinstall” old hosts? I see that SSL certificates were changed with new installation of new hosts, but I don’t know if there is way how to enable old dead hosts.
Is there any way how to destroy old engine VM to add host back to work ?
Thank you
2 years, 8 months