Failing "change Master storage domain" from gluster to iscsi
by Diego Ercolani
In the current oVirt release (4.5.4) I'm experiencing a failure when changing the master storage domain from a Gluster volume to any other domain.
The GUI only reports a "general" error.
Looking at the engine log:
2023-03-28 11:51:16,601Z WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-46) [] Unexpected return value: TaskStatus [code=331, message=value=Tar command failed: ({'reader': {'cmd': ['/usr/bin/tar', 'cf', '-', '--exclude=./lost+found', '-C', '/rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/master', '.'], 'rc': 1, 'err': '/usr/bin/tar: ./tasks/20a9aa7f-80f5-403b-b296-ea95d9fd3f97: file changed as we read it\n/usr/bin/tar: ./tasks/87783efa-42ac-4cd9-bda5-ad68c59bb881/87783efa-42ac-4cd9-bda5-ad68c59bb881.task: file changed as we read it\n'}},) abortedcode=331]
2023-03-28 11:51:16,601Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-46) [] Failed in 'HSMGetAllTasksStatusesVDS' method
It seems that something is changing files under the directory, but:
[vdsm@ovirt-node2 4745320f-bfc3-46c4-8849-b4fe8f1b2de6]$ /usr/bin/tar -cf - --exclude=./lost+found -C '/rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/master' '.' > /tmp/tar.tar
/usr/bin/tar: ./tasks/20a9aa7f-80f5-403b-b296-ea95d9fd3f97: file changed as we read it
/usr/bin/tar: ./tasks: file changed as we read it
[vdsm@ovirt-node2 master]$ find '/rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/master' -mtime -1
/rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/master/tasks
[vdsm@ovirt-node2 master]$ ls -l /rhev/data-center/mnt/glusterSD/ovirt-node3.ovirt:_gv0/4745320f-bfc3-46c4-8849-b4fe8f1b2de6/master/
total 0
drwxr-xr-x. 6 vdsm kvm 182 Mar 28 11:51 tasks
drwxr-xr-x. 2 vdsm kvm 6 Mar 26 20:36 vms
[vdsm@ovirt-node2 master]$ date; stat tasks
Tue Mar 28 12:04:06 UTC 2023
File: tasks
Size: 182 Blocks: 0 IO Block: 131072 directory
Device: 31h/49d Inode: 12434008067414313592 Links: 6
Access: (0755/drwxr-xr-x) Uid: ( 36/ vdsm) Gid: ( 36/ kvm)
Context: system_u:object_r:fusefs_t:s0
Access: 2023-03-28 11:55:17.771046746 +0000
Modify: 2023-03-28 11:51:16.641145314 +0000
Change: 2023-03-28 11:51:16.641145314 +0000
Birth: -
It seems the tasks directory hasn't been touched since
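For what it's worth, a possible way to check whether leftover SPM tasks are what keeps rewriting the tasks directory, and to clear them before retrying the master migration (the vdsm-client verbs below are my assumption, please verify them against your vdsm version, and only clear tasks that are already finished):

  # on the SPM host: list the tasks vdsm is still tracking
  vdsm-client Host getAllTasksStatuses
  vdsm-client Host getAllTasksInfo

  # a finished task that is still listed keeps its files under master/tasks;
  # clearing it should stop tar from reporting "file changed as we read it"
  # (the task UUID below is the one from the engine log above)
  vdsm-client Task clear taskID=20a9aa7f-80f5-403b-b296-ea95d9fd3f97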
clock skew in hosted engine and VMs due to slow IO storage
by Diego Ercolani
I don't know why (I suppose it's related to storage speed), but the virtual machines tend to develop a clock skew ranging from a few days to more than a century forward (year 2177).
I see in the journal of the engine:
Mar 28 13:19:40 ovirt-engine.ovirt NetworkManager[1158]: <info> [1680009580.2045] dhcp4 (eth0): state changed new lease, address=192.168.123.20
Mar 28 13:24:40 ovirt-engine.ovirt NetworkManager[1158]: <info> [1680009880.2042] dhcp4 (eth0): state changed new lease, address=192.168.123.20
Mar 28 13:29:40 ovirt-engine.ovirt NetworkManager[1158]: <info> [1680010180.2039] dhcp4 (eth0): state changed new lease, address=192.168.123.20
Apr 01 08:15:42 ovirt-engine.ovirt chronyd[1072]: Forward time jump detected!
Apr 01 08:15:42 ovirt-engine.ovirt NetworkManager[1158]: <info> [1680336942.4396] dhcp4 (eth0): activation: beginning transaction (timeout in 45 seconds)
Apr 01 08:15:42 ovirt-engine.ovirt chronyd[1072]: Can't synchronise: no selectable sources
When this happens on the hosted engine, typically:
1. the DWH becomes inconsistent, as I reported here: https://lists.ovirt.org/archives/list/users@ovirt.org/thread/KPW5FFKG3AI6... or https://lists.ovirt.org/archives/list/users@ovirt.org/thread/WUNZUSZ2ARRL...
2. the skew causes the engine to kick off the nodes, which then appear "down" or stuck in the "connecting" state
This compromises every task in a pending state and triggers countermeasures in the ovirt-engine manager and in the vdsm daemon as well.
As a workaround I've put "hwclock --hctosys" in the engine's crontab every 5 minutes, since the hardware clock doesn't seem to skew.
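A minimal sketch of that workaround, plus a chrony-based alternative (the makestep values are my assumption, check them against your chrony.conf):

  # engine crontab entry for the hwclock workaround
  */5 * * * * /usr/sbin/hwclock --hctosys

  # alternative: let chronyd step the clock at any time instead of only
  # during the first updates after boot -- add to /etc/chrony.conf on the
  # engine VM, then restart chronyd
  makestep 1.0 -1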
How to enable the storage pool correctly
by ziyi Liu
The /var/ folder is full, so I can't get into the web UI to fix it; I can only use command-line mode.
vdsm-client StorageDomain activate
vdsm-client StorageDomain attach
vdsm-client StoragePool connect
vdsm-client StoragePool connectStorageServer
I have tried these commands and they all return "message=Unknown pool id, pool not connected".
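Since the root cause is /var filling up, it may be simpler to free some space first so the engine and web UI come back, before retrying the vdsm-client calls. A generic sketch (sizes are just examples):

  du -xh --max-depth=2 /var | sort -h | tail -20   # see what is filling /var
  journalctl --vacuum-size=200M                    # shrink the systemd journal
  dnf clean all                                    # drop cached packages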
Failing vm backup
by Giulio Casella
Hi,
since yesterday the backup of one of my VMs has been failing. Backups are
performed by Storware vProtect, based on a CBT strategy.
In ovirt events I can see:
VDSM host03 command StartNbdServerVDS failed: Bitmap does not exist:
"{'reason': 'Bitmap does not exist in
/rhev/data-center/mnt/blockSD/459011cf-ebb6-46ff-831d-8ccfafd82c8a/images/a9ec8085-6ac3-4be0-bbd2-7752f7e29368/083f5dc5-9003-4908-96b1-5f750b5a4197',
'bitmap': '1b88a937-0ab8-4f6e-a117-bbd522e21448'}"
Transfer was stopped by system. Reason: failed to create a signed image
ticket.
Backups for other VMs are working correctly. What could it be? A corrupted
image, somehow?
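One way to verify whether the bitmap is really gone is to inspect the qcow2 volume directly on a host that sees that storage domain. A sketch using the paths from the event above (on a block storage domain the VG name is the domain UUID; the LV may need activating first, and may already be active on the host running the VM):

  # activate the volume LV if it is not active yet
  lvchange -ay 459011cf-ebb6-46ff-831d-8ccfafd82c8a/083f5dc5-9003-4908-96b1-5f750b5a4197

  # qemu-img prints a "bitmaps:" section for qcow2 volumes (-U because the
  # VM is running); if 1b88a937-0ab8-4f6e-a117-bbd522e21448 is not listed,
  # the CBT chain is broken and the next backup will likely have to be full
  qemu-img info -U /rhev/data-center/mnt/blockSD/459011cf-ebb6-46ff-831d-8ccfafd82c8a/images/a9ec8085-6ac3-4be0-bbd2-7752f7e29368/083f5dc5-9003-4908-96b1-5f750b5a4197

  # deactivate again if it was activated only for this check
  lvchange -an 459011cf-ebb6-46ff-831d-8ccfafd82c8a/083f5dc5-9003-4908-96b1-5f750b5a4197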
I tried to migrate the VM to other hypervisors, with no luck. I haven't
tried moving the disks (two: one for the OS, one for data) to another data
domain yet, since that's not a fast operation (about 2.8TB).
The VM is working correctly, but Murphy's law states this machine is a
fileserver with users' data and email :-(
Any hint?
TIA,
gc
--
Giulio Casella giulio at di.unimi.it
System and network architect
Computer Science Dept. - University of Milano
Fail to import VM from Export Domain
by carl langlois
Hi,
Our main storage domain has failed and we are trying to recover it. On our
system we also have an Export Domain that holds some cloned VMs. But
when trying to import the VMs from the export domain I always get a failure.
2023-03-27 08:45:40,273-04 INFO
[org.ovirt.engine.core.bll.exportimport.ImportVmCommand] (default task-31)
[30eb771d-ccd6-4484-97bd-825aaf9b2443] Lock Acquired to object
'EngineLock:{exclusiveLocks='[flexlm_clone=VM_NAME,
e36327c2-9019-4eb4-8309-4df0557961cb=VM]',
sharedLocks='[99eae7e7-29b4-4226-907a-02011ca5eec1=REMOTE_VM]'}'
2023-03-27 08:47:28,697-04 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engine-Thread-35)
[30eb771d-ccd6-4484-97bd-825aaf9b2443] EVENT_ID:
IMPORTEXPORT_IMPORT_VM_FAILED(1,153), Failed to import Vm flexlm_clone to
Data Center Default, Cluster Default
2023-03-27 08:47:28,798-04 INFO
[org.ovirt.engine.core.bll.exportimport.ImportVmCommand]
(EE-ManagedThreadFactory-engine-Thread-35)
[30eb771d-ccd6-4484-97bd-825aaf9b2443] Lock freed to object
'EngineLock:{exclusiveLocks='[flexlm_clone=VM_NAME,
e36327c2-9019-4eb4-8309-4df0557961cb=VM]',
sharedLocks='[99eae7e7-29b4-4226-907a-02011ca5eec1=REMOTE_VM]'}'
2023-03-27 08:47:39,737-04 WARN
[org.ovirt.engine.core.bll.lock.InMemoryLockManager]
(EE-ManagedThreadFactory-engineScheduled-Thread-16)
[30eb771d-ccd6-4484-97bd-825aaf9b2443] Trying to release exclusive lock
which does not exist, lock key: 'flexlm_cloneVM_NAME'
2023-03-27 08:47:39,737-04 INFO
[org.ovirt.engine.core.bll.exportimport.ImportVmCommand]
(EE-ManagedThreadFactory-engineScheduled-Thread-16)
[30eb771d-ccd6-4484-97bd-825aaf9b2443] Lock freed to object
'EngineLock:{exclusiveLocks='[flexlm_clone=VM_NAME,
e36327c2-9019-4eb4-8309-4df0557961cb=VM]',
sharedLocks='[99eae7e7-29b4-4226-907a-02011ca5eec1=REMOTE_VM]'}'
2023-03-27 08:47:40,091-04 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-16)
[30eb771d-ccd6-4484-97bd-825aaf9b2443] EVENT_ID:
IMPORTEXPORT_IMPORT_VM_FAILED(1,153), Failed to import Vm flexlm_clone to
Data Center Default, Cluster Default
I was able to import one of them, but all the others fail.
Any hints to suggest?
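The events above only record the generic failure; the actual cause is usually a few lines earlier in the engine log, or in vdsm.log on the SPM host at the same timestamp. A quick way to look (standard log locations, grep patterns are just a starting point):

  # on the engine host
  grep -B 20 'IMPORTEXPORT_IMPORT_VM_FAILED' /var/log/ovirt-engine/engine.log | grep -i 'error\|exception'

  # on the SPM host, around 08:45-08:47 of the failed import
  grep -i 'error\|exception' /var/log/vdsm/vdsm.log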
Regards
Carl
Setup oVirt self hosted engine on Rocky Linux 8 using cockpit - stuck in deadlock
by dirk.schulz@geniodata.de
Hi all,
I have set up 3 servers in 3 data centers, each having one physical interface and a vlan interface parented by it.
The connection between the 3 servers over the vlan interfaces (using private ip addresses) works (using icmp ping as the test).
Now I want to turn them into an oVirt cluster, creating the self-hosted engine on the first server. I have
- made sure the engine fqdn is in dns forward and reverse and in /etc/hosts
- made sure that both interfaces have unique dns entries which can be resolved forward and reverse
- made sure that both interfaces' fqdns are in /etc/hosts
- made sure only the primary hostname (not fqdn) is in /etc/hostname,
- made sure ipv6 is available on the physical interface,
- made sure ipv6 method is "disabled" on the vlan interface,
- set /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/hosted_engine_setup/defaults/main.yml:he_force_ip4: true to make sure no ipv6 attempts to interfere.
Now when I use cockpit's hosted engine wizard (not hyperconverged), I run into two opposing problems.
If I set the FQDN in the "Advanced" sub pane to the FQDN of the vlan interface, the wizard gets stuck at "preparing VM" with "The resolved address doesn't resolve on the selected interface\n".
If I set the FQDN in the "Advanced" sub pane to the FQDN of the physical interface, I get the same result.
If I add the physical interface's FQDN to the vlan IP address in /etc/hosts, I get "hostname 'x.y.z' doesn't uniquely match the interface 'enp5s0.4000' selected for the management bridge; it matches also interface with IP ['physical']. Please make sure that the hostname got from the interface for the management network resolves only there." So clearly, separating the two interfaces name-wise is mandatory.
I tried to follow the Ansible workflow step by step to see what it does. It seems the hostname validation is triggered twice, the second time when filling in the FQDN in the "Advanced" sub pane; it succeeds with both hostnames (physical interface and vlan IP), but that does not prevent the "prepare VM" workflow from doing the same verification and failing, as far as I can see. This is where it happens:
2023-03-20 14:31:48,354+0100 DEBUG ansible on_any args TASK: ovirt.ovirt.hosted_engine_setup : Check the resolved address resolves on the selected interface kwargs is_conditional:False
2023-03-20 14:31:48,355+0100 DEBUG ansible on_any args localhost TASK: ovirt.ovirt.hosted_engine_setup : Check the resolved address resolves on the selected interface kwargs
2023-03-20 14:31:48,481+0100 DEBUG var changed: host "localhost" var "ansible_play_hosts" type "<class 'list'>" value: "[]"
2023-03-20 14:31:48,481+0100 DEBUG var changed: host "localhost" var "ansible_play_batch" type "<class 'list'>" value: "[]"
2023-03-20 14:31:48,481+0100 DEBUG var changed: host "localhost" var "play_hosts" type "<class 'list'>" value: "[]"
2023-03-20 14:31:48,481+0100 ERROR ansible failed {
"ansible_host": "localhost",
"ansible_playbook": "/usr/share/ovirt-hosted-engine-setup/ansible/trigger_role.yml",
"ansible_result": {
"_ansible_no_log": false,
"changed": false,
"msg": "The resolved address doesn't resolve on the selected interface\n"
},
"ansible_task": "Check the resolved address resolves on the selected interface",
"ansible_type": "task",
"status": "FAILED",
"task_duration": 0
}
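A rough manual equivalent of that check, useful for debugging (the FQDN below is a placeholder, the interface name is the one from this setup):

  # resolve the FQDN that was entered in the wizard (placeholder name)
  ADDR=$(getent ahostsv4 host1-vlan.example.org | awk 'NR==1 {print $1}')
  echo "resolved to: $ADDR"

  # the resolved address must be configured on the interface selected for
  # the management bridge, otherwise the task above fails
  ip -br -4 addr show dev enp5s0.4000 | grep -Fw "$ADDR" \
    && echo "OK: address is on the selected interface" \
    || echo "FAIL: address is not on the selected interface"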
So I am really stuck there and have no idea where to go from here. I could try changing bits in the playbooks and parameters (like using "hostname -A" instead of "hostname -f" for the failing test), but that can't really be the idea. I am too new to this to be running into a bug or similar; I suspect I'm overlooking something.
Any hint or help is appreciated.
Cheers,
Dirk
Uncertain what to do with "Master storage domain" residing on obsolete storage domain.
by goestin@intert00bz.nl
Hello All,
I want to phase out a storage domain, however it is marked as "master". I have the following questions:
1. What does it mean when a storage domain is "master"?
2. What is the correct way to remove a storage domain that has the "master" status?
Any insight on the matter would be highly appreciated.
Kind regards,
Justin
Emergency Mode following Disk addition
by simon@justconnect.ie
Hi All,
I added a new local disk and created a new LVM thin pool, VG and LVs for Gluster in a 4.5.3 HCI environment.
Following a reboot the server enters emergency mode because of an inactive LV/VG. Running 'vgchange -ay vg' activates the LVs.
LVM is configured to use the devices file, but disabling this and using a filter allows the system to start normally. Unfortunately 'lsblk' shows that the disk is under multipath control.
I’ve tried to exclude this disk from multipath but all efforts have failed.
‘pvscan’ shows the device as /dev/mapper/‘wwid’ instead of /dev/sd*
I've added the wwid to the blacklists manually and restarted multipath, but it makes no change.
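A minimal sketch of the blacklist setup I would try, assuming the wwid reported by 'multipath -ll' (placeholder below) and that drop-ins under /etc/multipath/conf.d/ are left alone by vdsm:

  # /etc/multipath/conf.d/99-local.conf  (wwid is a placeholder)
  blacklist {
      wwid "<wwid-of-the-new-local-disk>"
  }

  # then reload multipathd and check that the disk is gone from the maps
  multipathd reconfigure
  multipath -ll

If the device map was already created, 'multipath -f <map-name>' (with the LVs deactivated) should remove it so LVM can use /dev/sd* directly again.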
Regards
Simon
VDI management on top of Ovirt
by samuel.xhu@horebdata.cn
Hello, Ovirt folks,
What is the most popular VDI management framework for oVirt? After googling I found adfinis/virtesk, which however is no longer active.
I would much appreciate it if anyone could point out the best open source VDI solutions for oVirt.
best regards,
Samuel
Do Right Thing (做正确的事) / Pursue Excellence (追求卓越) / Help Others Succeed (成就他人)