iSCSI domain not seen by a node if the topology differs from the topology of the other nodes?
by Diego Ercolani
Hello, I think I've found another issue:
I have three nodes under heavy test and, after having problems with Gluster, I configured them to use iSCSI (without multipath for now), so I configured via the GUI a new iSCSI data domain using a single target under a single VLAN.
I suspect there is an issue reporting the correct volume in my case. Let me try to explain.
These are the SCSI devices on the three nodes:
[root@ovirt-node2 ~]# lsscsi
[4:0:0:0] disk ATA ST4000NM000A-2HZ TN02 /dev/sda
[5:0:0:0] disk ATA Samsung SSD 870 2B6Q /dev/sdb
[6:0:0:0] disk IBM 2145 0000 /dev/sdc
[N:0:1:1] disk Force MP600__1 /dev/nvme0n1
[root@ovirt-node3 ~]# lsscsi
[0:0:0:0] disk ATA Samsung SSD 870 2B6Q /dev/sda
[6:0:0:0] disk IBM 2145 0000 /dev/sdb
[N:0:0:1] disk WD Blue SN570 500GB__1 /dev/nvme0n1
[root@ovirt-node4 ~]# lsscsi
[3:0:0:0] disk ATA ST4000NM000A-2HZ TN02 /dev/sda
[4:0:0:0] disk ATA KINGSTON SA400S3 1103 /dev/sdb
[5:0:0:0] disk IBM 2145 0000 /dev/sdc
So you see, the SCSI target (IBM 2145) is mapped as /dev/sdc on node2 and node4, but on node3 it is mapped as /dev/sdb.
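Note that /dev/sdX letters are assigned by probe order and are not expected to be stable across hosts (or even across reboots); anything that needs to identify the LUN should use a stable identifier such as /dev/disk/by-id/ or the multipath WWID. A quick sketch (just parsing the lsscsi output above, not vdsm code) of why the letter alone can't be compared across nodes:

```python
# Sketch: parse lsscsi-style lines and show that the same target
# (IBM 2145) lands on a different device letter per node.
import re

LSSCSI = {
    "ovirt-node2": "[6:0:0:0] disk IBM 2145 0000 /dev/sdc",
    "ovirt-node3": "[6:0:0:0] disk IBM 2145 0000 /dev/sdb",
    "ovirt-node4": "[5:0:0:0] disk IBM 2145 0000 /dev/sdc",
}

def device_for(line: str) -> str:
    """Return the /dev/... node at the end of an lsscsi line."""
    m = re.search(r"(/dev/\S+)$", line)
    if m is None:
        raise ValueError("no device node in line: " + line)
    return m.group(1)

devices = {node: device_for(line) for node, line in LSSCSI.items()}
print(devices)
# The letters differ across nodes, so any logic keyed on /dev/sdX
# breaks; stable IDs (/dev/disk/by-id/, multipath WWID) do not.
```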
In the vdsm log of node3 I can find:
2022-09-21 15:53:57,831+0000 INFO (monitor/aac7917) [storage.storagedomaincache] Looking up domain aac79175-ab2b-4b5b-a6e4-9feef9ce17ab (sdc:171)
2022-09-21 15:53:57,899+0000 INFO (monitor/aac7917) [storage.storagedomaincache] Looking up domain aac79175-ab2b-4b5b-a6e4-9feef9ce17ab: 0.07 seconds (utils:390)
2022-09-21 15:53:57,899+0000 ERROR (monitor/aac7917) [storage.monitor] Setting up monitor for aac79175-ab2b-4b5b-a6e4-9feef9ce17ab failed (monitor:363)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 360, in _setupLoop
self._setupMonitor()
File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 382, in _setupMonitor
self._setupDomain()
File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 153, in wrapper
value = meth(self, *a, **kw)
File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 598, in _setupDomain
domain = sdCache.produce(self.sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
domain.getRealDomain()
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
return self._cache._realProduce(self._sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
domain = self._findDomain(sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
return findMethod(sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('aac79175-ab2b-4b5b-a6e4-9feef9ce17ab',)
So the node is kicked out of the oVirt cluster, reporting that it's not possible to connect to the iSCSI domain...
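For what it's worth, the traceback above has the shape of a cache lookup that found no backend reporting the UUID; the device letter itself shouldn't matter, the question is whether the iSCSI session/LUN is visible on that host at all. A toy sketch of that lookup pattern (simplified, not the real vdsm sdc.py):

```python
# Toy model of the lookup pattern in the traceback above
# (illustrative only; not vdsm's actual implementation).
class StorageDomainDoesNotExist(Exception):
    pass

class DomainCache:
    def __init__(self, known_domains):
        # known_domains: UUIDs of domains this host can currently see
        self._known = set(known_domains)

    def produce(self, sd_uuid):
        if sd_uuid not in self._known:
            # vdsm raises here when no storage backend reports the
            # UUID, e.g. the iSCSI LUN isn't visible on this host
            raise StorageDomainDoesNotExist(sd_uuid)
        return sd_uuid

cache = DomainCache(known_domains=set())  # LUN not visible here
try:
    cache.produce("aac79175-ab2b-4b5b-a6e4-9feef9ce17ab")
except StorageDomainDoesNotExist as e:
    print("lookup failed:", e)
```

So the thing to check on node3 is iscsiadm session/LUN visibility, not which letter the kernel picked.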
1 year, 8 months
VM Down With "Bad Volume Specification"
by Clint Boggio
I had occasion to shut down a VM for the purpose of adding RAM and processor to it, and the VM will not boot back up. I'm seeing "VM Issabel_PBX is down with error. Exit message: Bad volume specification {'address': {'bus': '0', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'serial': '6af66318-e6f8-45d7-8b4e-2183faf0a917', 'index': 0, 'iface': 'scsi', 'apparentsize': '5706743808', 'specParams': {}, 'cache': 'none', 'imageID': '6af66318-e6f8-45d7-8b4e-2183faf0a917', 'truesize': '5950070784', 'type': 'disk', 'domainID': '24c4dc1b-c843-4ae2-963f-9d0548305192', 'reqsize': '0', 'format': 'cow', 'poolID': '31fdd642-6b06-11ea-a4c4-00163e333bd2', 'device': 'disk', 'path': '/rhev/data-center/31fdd642-6b06-11ea-a4c4-00163e333bd2/24c4dc1b-c843-4ae2-963f-9d0548305192/images/6af66318-e6f8-45d7-8b4e-2183faf0a917/576b2761-a5bc-427b-95a9-0594447f0705', 'propagateErrors': 'off', 'name': 'sda', 'bootOrder': '1', 'volumeID': '576b2761-a5bc-427b-95a9-0594447f0705', 'diskType': 'file', 'alias': 'ua-6af66318-e6f8-45d7-8b4e-2183faf0a917', 'discard': False}."
in the log. I tried to move the VM's disk from one Gluster datastore to another to see if the problem would clear, and now the disk is locked and the move is stuck at 10%. In the engine logs I have "2022-09-19 12:48:25,614-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] SPMAsyncTask::ClearAsyncTask: At time of attempt to clear task '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6' the response code was 'TaskStateError' and message was 'Operation is not allowed in this task state: ("can't clean in state running",)'. Task will not be cleaned
2022-09-19 12:48:25,614-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] Task id '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6' has passed pre-polling period time and should be polled. Pre-polling period is 60000 millis.
2022-09-19 12:48:25,631-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: TASK_CLEARING_ASYNC_TASK(9,501), Clearing asynchronous task Unknown that started at Tue Jul 12 12:19:21 CDT 2022
2022-09-19 12:48:25,631-05 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] Cleaning zombie tasks: Clearing async task 'Unknown' that started at 'Tue Jul 12 12:19:21 CDT 2022' since it reached a timeout of 3000 minutes
2022-09-19 12:48:25,631-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] SPMAsyncTask::ClearAsyncTask: Attempting to clear task '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6'
2022-09-19 12:48:25,632-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] START, SPMClearTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='31fdd642-6b06-11ea-a4c4-00163e333bd2', ignoreFailoverLimit='false', taskId='4b96a8e1-ab65-4d1c-97dd-e985ab7816c6'}), log id: 40d6d67f
2022-09-19 12:48:25,633-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] START, HSMClearTaskVDSCommand(HostName = hprvsr00.locacore.com, HSMTaskGuidBaseVDSCommandParameters:{hostId='6c910725-fb42-4a64-b614-2a29bf0800e2', taskId='4b96a8e1-ab65-4d1c-97dd-e985ab7816c6'}), log id: 22c28060
2022-09-19 12:48:25,638-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] FINISH, HSMClearTaskVDSCommand, return: , log id: 22c28060
2022-09-19 12:48:25,639-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] FINISH, SPMClearTaskVDSCommand, return: , log id: 40d6d67f
2022-09-19 12:48:25,639-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] SPMAsyncTask::ClearAsyncTask: At time of attempt to clear task '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6' the response code was 'TaskStateError' and message was 'Operation is not allowed in this task state: ("can't clean in state running",)'. Task will not be cleaned
2022-09-19 12:48:25,876-05 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-98) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'CopyImageGroupWithData' (id: '9e73adde-e485-461a-b349-7fd814890aa6') waiting on child command id: '5b4d0a00-7e40-4016-bf5c-0db013e22983' type:'CopyImageGroupVolumesData' to complete
2022-09-19 12:48:26,878-05 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'CopyImageGroupVolumesData' (id: '5b4d0a00-7e40-4016-bf5c-0db013e22983') waiting on child command id: '1b496e55-9407-4fd7-a2f2-bb70bf4e7aa0' type:'CopyData' to complete
2022-09-19 12:48:27,886-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [149db168-34c0-4814-869e-1ca0fdbde768] START, GetHostJobsVDSCommand(HostName = hprvsr00.locacore.com, GetHostJobsVDSCommandParameters:{hostId='6c910725-fb42-4a64-b614-2a29bf0800e2', type='storage', jobIds='[06208564-b66d-4947-a96f-4d163ef2fbe0]'}), log id: 75ca6198
2022-09-19 12:48:27,894-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [149db168-34c0-4814-869e-1ca0fdbde768] FINISH, GetHostJobsVDSCommand, return: {06208564-b66d-4947-a96f-4d163ef2fbe0=HostJobInfo:{id='06208564-b66d-4947-a96f-4d163ef2fbe0', type='storage', description='copy_data', status='running', progress='null', error='null'}}, log id: 75ca6198
2022-09-19 12:48:27,902-05 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [149db168-34c0-4814-869e-1ca0fdbde768] Command CopyData id: '1b496e55-9407-4fd7-a2f2-bb70bf4e7aa0': waiting for job '06208564-b66d-4947-a96f-4d163ef2fbe0' on host 'hprvsr00.locacore.com' (id: '6c910725-fb42-4a64-b614-2a29bf0800e2') to complete
2022-09-19 12:48:29,911-05 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-57) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'MoveOrCopyDisk' (id: '2ec06560-b552-4418-abd7-e2945cd98c12') waiting on child command id: '80a2481e-707f-49bf-b469-33cd90c1a51c' type:'MoveImageGroup' to complete
2022-09-19 12:48:29,915-05 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-57) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'MoveImageGroup' (id: '80a2481e-707f-49bf-b469-33cd90c1a51c') waiting on child command id: '9e73adde-e485-461a-b349-7fd814890aa6' type:'CopyImageGroupWithData' to complete"
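For reference, the repeating "TaskStateError ... can't clean in state running" lines are the engine's zombie-task cleaner looping: the task is past its 3000-minute zombie timeout so the engine tries to clear it, but the SPM refuses to clear a task that is still running. Roughly (illustrative Python, not the engine's actual Java code):

```python
# Sketch of the zombie-task decision visible in the log above:
# a task older than the zombie timeout is a candidate for clearing,
# but clearing is refused while the task is still 'running'.
from datetime import datetime, timedelta

ZOMBIE_TIMEOUT = timedelta(minutes=3000)  # value quoted in the log

def should_clear(started_at, now, state):
    is_zombie = now - started_at >= ZOMBIE_TIMEOUT
    can_clear = state != "running"  # "can't clean in state running"
    return is_zombie and can_clear

started = datetime(2022, 7, 12, 12, 19, 21)  # task start from the log
now = datetime(2022, 9, 19, 12, 48, 25)
print(should_clear(started, now, "running"))   # False -> the loop seen above
print(should_clear(started, now, "finished"))  # True
```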
Any help would be appreciated as the client's PBX is currently down as a result.
1 year, 8 months
Re: Self-hosted-engine timeout and recovering time
by Yedidyah Bar David
On Wed, Sep 21, 2022 at 12:22 AM Marcos Sungaila
<marcos.sungaila(a)oracle.com> wrote:
>
> Hi all,
>
> I have a cluster running the 4.4.10 release with 6 KVM hosts and Self-Hosted-Engine.
What storage?
> I'm testing some network outage scenarios, and I faced strange behavior.
I suppose you have redundancy in your network.
It's important to clarify (for yourself, mainly) what exactly you
test, what's important, what's expected, etc.
> After disconnecting the KVM hosts hosting the SHE, there was a long timeout before the Self-Hosted-Engine was switched to another host as expected.
I suggest studying the ha-agent logs, /var/log/ovirt-hosted-engine-ha/agent.log.
Much of the relevant code is in ovirt_hosted_engine_ha/agent/states.py
(in the git repo, or under /usr/lib/python3.6/site-packages/ on your
machine).
> Also, it took a relatively long time to take over the HA VMs from the failing server.
That's a separate issue, about which I personally know very little.
You might want to start a separate thread about it.
I do know, though, that if you keep the storage connected, the host
might be able to keep updating VM leases on the storage. See e.g.:
https://www.ovirt.org/develop/release-management/features/storage/vm-leas...
I didn't check the admin guide, but I suppose it has some material about HA VMs.
> Is there a configuration where I can reduce the SHE timeout to make this recovery process faster?
IIRC there is nothing user-configurable.
You can see most relevant constants in
ovirt_hosted_engine_ha/agent/constants.py{,.in}.
Nothing stops you from changing them, but please note that this is
somewhat risky, and I strongly suggest doing very careful testing with
your new settings. It might make sense to try to methodically go
through all the possible state changes in the above state machine.
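To make the trade-off concrete: whatever the real constant names are, shortening the monitoring interval or reducing the retry count speeds up failover but also shrinks the tolerance for transient network blips. A made-up sketch (constant names and values here are assumptions, not the ones from constants.py):

```python
# Hypothetical sketch of why lowering agent timeouts is risky:
# detection delay and blip tolerance move together.
MONITOR_INTERVAL = 10   # seconds between liveness checks (assumed)
BAD_HEALTH_RETRIES = 6  # consecutive failures before failover (assumed)

def worst_case_detection(interval, retries):
    """Longest time a dead engine can go unnoticed."""
    return interval * (retries + 1)

def blip_tolerance(interval, retries):
    """Longest transient outage that does NOT trigger a failover."""
    return interval * retries

print(worst_case_detection(MONITOR_INTERVAL, BAD_HEALTH_RETRIES))  # 70
print(blip_tolerance(MONITOR_INTERVAL, BAD_HEALTH_RETRIES))        # 60
```

Cutting either number makes the first figure smaller, but every second you shave off the second figure is a network hiccup that now triggers a failover.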
The general assumption is that network and storage, for critical
setups, are redundant, and that the engine itself is not considered
critical, in the sense that if it's dead, all your VMs are still
alive. And also, that it's more important to not corrupt VM disk
images (e.g. by starting the VM concurrently on two hosts) than to
keep the VM alive.
Best regards,
--
Didi
1 year, 8 months
all active domains with status unknown in old 4.3 cluster
by Jorick Astrego
Hi,
Currently I'm debugging a client's ovirt 4.3 cluster. I was adding two
new gluster domains and got a timeout "VDSM command
AttachStorageDomainVDS failed: Resource timeout: ()" and "Failed to
attach Storage Domain *** to Data Center **".
Then I had to restart ovirt-engine and now all the domains including NFS
domains have status "unknown" and I see "VDSM command
GetStoragePoolInfoVDS failed: Resource timeout: ()" in the events.
Anyone fixed this before or have any tips?
Met vriendelijke groet, With kind regards,
Jorick Astrego
Netbulae Virtualization Experts
----------------
Tel: 053 20 30 270 info(a)netbulae.eu Staalsteden 4-3A KvK 08198180
Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01
----------------
1 year, 8 months
Snapshot task stuck at oVirt 4.4.8
by nicolas@devels.es
Hi,
We're running oVirt 4.4.8 and one of our users tried to create a
snapshot on a VM. The snapshot task got stuck (not sure why) and since
then a "locked" icon is being shown on the VM. We need to remove this
VM, but since it has a pending task, we're unable to.
The ovirt-engine log shows hundreds of events like:
[2022-09-20 09:23:09,286+01 INFO
[org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-27)
[2769dad5-3ec3-4c46-90a2-924746ea8d97] Command 'CreateSnapshotForVm'
(id: '4fcb6ab7-2cd7-4a0c-be97-f6979be25bb9') waiting on child command
id: 'cbb7a2c0-2111-4958-a55d-d48bf2d8591b'
type:'CreateLiveSnapshotForVm' to complete
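Before touching the DB, it can help to map out the parent/child command chain from those log lines, so you know which command in the chain is actually stuck. A rough sketch (hypothetical helper, assuming the log format shown above):

```python
# Sketch: recover the parent -> child command chain from engine.log
# "waiting on child command" lines. Assumes the log format above.
import re

PATTERN = re.compile(
    r"Command '(?P<parent>\w+)' \(id: '(?P<pid>[0-9a-f-]+)'\) "
    r"waiting on child command\s+id: '(?P<cid>[0-9a-f-]+)'\s+"
    r"type:'(?P<child>\w+)'"
)

def command_chain(log_lines):
    """Return {parent_id: (parent_type, child_id, child_type)}."""
    chain = {}
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            chain[m.group("pid")] = (
                m.group("parent"), m.group("cid"), m.group("child"))
    return chain

log = [
    "Command 'CreateSnapshotForVm' "
    "(id: '4fcb6ab7-2cd7-4a0c-be97-f6979be25bb9') "
    "waiting on child command id: "
    "'cbb7a2c0-2111-4958-a55d-d48bf2d8591b' "
    "type:'CreateLiveSnapshotForVm' to complete",
]
print(command_chain(log))
```

The last command in the chain with no child of its own is the one to investigate (and, if it maps to a vdsm job, to check on the host).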
An ovirt-engine restart didn't make any difference.
Is there a way to remove this task manually, even changing something in
the DB?
Thanks.
1 year, 8 months
oVirt Engine VM On Rocky Linux
by Matthew J Black
Hi Everybody (Hi Dr. Nick),
Has anyone attempted to migrate the oVirt Engine VM over to Rocky Linux (v8.6), and if so, any "gotchas" we need to know about?
Cheers
Dulux-Oz
1 year, 8 months
oVirt & (Ceph) iSCSI
by Matthew J Black
Hi Everybody (Hi Dr. Nick),
So, next question in my ongoing saga: *somewhere* in the documentation I read that when using oVirt with multiple iSCSI paths (in my case, multiple Ceph iSCSI Gateways) we need to set up DM Multipath.
My question is: Is this still relevant information when using oVirt v4.5.2?
Relevant link referred to by the oVirt Documentation:
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/...
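As far as I know this is still relevant in oVirt 4.5.x: with multiple gateways the hosts see the same LUN via several paths, and dm-multipath is what collapses them into one device. The Ceph iSCSI gateway docs ship a recommended device stanza for multipath.conf; the values below are recalled from those docs, so treat them as assumptions and verify against the current documentation before use:

```
# Example only -- verify against the current Ceph iSCSI gateway docs.
devices {
    device {
        vendor                 "LIO-ORG"
        product                "TCMU device"
        hardware_handler       "1 alua"
        path_grouping_policy   "failover"
        path_selector          "queue-length 0"
        path_checker           "tur"
        prio                   "alua"
        prio_args              "exclusive_pref_bit"
        failback               60
        fast_io_fail_tmo       25
        no_path_retry          "queue"
    }
}
```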
Cheers
Dulux-Oz
1 year, 8 months
Self-hosted-engine timeout and recovering time
by Marcos Sungaila
Hi all,
I have a cluster running the 4.4.10 release with 6 KVM hosts and Self-Hosted-Engine.
I'm testing some network outage scenarios, and I faced strange behavior.
After disconnecting the KVM hosts hosting the SHE, there was a long timeout before the Self-Hosted-Engine was switched to another host as expected.
Also, it took a relatively long time to take over the HA VMs from the failing server.
Is there a configuration where I can reduce the SHE timeout to make this recovery process faster?
Regards,
Marcos Sungaila
1 year, 8 months
How do I migrate a running VM off unassigned host?
by David White
Ok, now that I'm able to (re)deploy oVirt to new hosts, I need to migrate VMs that are running on hosts that are currently in an "unassigned" state in the cluster.
This is the result of having moved the oVirt engine OUT of a hyperconverged environment onto its own stand-alone system, while simultaneously upgrading oVirt from v4.4 to the latest v4.5.
See the following email threads:
- https://lists.ovirt.org/archives/list/users@ovirt.org/thread/TZAUCM3GB5ER...
- https://lists.ovirt.org/archives/list/users@ovirt.org/thread/3IWXZ7VXM6CY...
The oVirt engine knows about the VMs, and oVirt knows about the storage that those VMs are on. But the engine sees 2 of my hosts as "unassigned", and I've been unable to migrate the disks to new storage, nor live migrate a VM from an unassigned host, nor make a clone of an existing VM.
Is there a way to recover from this scenario? I was thinking something along the lines of manually shutting down the VM on the unassigned host, and then somehow force the engine to bring the VM online again from a healthy host?
Thanks,
David
Sent with Proton Mail secure email.
1 year, 8 months
long time running backup (hung in image finalizing state)
by Jirka Simon
Hello there.
We have an issue with backups on our cluster: one backup started 2 days ago and it is still in the finalizing state.
select * from vm_backups;
backup_id          | b9c458e6-64e2-41c2-93b8-96761e71f82b
from_checkpoint_id |
to_checkpoint_id   | 7a558f2a-57b6-432f-b5dd-85f5fb9dac8e
vm_id              | c3b2199f-35cc-41dc-8787-835e945217d2
phase              | Ready
_create_date       | 2022-09-17 00:44:56.877+02
host_id            |
description        |
_update_date       | 2022-09-17 00:45:19.057+02
backup_type        | hybrid
snapshot_id        | 0c6ebd56-dcfe-46a8-91cc-327cc94e9773
is_stopped         | f
(1 row)
And if I check the image_transfers table, I see bytes_sent = bytes_total.
engine=# select it.disk_id,bd.disk_alias,it.last_updated, it.bytes_sent,
it.bytes_total from image_transfers as it , base_disks as bd where
it.disk_id = bd.disk_id;
disk_id      | 950279ef-485c-400e-ba66-a3f545618de5
disk_alias   | log1.util.prod.hq.sldev.cz_log1.util.prod.hq.sldev.cz
last_updated | 2022-09-17 01:43:09.229+02
bytes_sent   | 214748364800
bytes_total  | 214748364800
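Since this keeps recurring, it may be worth watching for the signature automatically: bytes_sent equal to bytes_total together with a stale last_updated is exactly the "stuck in finalizing" state. A small sketch (hypothetical helper, not an oVirt tool; the staleness threshold is an assumption):

```python
# Sketch: flag image transfers that finished copying (bytes_sent ==
# bytes_total) but haven't been updated for a long time -- the
# "stuck in finalizing" signature described above.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=6)  # arbitrary threshold, tune to taste

def is_stuck(bytes_sent, bytes_total, last_updated, now):
    finished_copy = bytes_total > 0 and bytes_sent >= bytes_total
    stale = now - last_updated >= STALE_AFTER
    return finished_copy and stale

last = datetime(2022, 9, 17, 1, 43, 9, tzinfo=timezone.utc)
now = datetime(2022, 9, 19, 0, 0, 0, tzinfo=timezone.utc)
print(is_stuck(214748364800, 214748364800, last, now))  # True
```

The same query shown above, run periodically and fed through a check like this, would at least alert before the next backup wedges.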
There is no error in the logs.
If I use /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t all -qc, there is no record in any part.
I can clean these records from the DB to fix it, but it will happen again in a few days.
vdsm.x86_64 4.50.2.2-1.el8
ovirt-engine.noarch 4.5.2.4-1.el8
Is there anything I can check to find the reason for this?
Thank you Jirka
1 year, 8 months