cancel export task - no qemu-img process on host
by goosesk blabla
Hi,
I tried to find a way to cancel a hung export task. I found that there should be some hidden option, but it is still not released and I cannot find how to enable it.
I found how to remove this task directly from the PostgreSQL database on the engine, but I am not sure whether this will also unlock the VM.
I cannot kill it via the qemu-img process, because that process is not running.
The stuck VM is Ubuntu 20.04 with 3 disks (20 + 80 + 300 GB). The export has already been running for more than 12 hours and still has not started creating the OVA file (checked in the export folder). The qemu-img process on the host is missing.
I tried to export a smaller 7 GB VM and it completed successfully in 2 minutes. During that export, the qemu-img process was running on the host.
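Rather than editing the pg database by hand, a safer option may be the unlock helper shipped with ovirt-engine (a hedged sketch; flags differ between releases, so check -h first and back up the DB; the "engine" prompt is illustrative):
[root@engine ~]# /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -h
[root@engine ~]# /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t vm -q    # assumption: query mode listing locked VMs
That should also answer whether removing the task leaves the VM locked: if it does, unlock_entity.sh is the tool intended to release it.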
Thank you for any help.
BR
jan
2 years, 2 months
Uncaught exception occurred
by nikkognt@gmail.com
Hi all,
I recently upgraded from 4.4 to 4.5; since then, whenever I create or edit a pool, an error occurs in a red pop-up:
"Uncaught exception occurred. Please try reloading the page. Details: (TypeError) : Cannot read properties of undefined (reading 'Kh') Please have your administrator check the UI logs"
and the pool is created, but with wrong settings.
I checked the ui.log file:
2022-09-21 16:33:21,382+02 ERROR [org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService] (default task-23) [] Permutation name: 872EE01DC49AC3F81586694F1584D19D
2022-09-21 16:33:21,382+02 ERROR [org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService] (default task-23) [] Uncaught exception: com.google.gwt.core.client.JavaScriptException: (TypeError) : Cannot read properties of undefined (reading 'Kh')
at org.ovirt.engine.ui.common.widget.uicommon.storage.DisksAllocationView.$addDiskList(DisksAllocationView.java:190)
at org.ovirt.engine.ui.common.widget.uicommon.storage.DisksAllocationView.$lambda$0(DisksAllocationView.java:179)
at org.ovirt.engine.ui.common.widget.uicommon.storage.DisksAllocationView$lambda$0$Type.eventRaised(DisksAllocationView.java:179)
at org.ovirt.engine.ui.uicompat.Event.$raise(Event.java:99)
at org.ovirt.engine.ui.uicommonweb.models.storage.DisksAllocationModel.$onPropertyChanged(DisksAllocationModel.java:310)
at org.ovirt.engine.ui.uicommonweb.models.storage.DisksAllocationModel.$setQuotaEnforcementType(DisksAllocationModel.java:121)
at org.ovirt.engine.ui.uicommonweb.models.vms.UnitVmModel.$compatibilityVersionChanged(UnitVmModel.java:2384)
at org.ovirt.engine.ui.uicommonweb.models.vms.UnitVmModel.eventRaised(UnitVmModel.java:2223)
at org.ovirt.engine.ui.uicompat.Event.$raise(Event.java:99)
at org.ovirt.engine.ui.uicommonweb.models.ListModel.$setSelectedItem(ListModel.java:82)
at org.ovirt.engine.ui.uicommonweb.builders.vm.CoreVmBaseToUnitBuilder.$postBuild(CoreVmBaseToUnitBuilder.java:35)
at org.ovirt.engine.ui.uicommonweb.builders.vm.CoreVmBaseToUnitBuilder.postBuild(CoreVmBaseToUnitBuilder.java:35)
at org.ovirt.engine.ui.uicommonweb.builders.CompositeBuilder$LastBuilder.build(CompositeBuilder.java:45)
at org.ovirt.engine.ui.uicommonweb.builders.BaseSyncBuilder.build(BaseSyncBuilder.java:13)
at org.ovirt.engine.ui.uicommonweb.builders.BaseSyncBuilder.build(BaseSyncBuilder.java:13)
at org.ovirt.engine.ui.uicommonweb.builders.vm.IconVmBaseToUnitBuilder.lambda$0(IconVmBaseToUnitBuilder.java:23)
at org.ovirt.engine.ui.uicommonweb.builders.vm.IconVmBaseToUnitBuilder$lambda$0$Type.onSuccess(IconVmBaseToUnitBuilder.java:23)
at org.ovirt.engine.ui.uicommonweb.models.vms.IconCache.$lambda$0(IconCache.java:57)
at org.ovirt.engine.ui.uicommonweb.models.vms.IconCache$lambda$0$Type.onSuccess(IconCache.java:57)
at org.ovirt.engine.ui.frontend.Frontend$1.$onSuccess(Frontend.java:239)
at org.ovirt.engine.ui.frontend.Frontend$1.onSuccess(Frontend.java:239)
at org.ovirt.engine.ui.frontend.communication.OperationProcessor$1.$onSuccess(OperationProcessor.java:133)
at org.ovirt.engine.ui.frontend.communication.OperationProcessor$1.onSuccess(OperationProcessor.java:133)
at org.ovirt.engine.ui.frontend.communication.GWTRPCCommunicationProvider$5$1.$onSuccess(GWTRPCCommunicationProvider.java:270)
at org.ovirt.engine.ui.frontend.communication.GWTRPCCommunicationProvider$5$1.onSuccess(GWTRPCCommunicationProvider.java:270)
at com.google.gwt.user.client.rpc.impl.RequestCallbackAdapter.onResponseReceived(RequestCallbackAdapter.java:198)
at com.google.gwt.http.client.Request.$fireOnResponseReceived(Request.java:233)
at com.google.gwt.http.client.RequestBuilder$1.onReadyStateChange(RequestBuilder.java:409)
at Unknown.eval(webadmin-0.js)
at com.google.gwt.core.client.impl.Impl.apply(Impl.java:306)
at com.google.gwt.core.client.impl.Impl.entry0(Impl.java:345)
at Unknown.eval(webadmin-0.js)
Any suggestions to resolve the problem?
Thank you
2 years, 2 months
GUI snapshot issue
by Facundo Badaracco
Hi everyone.
I have made several snapshots of my VMs, and the interface says everything is
good. But when I go to the snapshot section of any VM, there are no
snapshots. However, in the disk section, all my snapshots are there.
Any hint?
2 years, 2 months
VMs hang periodically: gluster problem?
by Diego Ercolani
Hello, I have a cluster made of 3 nodes in a "self-hosted-engine" topology.
The storage is implemented with Gluster in a 2 replica + arbiter topology.
I have two gluster volumes
glen - is the volume used by hosted-engine vm
gv0 - is the volume used by VMs
The physical disks are 4 TB SSDs used only to accommodate VMs (including the hosted engine).
I have continuous VM hangs, even of the hosted-engine VM. This causes a lot of trouble, and the hangs happen asynchronously, even while there are management operations on VMs (migration, cloning, ...).
After a while the VM is freed, but on its console the kernel complains about CPU hangs or timer hangs, and the only solution is to shut down/power off the VM. This even affects the hosted engine: hosted-engine --vm-status reports "state=EngineUpBadHealth".
This is the host log while the event occurs:
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info> [1662018538.0166] device (vnet73): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info> [1662018538.0168] device (vnet73): released from master device ovirtmgmt
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: Unable to read from monitor: Connection reset by peer
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: internal error: qemu unexpectedly closed the monitor: 2022-09-01T07:48:57.930955Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-0a1a501c-fc45-430f-bfd3-076172cec406,bootindex=1,write-cache=on,serial=0a1a501c-fc45-430f-bfd3-076172cec406,werror=stop,rerror=stop: Failed to get "write" lock Is another process using the image [/run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741]?
Sep 01 07:48:58 ovirt-node3.ovirt kvm[268578]: 5 guests now active
Sep 01 07:48:58 ovirt-node3.ovirt systemd[1]: machine-qemu\x2d67\x2dHostedEngine.scope: Succeeded.
Sep 01 07:48:58 ovirt-node3.ovirt systemd-machined[1613]: Machine qemu-67-HostedEngine terminated.
Sep 01 07:49:08 ovirt-node3.ovirt systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Sep 01 07:49:08 ovirt-node3.ovirt ovirt-ha-agent[3338]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped on localhost
Sep 01 07:49:14 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5083]: s4 delta_renew long write time 11 sec
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5033]: s3 delta_renew long write time 11 sec
Sep 01 07:49:34 ovirt-node3.ovirt libvirtd[2496]: Domain id=65 name='ocr-Brain-28-ovirt.dmz.ssis' uuid=00425fb1-c24b-4eaa-9683-534d66b2cb04 is tainted: custom-ga-command
Sep 01 07:49:47 ovirt-node3.ovirt sudo[268984]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/privsep-helper --privsep_context os_brick.privileged.default --privsep_sock_path /tmp/tmp1iolt06i/privsep.sock
This is what Gluster reports:
[root@ovirt-node3 ~]# gluster volume heal gv0 info
Brick ovirt-node2.ovirt:/brickgv0/_gv0
Status: Connected
Number of entries: 0
Brick ovirt-node3.ovirt:/brickgv0/gv0_1
Status: Connected
Number of entries: 0
Brick ovirt-node4.ovirt:/dati/_gv0
Status: Connected
Number of entries: 0
[root@ovirt-node3 ~]# gluster volume heal glen info
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0
Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0
Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0
So it seems healthy.
I don't know how to address the issue, but it is a big problem.
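One thing that might help narrow it down (a hedged sketch, not a fix): the libvirt error above says the HostedEngine image failed to get the "write" lock, so it may be worth checking on each node whether some leftover qemu process still holds that image, and what Gluster thinks its clients are doing:
[root@ovirt-node3 ~]# lsof /run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741   # use "fuser -v <path>" if lsof is not installed
[root@ovirt-node3 ~]# gluster volume status glen clients
The image path is the one from the "Failed to get 'write' lock" line above; the same check is worth repeating on node2 and node4.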
2 years, 2 months
iSCSI domain not seen by a node if the topology differs from that of the other nodes?
by Diego Ercolani
Hello, I think I've found another issue:
I have three nodes that are under heavy testing and, after the problems with Gluster, I configured them to use iSCSI (without multipath for now): via the GUI I configured a new iSCSI data domain using a single target on a single VLAN.
I suspect there is an issue in reporting the correct volume in my case.
I'll try to explain.
These are the SCSI devices on the three nodes:
[root@ovirt-node2 ~]# lsscsi
[4:0:0:0] disk ATA ST4000NM000A-2HZ TN02 /dev/sda
[5:0:0:0] disk ATA Samsung SSD 870 2B6Q /dev/sdb
[6:0:0:0] disk IBM 2145 0000 /dev/sdc
[N:0:1:1] disk Force MP600__1 /dev/nvme0n1
[root@ovirt-node3 ~]# lsscsi
[0:0:0:0] disk ATA Samsung SSD 870 2B6Q /dev/sda
[6:0:0:0] disk IBM 2145 0000 /dev/sdb
[N:0:0:1] disk WD Blue SN570 500GB__1 /dev/nvme0n1
[root@ovirt-node4 ~]# lsscsi
[3:0:0:0] disk ATA ST4000NM000A-2HZ TN02 /dev/sda
[4:0:0:0] disk ATA KINGSTON SA400S3 1103 /dev/sdb
[5:0:0:0] disk IBM 2145 0000 /dev/sdc
So, as you can see, the SCSI target (IBM 2145) is mapped as /dev/sdc on node2 and node4, but on node3 it is mapped as /dev/sdb.
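(Just to rule out the obvious: /dev/sdX letters are assigned in probe order and are expected to differ between nodes, so a quick hedged check is to compare stable identifiers instead:
[root@ovirt-node3 ~]# lsblk -o NAME,SIZE,VENDOR,WWN,SERIAL
[root@ovirt-node3 ~]# ls -l /dev/disk/by-id/ | grep -v part
If the WWN/by-id entry of the IBM 2145 LUN is the same on all three nodes, the device letter itself should not be the problem.)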
In the vdsm log on node3 I can find:
2022-09-21 15:53:57,831+0000 INFO (monitor/aac7917) [storage.storagedomaincache] Looking up domain aac79175-ab2b-4b5b-a6e4-9feef9ce17ab (sdc:171)
2022-09-21 15:53:57,899+0000 INFO (monitor/aac7917) [storage.storagedomaincache] Looking up domain aac79175-ab2b-4b5b-a6e4-9feef9ce17ab: 0.07 seconds (utils:390)
2022-09-21 15:53:57,899+0000 ERROR (monitor/aac7917) [storage.monitor] Setting up monitor for aac79175-ab2b-4b5b-a6e4-9feef9ce17ab failed (monitor:363)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 360, in _setupLoop
self._setupMonitor()
File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 382, in _setupMonitor
self._setupDomain()
File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 153, in wrapper
value = meth(self, *a, **kw)
File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 598, in _setupDomain
domain = sdCache.produce(self.sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
domain.getRealDomain()
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
return self._cache._realProduce(self._sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
domain = self._findDomain(sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
return findMethod(sdUUID)
File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('aac79175-ab2b-4b5b-a6e4-9feef9ce17ab',)
So the node is kicked out of the oVirt cluster, reporting that it is not possible to connect to the iSCSI domain.
2 years, 2 months
VM Down With "Bad Volume Specification"
by Clint Boggio
I had occasion to shut down a VM for the purpose of adding RAM and processors to it, and the VM will not boot back up. I'm seeing "VM Issabel_PBX is down with error. Exit message: Bad volume specification {'address': {'bus': '0', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'serial': '6af66318-e6f8-45d7-8b4e-2183faf0a917', 'index': 0, 'iface': 'scsi', 'apparentsize': '5706743808', 'specParams': {}, 'cache': 'none', 'imageID': '6af66318-e6f8-45d7-8b4e-2183faf0a917', 'truesize': '5950070784', 'type': 'disk', 'domainID': '24c4dc1b-c843-4ae2-963f-9d0548305192', 'reqsize': '0', 'format': 'cow', 'poolID': '31fdd642-6b06-11ea-a4c4-00163e333bd2', 'device': 'disk', 'path': '/rhev/data-center/31fdd642-6b06-11ea-a4c4-00163e333bd2/24c4dc1b-c843-4ae2-963f-9d0548305192/images/6af66318-e6f8-45d7-8b4e-2183faf0a917/576b2761-a5bc-427b-95a9-0594447f0705', 'propagateErrors': 'off', 'name': 'sda', 'bootOrder': '1', 'volumeID': '576b2761-a5bc-427b-95a9-0594447f0705', 'diskType': 'file', 'alias': 'ua-6af66318-e6f8-45d7-8b4e-2183faf0a917', 'discard': False}."
in the log. I tried to move the VM's disk from one gluster datastore to another to see if the problem would clear, and now the disk is locked and the move is stuck at 10%. In the engine logs I have "2022-09-19 12:48:25,614-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] SPMAsyncTask::ClearAsyncTask: At time of attempt to clear task '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6' the response code was 'TaskStateError' and message was 'Operation is not allowed in this task state: ("can't clean in state running",)'. Task will not be cleaned
2022-09-19 12:48:25,614-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] Task id '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6' has passed pre-polling period time and should be polled. Pre-polling period is 60000 millis.
2022-09-19 12:48:25,631-05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: TASK_CLEARING_ASYNC_TASK(9,501), Clearing asynchronous task Unknown that started at Tue Jul 12 12:19:21 CDT 2022
2022-09-19 12:48:25,631-05 INFO [org.ovirt.engine.core.bll.tasks.AsyncTaskManager] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] Cleaning zombie tasks: Clearing async task 'Unknown' that started at 'Tue Jul 12 12:19:21 CDT 2022' since it reached a timeout of 3000 minutes
2022-09-19 12:48:25,631-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] SPMAsyncTask::ClearAsyncTask: Attempting to clear task '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6'
2022-09-19 12:48:25,632-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] START, SPMClearTaskVDSCommand( SPMTaskGuidBaseVDSCommandParameters:{storagePoolId='31fdd642-6b06-11ea-a4c4-00163e333bd2', ignoreFailoverLimit='false', taskId='4b96a8e1-ab65-4d1c-97dd-e985ab7816c6'}), log id: 40d6d67f
2022-09-19 12:48:25,633-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] START, HSMClearTaskVDSCommand(HostName = hprvsr00.locacore.com, HSMTaskGuidBaseVDSCommandParameters:{hostId='6c910725-fb42-4a64-b614-2a29bf0800e2', taskId='4b96a8e1-ab65-4d1c-97dd-e985ab7816c6'}), log id: 22c28060
2022-09-19 12:48:25,638-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] FINISH, HSMClearTaskVDSCommand, return: , log id: 22c28060
2022-09-19 12:48:25,639-05 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMClearTaskVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] FINISH, SPMClearTaskVDSCommand, return: , log id: 40d6d67f
2022-09-19 12:48:25,639-05 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] SPMAsyncTask::ClearAsyncTask: At time of attempt to clear task '4b96a8e1-ab65-4d1c-97dd-e985ab7816c6' the response code was 'TaskStateError' and message was 'Operation is not allowed in this task state: ("can't clean in state running",)'. Task will not be cleaned
2022-09-19 12:48:25,876-05 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-98) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'CopyImageGroupWithData' (id: '9e73adde-e485-461a-b349-7fd814890aa6') waiting on child command id: '5b4d0a00-7e40-4016-bf5c-0db013e22983' type:'CopyImageGroupVolumesData' to complete
2022-09-19 12:48:26,878-05 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'CopyImageGroupVolumesData' (id: '5b4d0a00-7e40-4016-bf5c-0db013e22983') waiting on child command id: '1b496e55-9407-4fd7-a2f2-bb70bf4e7aa0' type:'CopyData' to complete
2022-09-19 12:48:27,886-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [149db168-34c0-4814-869e-1ca0fdbde768] START, GetHostJobsVDSCommand(HostName = hprvsr00.locacore.com, GetHostJobsVDSCommandParameters:{hostId='6c910725-fb42-4a64-b614-2a29bf0800e2', type='storage', jobIds='[06208564-b66d-4947-a96f-4d163ef2fbe0]'}), log id: 75ca6198
2022-09-19 12:48:27,894-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [149db168-34c0-4814-869e-1ca0fdbde768] FINISH, GetHostJobsVDSCommand, return: {06208564-b66d-4947-a96f-4d163ef2fbe0=HostJobInfo:{id='06208564-b66d-4947-a96f-4d163ef2fbe0', type='storage', description='copy_data', status='running', progress='null', error='null'}}, log id: 75ca6198
2022-09-19 12:48:27,902-05 INFO [org.ovirt.engine.core.bll.StorageJobCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [149db168-34c0-4814-869e-1ca0fdbde768] Command CopyData id: '1b496e55-9407-4fd7-a2f2-bb70bf4e7aa0': waiting for job '06208564-b66d-4947-a96f-4d163ef2fbe0' on host 'hprvsr00.locacore.com' (id: '6c910725-fb42-4a64-b614-2a29bf0800e2') to complete
2022-09-19 12:48:29,911-05 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-57) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'MoveOrCopyDisk' (id: '2ec06560-b552-4418-abd7-e2945cd98c12') waiting on child command id: '80a2481e-707f-49bf-b469-33cd90c1a51c' type:'MoveImageGroup' to complete
2022-09-19 12:48:29,915-05 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-57) [149db168-34c0-4814-869e-1ca0fdbde768] Command 'MoveImageGroup' (id: '80a2481e-707f-49bf-b469-33cd90c1a51c') waiting on child command id: '9e73adde-e485-461a-b349-7fd814890aa6' type:'CopyImageGroupWithData' to complete"
Any help would be appreciated as the client's PBX is currently down as a result.
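In case it helps, a hedged way to see what the SPM host itself thinks of the stuck task (verb names as in the vdsm API schema; availability may differ on your version):
[root@hprvsr00 ~]# vdsm-client Host getAllTasksStatuses
[root@hprvsr00 ~]# vdsm-client Task getStatus taskID=4b96a8e1-ab65-4d1c-97dd-e985ab7816c6
The task ID is the one the engine keeps failing to clear in the log above.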
2 years, 2 months
Re: Self-hosted-engine timeout and recovering time
by Yedidyah Bar David
On Wed, Sep 21, 2022 at 12:22 AM Marcos Sungaila
<marcos.sungaila(a)oracle.com> wrote:
>
> Hi all,
>
> I have a cluster running the 4.4.10 release with 6 KVM hosts and Self-Hosted-Engine.
What storage?
> I'm testing some network outage scenarios, and I faced strange behavior.
I suppose you have redundancy in your network.
It's important to clarify (for yourself, mainly) what exactly you
test, what's important, what's expected, etc.
> After disconnecting the KVM hosts hosting the SHE, there was a long timeout before the Self-Hosted-Engine was switched to another host as expected.
I suggest studying the ha-agent logs, /var/log/ovirt-hosted-engine-ha/agent.log.
Much of the relevant code is in ovirt_hosted_engine_ha/agent/states.py
(in the git repo, or under /usr/lib/python3.6/site-packages/ on your
machine).
> Also, it took a relatively long time to take over the HA VMs from the failing server.
That's a separate issue, about which I personally know very little.
You might want to start a separate thread about it.
I do know, though, that if you keep the storage connected, the host
might be able to keep updating VM leases on the storage. See e.g.:
https://www.ovirt.org/develop/release-management/features/storage/vm-leas...
I didn't check the admin guide, but I suppose it has some material about HA VMs.
> Is there a configuration where I can reduce the SHE timeout to make this recover process faster?
IIRC there is nothing user-configurable.
You can see most relevant constants in
ovirt_hosted_engine_ha/agent/constants.py{,.in}.
Nothing stops you from changing them, but please note that this is
somewhat risky, and I strongly suggest to do very careful testing with
your new settings. It might make sense to try to methodically go
through all the possible state changes in the above state machine.
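For example (a hedged sketch; the exact constant names differ between
releases), you can see the timing-related values your installed agent
ships with:
$ grep -iE 'timeout|interval|delay' /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/constants.py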
The general assumption is that network and storage, for critical
setups, are redundant, and that the engine itself is not considered
critical, in the sense that if it's dead, all your VMs are still
alive. And also, that it's more important to not corrupt VM disk
images (e.g. by starting the VM concurrently on two hosts) than to
keep the VM alive.
Best regards,
--
Didi
2 years, 2 months
all active domains with status unknown in old 4.3 cluster
by Jorick Astrego
Hi,
Currently I'm debugging a client's ovirt 4.3 cluster. I was adding two
new gluster domains and got a timeout "VDSM command
AttachStorageDomainVDS failed: Resource timeout: ()" and "Failed to
attach Storage Domain *** to Data Center **".
Then I had to restart ovirt-engine and now all the domains including NFS
domains have status "unknown" and I see "VDSM command
GetStoragePoolInfoVDS failed: Resource timeout: ()" in the events.
Anyone fixed this before or have any tips?
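For what it's worth, a hedged first check on the SPM host would be whether
the storage connections themselves still respond, since GetStoragePoolInfoVDS
timing out usually means vdsm is waiting on storage:
[root@spm-host ~]# grep -iE 'timeout' /var/log/vdsm/vdsm.log | tail -n 20
[root@spm-host ~]# time ls /rhev/data-center/mnt/    # a hung NFS/gluster mount will stall here
(spm-host is a placeholder for whichever host is currently the SPM.)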
Met vriendelijke groet, With kind regards,
Jorick Astrego
Netbulae Virtualization Experts
----------------
Tel: 053 20 30 270 info(a)netbulae.eu Staalsteden 4-3A KvK 08198180
Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01
----------------
2 years, 2 months
Snapshot task stuck at oVirt 4.4.8
by nicolas@devels.es
Hi,
We're running oVirt 4.4.8 and one of our users tried to create a
snapshot on a VM. The snapshot task got stuck (not sure why) and since
then a "locked" icon is being shown on the VM. We need to remove this
VM, but since it has a pending task, we're unable to.
The ovirt-engine log shows hundreds of events like:
[2022-09-20 09:23:09,286+01 INFO
[org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-27)
[2769dad5-3ec3-4c46-90a2-924746ea8d97] Command 'CreateSnapshotForVm'
(id: '4fcb6ab7-2cd7-4a0c-be97-f6979be25bb9') waiting on child command
id: 'cbb7a2c0-2111-4958-a55d-d48bf2d8591b'
type:'CreateLiveSnapshotForVm' to complete
An ovirt-engine restart didn't make any difference.
Is there a way to remove this task manually, even changing something in
the DB?
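(For reference: rather than hand-editing the DB, the engine ships dbutils helpers for this; a hedged sketch, since the flags vary by release and a DB backup first is advisable:
[root@engine ~]# /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -h    # lists/removes stale async tasks and commands
followed by unlock_entity.sh from the same directory to clear the VM's "locked" status.)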
Thanks.
2 years, 2 months
oVirt Engine VM On Rocky Linux
by Matthew J Black
Hi Everybody (Hi Dr. Nick),
Has anyone attempted to migrate the oVirt Engine VM over to Rocky Linux (v8.6), and if so, any "gotchas" we need to know about?
Cheers
Dulux-Oz
2 years, 2 months