Re: Self-hosted-engine timeout and recovery time
by Yedidyah Bar David
On Wed, Sep 21, 2022 at 12:22 AM Marcos Sungaila
<marcos.sungaila(a)oracle.com> wrote:
>
> Hi all,
>
> I have a cluster running the 4.4.10 release with 6 KVM hosts and Self-Hosted-Engine.
What storage?
> I'm testing some network outage scenarios, and I faced strange behavior.
I suppose you have redundancy in your network.
It's important to clarify (for yourself, mainly) what exactly you
test, what's important, what's expected, etc.
> After disconnecting the KVM host hosting the SHE, there was a long timeout before the Self-Hosted-Engine switched over to another host as expected.
I suggest studying the ha-agent logs, /var/log/ovirt-hosted-engine-ha/agent.log.
Much of the relevant code is in ovirt_hosted_engine_ha/agent/states.py
(in the git repo, or under /usr/lib/python3.6/site-packages/ on your
machine).
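For example, something along these lines (off the top of my head, untested)
should show the agent's view of the state transitions on a host while you run
the test:

  # overall HA state as the agents see it
  hosted-engine --vm-status
  # follow the agent's state machine transitions live
  tail -f /var/log/ovirt-hosted-engine-ha/agent.log | grep -i state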
> Also, it took a relatively long time to take over the HA VMs from the failing server.
That's a separate issue, about which I personally know very little.
You might want to start a separate thread about it.
I do know, though, that if you keep the storage connected, the host
might be able to keep updating VM leases on the storage. See e.g.:
https://www.ovirt.org/develop/release-management/features/storage/vm-leas...
I didn't check the admin guide, but I suppose it has some material about HA VMs.
> Is there a configuration where I can reduce the SHE timeout to make this recovery process faster?
IIRC there is nothing user-configurable.
You can see most relevant constants in
ovirt_hosted_engine_ha/agent/constants.py{,.in}.
Nothing stops you from changing them, but please note that this is
somewhat risky, and I strongly suggest doing very careful testing with
your new settings. It might make sense to methodically go through all
the possible state changes in the state machine mentioned above.
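For example, something like this (just a sketch; the exact constant names
differ between versions, so check the file itself) lists the timeout-related
values on a host:

  grep -nE 'TIMEOUT|INTERVAL|RETRY|DELAY' \
      /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/constants.py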
The general assumption is that, for critical setups, the network and
storage are redundant, and that the engine itself is not considered
critical, in the sense that if it's dead, all your VMs are still alive.
It is also assumed that it's more important not to corrupt VM disk
images (e.g. by starting the same VM concurrently on two hosts) than to
keep the VM alive.
Best regards,
--
Didi
all active domains with status unknown in old 4.3 cluster
by Jorick Astrego
Hi,
Currently I'm debugging a client's ovirt 4.3 cluster. I was adding two
new gluster domains and got a timeout "VDSM command
AttachStorageDomainVDS failed: Resource timeout: ()" and "Failed to
attach Storage Domain *** to Data Center **".
Then I had to restart ovirt-engine, and now all the domains, including the
NFS domains, have status "unknown", and I see "VDSM command
GetStoragePoolInfoVDS failed: Resource timeout: ()" in the events.
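So far I've only looked at the engine events; I guess the next step is to grep
the engine log and the SPM host's vdsm log, something along these lines
(standard 4.3 log locations, adjust as needed):

  # on the engine
  grep -iE 'GetStoragePoolInfoVDS|AttachStorageDomainVDS' /var/log/ovirt-engine/engine.log | tail -n 20
  # on the SPM host
  grep -i 'storagepool' /var/log/vdsm/vdsm.log | tail -n 20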
Anyone fixed this before or have any tips?
Met vriendelijke groet, With kind regards,
Jorick Astrego
Netbulae Virtualization Experts
----------------
Tel: 053 20 30 270 info(a)netbulae.eu Staalsteden 4-3A KvK 08198180
Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01
----------------
Snapshot task stuck at oVirt 4.4.8
by nicolas@devels.es
Hi,
We're running oVirt 4.4.8 and one of our users tried to create a
snapshot on a VM. The snapshot task got stuck (not sure why) and since
then a "locked" icon is being shown on the VM. We need to remove this
VM, but since it has a pending task, we're unable to do so.
The ovirt-engine log shows hundreds of events like:
[2022-09-20 09:23:09,286+01 INFO
[org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-27)
[2769dad5-3ec3-4c46-90a2-924746ea8d97] Command 'CreateSnapshotForVm'
(id: '4fcb6ab7-2cd7-4a0c-be97-f6979be25bb9') waiting on child command
id: 'cbb7a2c0-2111-4958-a55d-d48bf2d8591b'
type:'CreateLiveSnapshotForVm' to complete
An ovirt-engine restart didn't make any difference.
Is there a way to remove this task manually, even changing something in
the DB?
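For reference, this is roughly the kind of thing I had in mind; I haven't run
anything yet, and the table/column names below are guesses from memory, so
they may well be off:

  # query-only pass of the dbutils lock checker
  /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t all -q
  # read-only look at the stuck command in the engine DB
  /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c \
      "select command_id, command_type, status from command_entities
       where command_id = '4fcb6ab7-2cd7-4a0c-be97-f6979be25bb9';"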
Thanks.
oVirt Engine VM On Rocky Linux
by Matthew J Black
Hi Everybody (Hi Dr. Nick),
Has anyone attempted to migrate the oVirt Engine VM over to Rocky Linux (v8.6), and if so, any "gotchas" we need to know about?
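For context, my plan would be the usual engine-backup / restore approach,
roughly along these lines (a sketch, not a tested procedure):

  # on the current engine
  engine-backup --mode=backup --scope=all --file=engine.backup --log=backup.log
  # on the new Rocky 8.6 VM, after installing the ovirt-engine packages
  engine-backup --mode=restore --file=engine.backup --log=restore.log \
      --provision-all-databases --restore-permissions
  engine-setup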
Cheers
Dulux-Oz
oVirt & (Ceph) iSCSI
by Matthew J Black
Hi Everybody (Hi Dr. Nick),
So, next question in my on-going saga: *somewhere* in the documentation I read that when using oVirt with multiple iSCSI paths (in my case, multiple Ceph iSCSI Gateways) we need to set up DM Multipath.
My question is: Is this still relevant information when using oVirt v4.5.2?
Relevant link referred to by the oVirt Documentation:
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/...
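For what it's worth, this is how I was planning to sanity-check the paths on a
host once the gateways are attached (my own guess at the right checks, not
something from the docs):

  # is multipathd running on the host?
  systemctl status multipathd
  # do the Ceph iSCSI LUNs show up with multiple paths?
  multipath -ll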
Cheers
Dulux-Oz
Self-hosted-engine timeout and recovery time
by Marcos Sungaila
Hi all,
I have a cluster running the 4.4.10 release with 6 KVM hosts and Self-Hosted-Engine.
I'm testing some network outage scenarios, and I faced strange behavior.
After disconnecting the KVM host hosting the SHE, there was a long timeout before the Self-Hosted-Engine switched over to another host as expected.
It also took a relatively long time to take over the HA VMs from the failing server.
Is there a configuration where I can reduce the SHE timeout to make this recovery process faster?
Regards,
Marcos Sungaila
How do I migrate a running VM off unassigned host?
by David White
OK, now that I'm able to (re)deploy oVirt to new hosts, I need to migrate VMs that are running on hosts that are currently in an "unassigned" state in the cluster.
This is the result of having moved the oVirt engine OUT of a hyperconverged environment onto its own stand-alone system, while simultaneously upgrading oVirt from v4.4 to the latest v4.5.
See the following email threads:
- https://lists.ovirt.org/archives/list/users@ovirt.org/thread/TZAUCM3GB5ER...
- https://lists.ovirt.org/archives/list/users@ovirt.org/thread/3IWXZ7VXM6CY...
The oVirt engine knows about the VMs, and oVirt knows about the storage those VMs are on. But the engine sees 2 of my hosts as "unassigned", and I've been unable to migrate the disks to new storage, live migrate a VM off an unassigned host, or clone an existing VM.
Is there a way to recover from this scenario? I was thinking of something along the lines of manually shutting down the VM on the unassigned host, and then somehow forcing the engine to bring the VM online again on a healthy host.
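In case it's relevant, I assume I can still confirm what is actually running on
those hosts from their own shells, e.g. with the read-only libvirt socket:

  virsh -r list

but I'd rather not power anything off until I know the engine can restart the
VMs elsewhere.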
Thanks,
David
Sent with Proton Mail secure email.
long-running backup (hung in image finalizing state)
by Jirka Simon
Hello there.
We have an issue with backups on our cluster: one backup started 2 days ago
and is still in the finalizing state.
select * from vm_backups;
 backup_id          | b9c458e6-64e2-41c2-93b8-96761e71f82b
 from_checkpoint_id |
 to_checkpoint_id   | 7a558f2a-57b6-432f-b5dd-85f5fb9dac8e
 vm_id              | c3b2199f-35cc-41dc-8787-835e945217d2
 phase              | Ready
 _create_date       | 2022-09-17 00:44:56.877+02
 host_id            |
 description        |
 _update_date       | 2022-09-17 00:45:19.057+02
 backup_type        | hybrid
 snapshot_id        | 0c6ebd56-dcfe-46a8-91cc-327cc94e9773
 is_stopped         | f
(1 row)
And if I check the image_transfers table, I see bytes_sent = bytes_total:
engine=# select it.disk_id, bd.disk_alias, it.last_updated, it.bytes_sent, it.bytes_total
         from image_transfers as it, base_disks as bd
         where it.disk_id = bd.disk_id;

 disk_id      | 950279ef-485c-400e-ba66-a3f545618de5
 disk_alias   | log1.util.prod.hq.sldev.cz_log1.util.prod.hq.sldev.cz
 last_updated | 2022-09-17 01:43:09.229+02
 bytes_sent   | 214748364800
 bytes_total  | 214748364800
There is no error in the logs.
If I use /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t all -qc, it reports no records anywhere.
I can clean these records from the DB to fix it, but it will happen again in a few days.
vdsm.x86_64 4.50.2.2-1.el8
ovirt-engine.noarch 4.5.2.4-1.el8
Is there anything I can check to find the reason for this?
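The only other thing I have thought of so far is pulling the transfer's phase
out directly via the engine DB wrapper script (column names from memory, so
they may be off):

  /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c \
      "select disk_id, phase, type, last_updated from image_transfers;"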
Thank you Jirka
Unable to deploy to new host
by David White
I currently have a self-hosted engine that was restored from a backup of an engine that was originally in a hyperconverged state. (See https://lists.ovirt.org/archives/list/users@ovirt.org/message/APQ3XBU...).
This was also an upgrade from ovirt 4.4 to ovirt 4.5.
There were 4 hosts in this cluster. Unfortunately, 2 of them are completely in an "Unassigned" state right now, and I don't know why. The VMs on those hosts are working fine, but I have no way to move the VMs or manage them.
More to the point of this email:
I'm trying to re-deploy onto a 3rd host. I did a fresh install of Rocky Linux 8, and followed the instructions at https://ovirt.org/download/ and at https://ovirt.org/download/install_on_rhel.html, including the part there that is specific to Rocky.
After installing the centos-release-ovirt45 package, I then logged into the oVirt engine web UI, and went to Compute -> Hosts -> New, and have tried (and failed) many times to install / deploy to this new host.
The last error in the host deploy log is the following:
2022-09-18 21:29:39 EDT - { "uuid" : "94b93e6a-5410-4d26-b058-d7d1db0a151e",
"counter" : 404,
"stdout" : "fatal: [cha2-storage.mgt.example.com]: FAILED! => {\"msg\": \"The conditional check 'cluster_switch == \\\"ovs\\\" or (ovn_central is defined and ovn_central | ipaddr)' failed. The error was: The ipaddr filter requires python's netaddr be installed on the ansible controller\\n\\nThe error appears to be in '/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml': line 3, column 5, but may\\nbe elsewhere in the file depending on the exact syntax problem.\\n\\nThe offending line appears to be:\\n\\n- block:\\n - name: Install ovs\\n ^ here\\n\"}",
"start_line" : 405,
"end_line" : 406,
"runner_ident" : "e2cbd38d-64fa-4ecd-82c6-114420ea14a4",
"event" : "runner_on_failed",
"pid" : 65899,
"created" : "2022-09-19T01:29:38.983937",
"parent_uuid" : "02113221-f1b3-920f-8bd4-00000000003d",
"event_data" : {
"playbook" : "ovirt-host-deploy.yml",
"playbook_uuid" : "73a6e8f1-3836-49e1-82fd-5367b0bf4e90",
"play" : "all",
"play_uuid" : "02113221-f1b3-920f-8bd4-000000000006",
"play_pattern" : "all",
"task" : "Install ovs",
"task_uuid" : "02113221-f1b3-920f-8bd4-00000000003d",
"task_action" : "package",
"task_args" : "",
"task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml:3",
"role" : "ovirt-provider-ovn-driver",
"host" : "cha2-storage.mgt.example.com",
"remote_addr" : "cha2-storage.mgt.example.com",
"res" : {
"msg" : "The conditional check 'cluster_switch == \"ovs\" or (ovn_central is defined and ovn_central | ipaddr)' failed. The error was: The ipaddr filter requires python's netaddr be installed on the ansible controller\n\nThe error appears to be in '/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml': line 3, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n- block:\n - name: Install ovs\n ^ here\n",
"_ansible_no_log" : false
},
"start" : "2022-09-19T01:29:38.919334",
"end" : "2022-09-19T01:29:38.983680",
"duration" : 0.064346,
"ignore_errors" : null,
"event_loop" : null,
"uuid" : "94b93e6a-5410-4d26-b058-d7d1db0a151e"
}
}
On the engine, I have verified that netaddr is installed. And just for kicks, I've installed as many different versions as I can find:
[root@ovirt-engine1 host-deploy]# rpm -qa | grep netaddr
python38-netaddr-0.7.19-8.1.1.el8.noarch
python2-netaddr-0.7.19-8.1.1.el8.noarch
python3-netaddr-0.7.19-8.1.1.el8.noarch
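I haven't yet checked which interpreter the engine's ansible actually runs
under, or whether that interpreter can import netaddr; something like this
(untested) should tell:

  # which python does ansible itself use?
  head -1 "$(command -v ansible)"
  # can that interpreter import netaddr? (swap in the interpreter from the shebang above)
  /usr/bin/python3 -c 'import netaddr; print(netaddr.__file__)'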
The engine is based on CentOS Stream 8 (when I moved the engine out of the hyperconverged environment, my goal was to keep things as close to the original environment as possible)
[root@ovirt-engine1 host-deploy]# cat /etc/redhat-release
CentOS Stream release 8
The engine is fully up-to-date:
[root@ovirt-engine1 host-deploy]# uname -a
Linux ovirt-engine1.mgt.barredowlweb.com 4.18.0-408.el8.x86_64 #1 SMP Mon Jul 18 17:42:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
And the engine has the following repos:
[root@ovirt-engine1 host-deploy]# yum repolist
repo id                                 repo name
appstream CentOS Stream 8 - AppStream
baseos CentOS Stream 8 - BaseOS
centos-ceph-pacific CentOS-8-stream - Ceph Pacific
centos-gluster10 CentOS-8-stream - Gluster 10
centos-nfv-openvswitch CentOS-8 - NFV OpenvSwitch
centos-opstools CentOS-OpsTools - collectd
centos-ovirt45 CentOS Stream 8 - oVirt 4.5
extras CentOS Stream 8 - Extras
extras-common CentOS Stream 8 - Extras common packages
ovirt-45-centos-stream-openstack-yoga CentOS Stream 8 - oVirt 4.5 - OpenStack Yoga Repository
ovirt-45-upstream oVirt upstream for CentOS Stream 8 - oVirt 4.5
powertools CentOS Stream 8 - PowerTools
Why does deploying to this new Rocky host keep failing?
Sent with Proton Mail secure email.