Following up on this, I was able to recover everything, with only minor (and easy to fix)
data loss.
The old hosted engine refused to come up, even after sitting for a few hours. That is when
I dug into the issue and found the agent service reporting that the image didn't exist
(no such file or directory). It seems that was just one aspect of storage being affected by
the unexpected outage.
Regarding the memory issue, I was only getting it on one host, and was able to
install and recover on another host in my cluster without the issue.
The broken host has these versions of ansible and the engine setup packages:
ansible-2.9.18-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.32-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.13-1.el7.noarch
The host that works has:
ansible-2.8.3-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.26-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7.noarch
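To make the difference between the two hosts easier to see, here is a small Python
sketch that compares the two package lists above. The helper name split_name_version
is my own and not part of any oVirt or rpm tooling, and it assumes every entry follows
the usual name-version-release.arch pattern.

```python
# Package lists copied from the two hosts described above.
broken = """ansible-2.9.18-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.32-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.13-1.el7.noarch"""

working = """ansible-2.8.3-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.26-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7.noarch"""


def split_name_version(pkg):
    """Hypothetical helper: split 'name-version-release.arch' into (name, version)."""
    name, version, _release_arch = pkg.rsplit("-", 2)
    return name, version


broken_versions = dict(split_name_version(p) for p in broken.split())
working_versions = dict(split_name_version(p) for p in working.split())

# Keep only the packages whose version differs between the two hosts.
diff = {
    name: (working_versions.get(name), version)
    for name, version in broken_versions.items()
    if working_versions.get(name) != version
}

for name, (good, bad) in sorted(diff.items()):
    print(f"{name}: working host {good}, broken host {bad}")
```

Going by the lists above, only ovirt-ansible-engine-setup is at the same version on both
hosts; the other three packages are newer on the broken host.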
All of the SANLOCK issues I saw before were also resolved by the new deployment and
recovery of the cluster.
Regards,
Seann
From: Roman Bednar
Sent: Thursday, April 01, 2021 6:07 AM
To: Thomas Hoberg <thomas(a)hoberg.net>
Cc: users(a)ovirt.org
Subject: [ovirt-users] Re: Power failure makes cluster and hosted engine unusable
Hi Thomas,
Thanks for looking into this, the problem is really somewhere around this tasks file.
However I just tried faking the memory values directly inside the tasks file to something
way higher and everything looks fine. I think the problem resides in registering the
output of the "free -m" at the beginning of this file. There are also debug
tasks which print registered values from the shell commands where we could take a closer
look, see if it looks normal (stdout mainly).
This part of the output that Seann provided seems particularly strange: Available
memory ( {'failed': False, 'changed': False, 'ansible_facts':
{u'max_mem': u'180746'}}MB )
Normally it should just show the exact value as a string; here we're getting what is
most likely a Python dictionary instead. I'd check whether the latest version of ansible
is installed and, if an update is available, see whether this can still be reproduced
afterwards.
If the issue persists, please provide a full log of the ansible run (ideally with -vvvv).
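To illustrate the two points above, here is a rough Python sketch (not the actual role
code). It mirrors the arithmetic of the set_fact expression with made-up numbers and
shows how the message changes when the variable holds a dictionary rather than the plain
value. The function name max_mem_mb, the memory figures, and the reserved/grace values
are all assumptions for illustration only.

```python
def max_mem_mb(free_stdout, cached_stdout, reserved_mb, grace_mb):
    """Mirror the expression: int(free) + int(cached) - reserved + grace."""
    return int(free_stdout) + int(cached_stdout) - reserved_mb + grace_mb


# Made-up values for a host with roughly 512 GB of RAM. The reserved and
# grace numbers are hypothetical stand-ins for he_reserved_memory_MB and
# he_avail_memory_grace_MB. Integers of this size are no problem in Python.
big = max_mem_mb("500000", "20000", reserved_mb=512, grace_mb=200)

# The message when the variable holds the plain value.
ok_msg = "Available memory ( {}MB )".format(big)

# The message when the variable holds a dictionary shaped like the one in
# the output quoted above.
result_dict = {
    "failed": False,
    "changed": False,
    "ansible_facts": {"max_mem": "180746"},
}
odd_msg = "Available memory ( {}MB )".format(result_dict)

print(ok_msg)
print(odd_msg)
```

The second message has the same shape as the strange output quoted above, which is why
the value being substituted into the message looks more suspicious than the integer
arithmetic itself.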
-Roman
On Wed, Mar 31, 2021 at 9:19 PM Thomas Hoberg
<thomas@hoberg.net> wrote:
Roman, I believe the bug is in
/usr/share/ansible/roles/ovirt.hosted_engine_setup/tasks/pre_checks/validate_memory_size.yml
- name: Set Max memory
  set_fact:
    max_mem: "{{ free_mem.stdout|int + cached_mem.stdout|int - he_reserved_memory_MB + he_avail_memory_grace_MB }}"
If these lines are casting the result of `free -m` into 'int', that seems to fail
at bigger RAM sizes.
I wound up having to delete all the available memory checks from that file to have the
wizard progress on a machine with 512GB of RAM.