Following up on this, I was able to recover everything, with only minor (and easy to fix) data loss.
The old hosted engine refused to come up, even after a few hours of sitting. That is when I dug into the issue and found the agent service reporting that the image didn't exist (no such file or directory). It seems that was just one aspect of storage being impacted by the unexpected outage.
Regarding the memory issue, I was only getting it on one host; I was able to install and recover on another host in my cluster without it.
The broken host has these versions of the Ansible and engine setup packages:
ansible-2.9.18-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.32-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.13-1.el7.noarch
The host that works has:
ansible-2.8.3-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.26-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7.noarch
All of the SANLOCK issues I saw before were also remediated by the new deployment and recovery of the cluster.
Regards,
Seann
From: Roman Bednar
Sent: Thursday, April 01, 2021 6:07 AM
To: Thomas Hoberg <thomas@hoberg.net>
Cc: users@ovirt.org
Subject: [ovirt-users] Re: Power failure makes cluster and hosted engine unusable
Hi Thomas,
Thanks for looking into this; the problem really is somewhere around this tasks file. However, I just tried faking the memory values directly inside the tasks file, setting them to something way higher, and everything looks fine. I think the problem resides in registering the output of "free -m" at the beginning of this file. There are also debug tasks that print the registered values from the shell commands; we could take a closer look at those and see whether they look normal (stdout mainly).
This part of the output that Seann provided seems particularly strange: Available memory ( {'failed': False, 'changed': False, 'ansible_facts': {u'max_mem': u'180746'}}MB )
Normally it should just show the exact value/string; here we are most likely getting a Python dictionary instead. I'd check whether the latest version of Ansible is installed and, if an update is available, whether the issue can still be reproduced after updating.
If the issue persists, please provide the full log of the Ansible run (ideally with -vvvv).
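
To illustrate the symptom, here is a minimal Python sketch. It is not the role's code: the `result` dictionary is simply copied from the log line quoted above (without the Python 2 `u` prefixes), and `describe` and `unwrap_max_mem` are hypothetical helper names. It only shows that interpolating the whole dictionary into a message produces the strange text, while the plain value sits under `ansible_facts`.

```python
# Value copied from the quoted log line (Python 2 "u" prefixes dropped).
result = {
    'failed': False,
    'changed': False,
    'ansible_facts': {'max_mem': '180746'},
}


def describe(value):
    """Interpolate a value into the message, as a template would (hypothetical helper)."""
    return "Available memory ( {}MB )".format(value)


def unwrap_max_mem(value):
    """Return max_mem as an int from either a plain string or a wrapped dictionary."""
    if isinstance(value, dict):
        value = value['ansible_facts']['max_mem']
    return int(value)


# Interpolating the dictionary itself reproduces the odd-looking message.
print(describe(result))

# Interpolating the unwrapped value gives the expected form.
print(describe(unwrap_max_mem(result)))
```

If the variable ends up holding a dictionary like this, any later numeric comparison against it would not behave like a comparison against a number, whatever the amount of RAM.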
-Roman
On Wed, Mar 31, 2021 at 9:19 PM Thomas Hoberg <thomas@hoberg.net> wrote:
Roman, I believe the bug is in /usr/share/ansible/roles/ovirt.hosted_engine_setup/tasks/pre_checks/validate_memory_size.yml
- name: Set Max memory
  set_fact:
    max_mem: "{{ free_mem.stdout|int + cached_mem.stdout|int - he_reserved_memory_MB + he_avail_memory_grace_MB }}"
If these lines are casting the result of `free -m` into 'int', that seems to fail at bigger RAM sizes.
I wound up having to delete all the available memory checks from that file to have the wizard progress on a machine with 512GB of RAM.
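
As a rough check of the arithmetic in that expression, here is a small Python sketch. All numbers are made-up placeholders for a machine with roughly 512GB of RAM, not values read from a real host or from the role's defaults, and `max_mem_mb` is a hypothetical name. It assumes the `int` filter behaves like Python's `int()` for well-formed digit strings.

```python
def max_mem_mb(free_stdout, cached_stdout, reserved_mb, grace_mb):
    """Mirror the set_fact expression: free + cached - reserved + grace, in MB."""
    return int(free_stdout) + int(cached_stdout) - reserved_mb + grace_mb


# Placeholder inputs standing in for free_mem.stdout, cached_mem.stdout,
# he_reserved_memory_MB and he_avail_memory_grace_MB.
print(max_mem_mb("500000", "12000", 512, 200))  # 511688
```

Python integers have arbitrary precision, so under that assumption the cast alone would not be expected to fail at these sizes. This does not rule out a problem elsewhere in the file.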
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/CARDJXYUPFUFJT2VE2UNXELL2PSUZSPS/