Re: Power failure makes cluster and hosted engine unusable

Hi Seann,

On Mon, Mar 29, 2021 at 8:31 PM Seann G. Clark via Users <users@ovirt.org> wrote:
All,
After a power failure and a generator failure, I lost my cluster, and the hosted engine refused to restart after power was restored. I would expect that, once storage comes up, the hosted engine comes back online without too much of a fight. In practice, because the SPM went down as well, there is no (clearly documented) way to clear any of the stale locks, and no way to recover both the hosted engine and the cluster.
Could you provide more details/logs on storage not coming up? Also, more information about the current locks would be great; is there any procedure you tried for cleaning them up that did not work?

I have spent the last 12 hours trying to get a functional hosted engine back online on a new node, and each attempt hits a new error: from the installer not understanding that 16384 MB of dedicated VM memory, out of 192 GB free on the host, is indeed bigger than 4096 MB, to Ansible dying on an error like this: "Error while executing action: Cannot add Storage Connection. Storage connection already exists."
The memory error referenced above shows up as:
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Available memory ( {'failed': False, 'changed': False, 'ansible_facts': {u'max_mem': u'180746'}}MB ) is less then the minimal requirement (4096MB). Be aware that 512MB is reserved for the host and cannot be allocated to the engine VM."}
That is what I typically get when I try the steps outlined in "Chapter 7. Recovering a Self-Hosted Engine from an Existing Backup" in the RH Customer Portal. I have tried this numerous ways, and the cluster still remains in a bad state, with the hosted engine 100% inoperable.
This could be a bug in the Ansible role. Did that happen during "hosted-engine --deploy" or during another part of the recovery guide? Please provide logs here as well; it seems like a completely separate issue, though.
What I do have are the two hosts that are part of the cluster and can host the engine, and backups of the original hosted engine, both the disk and an engine-backup generated one. I am not sure what I can do next to recover this cluster; any suggestions would be appreciated.
Regards,
Seann
_______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-leave@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/JLDIFTKYDPQ6YK...

Roman, I believe the bug is in /usr/share/ansible/roles/ovirt.hosted_engine_setup/tasks/pre_checks/validate_memory_size.yml

- name: Set Max memory
  set_fact:
    max_mem: "{{ free_mem.stdout|int + cached_mem.stdout|int - he_reserved_memory_MB + he_avail_memory_grace_MB }}"

If these lines are casting the result of `free -m` to int, that seems to fail at bigger RAM sizes. I wound up having to delete all the available-memory checks from that file to get the wizard to progress on a machine with 512 GB of RAM.
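The int-cast hypothesis can be probed in isolation. A small sketch, assuming Jinja2's `|int` filter behaves like Python's `int()` with a fallback default of 0 (which it does), shows that the size of the number is not the problem; the cast only collapses when the registered stdout is not a bare number:

```python
# Sketch of Jinja2's |int filter behavior (assumption: modeled here in plain
# Python; the real filter lives in jinja2.filters).
def jinja_int(value, default=0):
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

print(jinja_int('180746'))       # 180746 -- the value from Seann's error converts fine
print(jinja_int('524288'))       # 524288 -- 512 GB expressed in MB is also no problem
print(jinja_int('Mem: 524288'))  # 0 -- unparsed free(1) output silently becomes 0
print(jinja_int(None))           # 0 -- a missing value also silently becomes 0
```

So if the checks misbehave on big-RAM hosts, the more likely culprit is what got registered into `free_mem.stdout`/`cached_mem.stdout`, not an overflow in the cast itself.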

Hi Thomas,

Thanks for looking into this; the problem is really somewhere around this tasks file. However, I just tried faking the memory values directly inside the tasks file to something way higher, and everything looks fine. I think the problem resides in registering the output of "free -m" at the beginning of this file. There are also debug tasks which print the registered values from the shell commands, where we could take a closer look and see if it looks normal (stdout mainly).

This part of the output that Seann provided seems particularly strange:

Available memory ( {'failed': False, 'changed': False, 'ansible_facts': {u'max_mem': u'180746'}}MB )

Normally it should just show the exact value/string; here we're most likely getting a dictionary from Python. I'd check whether the latest version of Ansible is installed and, if an update was available, whether this can still be reproduced afterwards. If the issue persists, please provide the full log of the Ansible run (ideally with -vvvv).

-Roman

On Wed, Mar 31, 2021 at 9:19 PM Thomas Hoberg <thomas@hoberg.net> wrote:
Roman, I believe the bug is in /usr/share/ansible/roles/ovirt.hosted_engine_setup/tasks/pre_checks/validate_memory_size.yml
- name: Set Max memory
  set_fact:
    max_mem: "{{ free_mem.stdout|int + cached_mem.stdout|int - he_reserved_memory_MB + he_avail_memory_grace_MB }}"
If these lines are casting the result of `free -m` to int, that seems to fail at bigger RAM sizes.
I wound up having to delete all the available-memory checks from that file to get the wizard to progress on a machine with 512 GB of RAM.
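Roman's point about a dictionary leaking into the output can be illustrated with a short sketch. This is a hypothetical reproduction, not the actual role code: if the value interpolated into the failure message is the whole registered task result rather than the extracted `max_mem` fact, the message embeds the dict exactly as in Seann's error:

```python
# A registered Ansible task result looks like this dict (values copied from
# Seann's error output).
result = {'failed': False, 'changed': False,
          'ansible_facts': {'max_mem': '180746'}}

# What the check presumably intends to print: just the fact's value.
good_msg = "Available memory ( {}MB )".format(result['ansible_facts']['max_mem'])

# What appears if the full result object is passed through instead.
bad_msg = "Available memory ( {}MB )".format(result)

print(good_msg)  # Available memory ( 180746MB )
print(bad_msg)   # Available memory ( {'failed': False, ...}MB )
```

That would explain why the comparison misfires regardless of how much RAM the host has: the string being compared is a dict's repr, not a number.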

Following up on this, I was able to recover everything, with only minor (and easy to fix) data loss.

The old hosted engine refused to come up, even after a few hours of sitting. That is when I dug into the issue and found the agent service stating the image didn't exist (no such file or directory). It seems that was just one aspect of storage being impacted by the unexpected outage.

As for the memory issue, I was only getting it on one host, and was able to install and recover on another host in my cluster without the issue. The broken host has these versions of the Ansible engine-setup packages:

ansible-2.9.18-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.32-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.13-1.el7.noarch

The one that works has:

ansible-2.8.3-1.el7.noarch
ovirt-ansible-hosted-engine-setup-1.0.26-1.el7.noarch
ovirt-ansible-engine-setup-1.1.9-1.el7.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7.noarch

All of the SANLOCK issues I saw before were remediated on the new deployment and recovery of the cluster as well.

Regards,
Seann

From: Roman Bednar
Sent: Thursday, April 01, 2021 6:07 AM
To: Thomas Hoberg <thomas@hoberg.net>
Cc: users@ovirt.org
Subject: [ovirt-users] Re: Power failure makes cluster and hosted engine unusable
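The two package lists above can be diffed mechanically to isolate the regression window. A small sketch (the version data is copied from the listing in this thread, not queried from any live host):

```python
# Package versions on the host where deployment fails (from Seann's listing).
broken = {
    "ansible": "2.9.18-1.el7",
    "ovirt-ansible-hosted-engine-setup": "1.0.32-1.el7",
    "ovirt-ansible-engine-setup": "1.1.9-1.el7",
    "ovirt-hosted-engine-setup": "2.3.13-1.el7",
}
# Package versions on the host where deployment succeeds.
working = {
    "ansible": "2.8.3-1.el7",
    "ovirt-ansible-hosted-engine-setup": "1.0.26-1.el7",
    "ovirt-ansible-engine-setup": "1.1.9-1.el7",
    "ovirt-hosted-engine-setup": "2.3.11-1.el7",
}

# Keep only the packages whose versions differ between the two hosts.
changed = {p: (working[p], broken[p]) for p in broken if broken[p] != working[p]}
for pkg, (ok, bad) in sorted(changed.items()):
    print(f"{pkg}: works at {ok}, fails at {bad}")
```

The diff narrows the suspects to ansible itself and the two hosted-engine-setup packages, while ovirt-ansible-engine-setup is identical on both hosts; that is consistent with the bug living in the hosted-engine-setup role or in how the newer Ansible registers shell output.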
participants (3)

- Roman Bednar
- Seann G. Clark
- Thomas Hoberg