On Sat, Aug 14, 2021 at 8:58 AM Gilboa Davara <gilboad(a)gmail.com> wrote:
Shabbat Shalom,
On Wed, Aug 11, 2021 at 10:03 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Tue, Aug 10, 2021 at 9:20 PM Gilboa Davara <gilboad(a)gmail.com> wrote:
> >
> > Hello,
> >
> > Many thanks again for taking the time to try and help me recover this machine
> > (even though it would have been far easier to simply redeploy it...)
> >
> >> >
> >> >
> >> > Sadly enough, it seems that --clean-metadata requires an active agent.
> >> > E.g.
> >> > $ hosted-engine --clean-metadata
> >> > The hosted engine configuration has not been retrieved from shared
> >> > storage. Please ensure that ovirt-ha-agent
> >> > is running and the storage server is reachable.
> >>
> >> Did you try to search the net/list archives?
> >
> >
> > Yes. All of them seem to repeat the same clean-metadata command (which fails).
>
> I suppose we need better documentation. Sorry. Perhaps open a
> bug/issue about that.
Done.
https://bugzilla.redhat.com/show_bug.cgi?id=1993575
Thanks.
>
>
> >
> >>
> >>
> >> >
> >> > Can I manually delete the metadata state files?
> >>
> >> Yes, see e.g.:
> >>
> >>
> >> https://lists.ovirt.org/pipermail/users/2016-April/072676.html
> >>
> >> As an alternative to the 'find' command there, you can also find the
> >> IDs with:
> >>
> >> $ grep metadata /etc/ovirt-hosted-engine/hosted-engine.conf
> >>
> >> Best regards,
> >> --
> >> Didi
> >
> >
> > Yippie! Success (At least it seems that way...)
> >
> > Following https://lists.ovirt.org/pipermail/users/2016-April/072676.html,
> > I stopped the broker and agent services, archived the existing hosted metadata
> > files, created an empty 1GB metadata file using dd (dd if=/dev/zero
> > of=/run/vdsm/storage/<uuid>/<uuid> bs=1M count=1024), making double sure the
> > permissions (0660 / 0644), owner (vdsm:kvm), and SELinux labels (restorecon,
> > just in case) stayed the same.
> > Let everything settle down.
> > Restarted the services....
> > ... and everything is up again :)
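[For reference, the reset step described above could be sketched roughly like
this. This is a hypothetical shell sketch, not a command from the thread: the
helper name is made up, the real metadata file lives under a
/run/vdsm/storage/<uuid>/... path you must look up yourself (e.g. via the grep
shown earlier), and on a real host the HA services must be stopped first.]

```shell
# Hypothetical helper mirroring the steps described above:
# archive the old metadata file, then recreate it as 1 GiB of zeros.
reset_metadata() {
  local f="$1"
  cp -a "$f" "$f.bak.$(date +%s)"                       # archive the old file first
  dd if=/dev/zero of="$f" bs=1M count=1024 status=none  # recreate as 1 GiB of zeros
}

# On a real host the surrounding steps (placeholders, not verbatim commands)
# would look something like:
#   systemctl stop ovirt-ha-agent ovirt-ha-broker
#   reset_metadata /run/vdsm/storage/<sd_uuid>/<img_uuid>/<vol_uuid>
#   chown vdsm:kvm <file> && chmod 0660 <file> && restorecon <file>
#   systemctl start ovirt-ha-broker ovirt-ha-agent
```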
> >
> > I plan to let the engine run overnight with zero VMs (making sure all backups
> > are fully up-to-date).
> > Once done, I'll return to normal (until I replace this setup with a normal
> > multi-node setup).
> >
> > Many thanks again!
>
> Glad to hear that, welcome, thanks for the report!
>
> More tests you might want to do before starting your real VMs:
>
> - Set and later clear global maintenance from each host, and see that this
> propagates to the others (check both 'hosted-engine --vm-status' and agent.log)
>
> - Migrate the engine VM between the hosts and see that this propagates
>
> - Shut down the engine VM without global maintenance and see that it's started
> automatically.
>
> But I do not think all of this is mandatory, if 'hosted-engine --vm-status'
> looks ok on all hosts.
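[The maintenance checks suggested above boil down to a handful of hosted-engine
invocations; here is a dry-run sketch. The leading echo makes the commands
print instead of execute; drop it to run them for real on a host.]

```shell
# Dry-run sketch of the global-maintenance round trip suggested above.
# HE is prefixed with echo so the commands only print; set HE="hosted-engine"
# on a real host to actually execute them.
HE="echo hosted-engine"

$HE --set-maintenance --mode=global   # set global maintenance on one host
$HE --vm-status                       # on the other hosts: should show the flag set
$HE --set-maintenance --mode=none     # clear it again
$HE --vm-status                       # confirm it cleared everywhere
```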
>
> I'd still be careful with other things that might have been corrupted,
> though - obviously can't tell you what/where...
>
Host is back to normal.
The log looks clean (minus some odd SMTP errors).
That's normal if you didn't configure a local mail server (which is the default).
Either way, I'm already in the process of replacing this setup with a real
3-host + gluster setup, so I just need this machine to survive the next
couple of weeks :)
Good luck and best regards,
--
Didi