On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
On Mon, Nov 23, 2020 at 9:54 AM Alex K
<rightkicktech(a)gmail.com> wrote:
>
>
>
> On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David <didi(a)redhat.com>
wrote:
>>
>> On Thu, Nov 19, 2020 at 9:43 PM Alex K <rightkicktech(a)gmail.com> wrote:
>>>
>>>
>>>
>>> On Thu, Nov 19, 2020 at 5:31 PM Alex K <rightkicktech(a)gmail.com>
wrote:
>>>>
>>>> Hi Didi,
>>>>
>>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David
<didi(a)redhat.com>
wrote:
>>>>>
>>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K
<rightkicktech(a)gmail.com>
wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a corrupt self-hosted engine (with several file system
errors, postgres not able to start) and thus it does not give access to the
web UI. This happened following an unlucky split brain resolution (I am
running 2 nodes). The two hosts are running VMs also which I would like to
keep running as they are needed.
>>>>>>
>>>>>> When trying to boot into rescue mode (using
systemd.unit=emergency.target boot parameter) I get a cursor and nothing
else.
>>>>>
>>>>>
>>>>> This means that more than just the DB is corrupt...
>>>>>
>>>>>>
>>>>>>
>>>>>> I have backups of engine files with scope all (using the
engine-backup tool).
>>>>>> What is the best approach to try to fix the engine or redeploy?
>>>>>
>>>>>
>>>>> If you are careful, and know what you are doing, you can try
something like the following. I am not giving many details, hopefully you
can find on the net tutorials about how to use the things I suggest:
>>>>>
>>>>> 1. Move to global maintenance
>>>>>
>>>>> 2. Stop the current dead vm (if needed)
>>>>>
>>>>> 3. Find current vm conf, edit it to boot from a rescue iso image of
your preference or from net/PXE etc., and start the vm with '--vm-conf'
pointing to your edited file.
>>>>>
>>>>> 4. Connect a console (hosted-engine --console, or 'virsh console', or use '--add-console-password' and remote viewer, if needed)
>>>>>
>>>>> 5. Clean the disk and install the OS, oVirt, etc.
>>>>>
>>>>> 6. Copy your backup into the vm and restore with engine-backup
>>>>>
>>>>> 7. Then cleanly stop the machine, exit global maint, and let HA
start it (or start it yourself with --vm-start).
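
The host-side part of the steps above (1 to 4, and 7) can be sketched as a shell outline. This is only a sketch under assumptions, not a tested procedure: the `run` helper, the `DRY_RUN` variable, the function names and the path of the edited VM configuration are hypothetical and not from the thread. It assumes `hosted-engine` accepts the options used here (`--set-maintenance`, `--vm-shutdown`, `--vm-start --vm-conf`, `--console`) on your version.

```shell
#!/bin/sh
# Sketch only: host-side steps 1-4 and 7 from the list above.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_engine_vm() {
    # $1: edited copy of the engine VM configuration (placeholder path),
    #     already changed to boot from a rescue ISO or from the network.
    vm_conf="$1"
    run hosted-engine --set-maintenance --mode=global   # step 1
    run hosted-engine --vm-shutdown                      # step 2, if the VM is still up
    run hosted-engine --vm-start --vm-conf="$vm_conf"    # step 3
    run hosted-engine --console                          # step 4
}

finish_rescue() {
    # Step 7: stop the VM cleanly, then leave global maintenance so HA restarts it.
    run hosted-engine --vm-shutdown
    run hosted-engine --set-maintenance --mode=none
}
```

With `DRY_RUN=1`, calling `rescue_engine_vm /root/edited-vm.conf` only prints the commands, which makes it easy to review them before running anything against a live host. Steps 5 and 6 happen inside the VM and are not covered by this outline.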
>>>>>
>>>>> At the time, we had a bug [1] to document this. The result is [2].
It does not detail how to boot/reinstall os/etc., only restore (if e.g. db
is dead but fs is ok).
>>>>> For something somewhat similar to what you want, see also [3], which uses guestfish. Might be useful, depending on how badly your disk is corrupted.
>>>>
>>>> I went with the guestfish approach. It has fixed some fs issues, and now yum etc. seem fine apart from postgres.
>>>> I had previously tried to uninstall/install packages, so I ended up installing them again with yum install ovirt\*setup\*.
>>>> Now I think I have to run engine-setup but I get the error:
>>>>
>>>> Failed to execute stage 'Environment setup': Cannot connect to
Engine database using existing credentials: engine@localhost:5432
>>>
>>> Seems that I need to have psql running to be able to run engine-backup
--mode=restore. Are there any steps how one could manually prepare pgsql
for ovirt so as to attempt restoration?
>>
>>
>> Replying again, also to conclude this part of your episode: Generally
speaking, that's not needed. restore --provision-all-databases should do
that for you.
>
> Seems that when pgsql is down nothing can be done. You need at least pgsql up and running (a clean state will do) so as to be able to proceed with restoration.
Do you still have logs from this? Both engine-backup's (default to
/var/log/ovirt-engine-backup/something if you do not pass --log) and
ovirt-engine-provisiondb which it runs (at
/var/log/ovirt-engine/setup).
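
A small sketch for pulling the relevant lines out of those two log locations. The function names `newest_log` and `fatal_lines` are hypothetical helpers, not part of oVirt; the directories are the ones named above, and the sketch assumes the newest file in each directory is the one from the failed run.

```shell
#!/bin/sh
# Sketch: locate the most recent log in a directory and show its FATAL lines.

newest_log() {
    # $1: directory to search; prints the newest entry, nothing if the directory is empty.
    dir="$1"
    [ -d "$dir" ] || return 1
    # ls -t sorts by modification time, newest first; keep only the first entry.
    newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
    [ -n "$newest" ] && echo "$dir/$newest"
}

fatal_lines() {
    # $1: log file; prints every line that contains FATAL.
    grep 'FATAL' "$1"
}

# Example usage (paths as mentioned in the thread):
#   fatal_lines "$(newest_log /var/log/ovirt-engine-backup)"
#   newest_log /var/log/ovirt-engine/setup
```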
I was using --provision-all-databases flag when trying to restore. I might
retest to double check. When the pgsql was down, I was getting:
2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all
file /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
2020-11-19 22:06:35 4947: OUTPUT: scope: all
2020-11-19 22:06:35 4947: OUTPUT: archive file:
/var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file
'/var/backup/daily.0/engine-backup.gz'
2020-11-19 22:06:35 4947: Opening tarball
/var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
2020-11-19 22:06:35 4947: Verifying hash
2020-11-19 22:06:35 4947: Verifying version
2020-11-19 22:06:35 4947: Reading config
2020-11-19 22:06:35 4947: OUTPUT: Restoring:
2020-11-19 22:06:35 4947: OUTPUT: - Files
2020-11-19 22:06:35 4947: Restoring files
2020-11-19 22:06:36 4947: Reloading configuration
2020-11-19 22:06:36 4947: Generating pgpass
2020-11-19 22:06:36 4947: Verifying connection
2020-11-19 22:06:36 4947: pg_cmd running: psql -w -U engine -h localhost -p
5432 engine -c select 1
psql: FATAL: Ident authentication failed for user "engine"
2020-11-19 22:06:36 4947: FATAL: Can't connect to database 'engine'. Please
see '/usr/bin/engine-backup --help'.
Not sure what you mean by "a clean state will do". If you just install PG, it is not enabled by default, so it is not "up and running".
I mean pgsql re-installed and the data stored cleaned as below:
rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
/opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
systemctl restart rh-postgresql10-postgresql.service
Generally speaking:
If you never started/inited PG (e.g. on a clean machine), restore,
with --provision-all-databases, does this for you. Are you sure you
passed this?
I am pretty sure I used that flag but might be able to repeat for testing.
If you did, and created DB/user with the same name it wants to restore
to, but left the DB empty, it will use it.
If you populated the DB, it will fail with a suitable error message.
Confirmed. When I created the DB and users it was failing. So I cleaned
everything, started pgsql and left the tool to do its job.
These are the states that are intended to be supported.
Anything else might break it in other ways.
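
For the supported case (PG never initialized, or an empty DB/user with the expected name), the restore call discussed here might look like the sketch below. The wrapper name `restore_engine` and the `DRY_RUN` variable are hypothetical; the options are the ones that appear in this thread and its log (`--mode=restore`, scope all, `--provision-all-databases`). Depending on the engine-backup version, further options (for example `--restore-permissions`) may be required; they are left out here because the thread does not mention them.

```shell
#!/bin/sh
# Sketch: wrap the engine-backup restore call with --provision-all-databases.
# DRY_RUN defaults to 1, so the command is printed rather than executed.

restore_engine() {
    # $1: backup archive, $2: log file (both are placeholder paths)
    archive="$1"
    log="$2"
    if [ ! -r "$archive" ]; then
        echo "cannot read backup archive: $archive" >&2
        return 1
    fi
    # Build the command once so the dry-run output matches what would be run.
    set -- engine-backup --mode=restore --scope=all \
        --file="$archive" --log="$log" --provision-all-databases
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Example usage, with the archive path from the log above:
#   restore_engine /var/backup/daily.0/engine-backup.gz restore.log
```

Do not create the database or users by hand before running it: as noted above, a populated DB makes the restore fail.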
>>
>>
>> I replied to all your interim emails in private, since you replied in
private.
>
> Did not notice I was replying in private :)
NP :-)
>>
>>
>> Thanks for the final message to the list.
>>
>> It would be nice if you send another summary of the main obstacles you
ran into, what worked and didn't work, and especially what ideas you can
think of to improve the code/doc for the next time something similar
happens (also to you :-) ).
>>
>> If you feel like that, and have time, it sounds like a nice opportunity
for a blog post :-) (I know I (almost?) never wrote any myself, sorry, but
I like reading them - and they are much more approachable and useful, over
the long run, compared to just posting to the list).
>
> Noted. Will check to put this in a blog. Generally the missing part
from the docs was that one cannot proceed with the restoration if pgsql is
not able to start. So I had to clean re-install pgsql and initialize its
data store before proceeding with the restoration.
Well, I'd definitely not want a blog post saying you must manually
init PG - if you indeed must, that's a bug, so I'd rather fix it
first.
Noted.
Thanks and best regards,
>>
>>
>> Best regards,
>>
>>>>
>>>>
>>>> So I guess I need to follow [2]. What do you think?
>>>>
>>>>>
>>>>> How did you run into a split brain? There is a lock on the shared
storage that should prevent this.
>>>>>
>>>>> Good luck and best regards,
>>>>>
>>>>> [1]
https://bugzilla.redhat.com/show_bug.cgi?id=1482710
>>>>> [2]
https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_S...
>>>>> [3]
https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
>>>>> --
>>>>> Didi
>>
>>
>>
>> --
>> Didi
>
--
Didi