On Wed, Feb 3, 2021 at 4:21 PM Roderick Mooi <roderick(a)sanren.ac.za> wrote:
Hi,
> Any idea how this happened?
Somehow related to the power being "pulled" at the wrong time?
> Perhaps this is a backup done by emacs?
Not sure what did it, but I'm glad it did ;)
> Please compare it to your other hosts. It should be (mostly?)
> identical, but make sure that host_id= is unique per host. It should
> match the spm host id for this host in the engine database.
I had to restore one of my hosts (host 1) manually due to a cleanup during my re-deploy
attempts. I managed to do this successfully by copying the missing files from another host
(host 2), but the first time the host ID matched one of the other hosts (which made at
least hosted-engine --vm-status unhappy) [I hadn't seen your email yet :(]. I
subsequently corrected the host_id and rebooted the guilty host. Things mostly seem to be
working now, except that in hosted-engine --vm-status my first two hosts (the one I copied
the .conf from as well as the one I copied it to [without changing the ID :O]) now show
the same hostname :-/ I'm assuming there's a mismatch in the engine database -
where/how do I fix that?
I didn't check, but am pretty certain that it's not related to the
engine db. Do you see such duplicates there as well (using the web ui
or sql against it)? If so, fix these first. If there is no other way, put the
host into maintenance and reinstall it with the correct name.
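If you want to check with sql, something like this should show them (a
rough sketch - assuming the default 'engine' database, and that
vds_static is still where the host names and spm ids live; adjust as
needed):

sudo -u postgres psql engine -c \
    "select vds_id, vds_name, host_name, vds_spm_id from vds_static;"

Each host should appear once, with a unique vds_spm_id matching the
host_id= in its hosted-engine.conf.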
If it's just the shared storage, you can try the following. Carefully.
Didn't try myself. Try on a test system first.
1. Set global maintenance
2. Stop ovirt-ha-agent, ovirt-ha-broker, perhaps also vdsmd, supervdsmd
3. hosted-engine --clean-metadata --host-id=1
- Perhaps even pass --force-cleanup, not sure when it's needed
- Repeat for other IDs as needed
4. Start ovirt-ha-agent (I think this should start all the others, but
make sure)
5. Wait a bit. I am pretty certain that they should recreate their
entries in the shared storage and eventually --vm-status should look
ok.
6. Exit global maintenance
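Roughly, the whole sequence would look something like this (untested,
just a sketch - adjust the host id to your setup, and repeat the
clean-metadata step per host id as needed):

hosted-engine --set-maintenance --mode=global
systemctl stop ovirt-ha-agent ovirt-ha-broker
hosted-engine --clean-metadata --host-id=1
systemctl start ovirt-ha-agent
hosted-engine --vm-status
hosted-engine --set-maintenance --mode=none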
Good luck,
Appreciated! (and happy cos our cluster is almost back to normal :))
On 2021/02/03 11:30, Yedidyah Bar David wrote:
> On Wed, Feb 3, 2021 at 11:12 AM Roderick Mooi <roderick(a)sanren.ac.za> wrote:
>>
>> Hello and thanks for assisting!
>>
>> I think I may have found the problem :)
>>
>> /etc/ovirt-hosted-engine/hosted-engine.conf
>>
>> is blank.
>>
>> But I do have hosted-engine.conf~
>
> Any idea how this happened?
>
> Perhaps this is a backup done by emacs?
>
>>
>> Can I cp this to restore the original?
>
> Please compare it to your other hosts. It should be (mostly?)
> identical, but make sure that host_id= is unique per host. It should
> match the spm host id for this host in the engine database.
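>
> For example (a rough sketch - 'host2' is a hypothetical name for one
> of your good hosts):
>
> diff /etc/ovirt-hosted-engine/hosted-engine.conf \
>     <(ssh host2 cat /etc/ovirt-hosted-engine/hosted-engine.conf)
> grep '^host_id=' /etc/ovirt-hosted-engine/hosted-engine.conf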
>
>>
>> Anything else I need to do?
>
> Not sure, but better find the root cause to make sure no other damage was done.
>
> Good luck,
>
>>
>> Appreciated
>>
>>
>> On 2021/02/02 11:37, Strahil Nikolov wrote:
>>> Usually,
>>>
>>> I would start with checking the contents of
>>> /var/log/ovirt-hosted-engine-ha/{broker,agent}.log
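>>> For example, something like:
>>> grep -iE 'error|fail' /var/log/ovirt-hosted-engine-ha/{agent,broker}.log | tail -n 50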
>>>
>>> I'm typing it on my phone, so the path could have a typo.
>>>
>>> Check if the following services (also typed from memory, might have to
>>> remove the 'd') are running:
>>> - sanlock
>>> - supervdsmd
>>> - vdsmd
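>>> e.g.:
>>> systemctl status sanlock supervdsmd vdsmd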
>>>
>>>
>>> Sometimes, some of my VGs (gluster) are not activated, so if you run
>>> hyperconverged -> you can 'vgchange -ay'.
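>>> Something like (a sketch):
>>> lvs -o vg_name,lv_name,lv_attr    # 5th attr flag 'a' = active
>>> vgchange -ay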
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>>
>>> On Tue, Feb 2, 2021 at 11:28, Roderick Mooi
>>> <roderick(a)sanren.ac.za> wrote:
>>> Hi!
>>>
>>> We had a power outage and all our servers (oVirt hosts) went down. When
>>> they started up, neither the hosted-engine nor the VMs were started.
>>>
>>> hosted-engine --vm-status
>>> says:
>>> You must run deploy first
>>>
>>> I tried running deploy with various options but ultimately got stuck at:
>>>
>>> The Host ID is already known. Is this a re-deployment on an additional
>>> host that was previously set up (Yes, No)[Yes]?
>>> ...
>>> [ ERROR ] Failed to execute stage 'Closing up': <urlopen
>>> error [Errno 113] No route to host>
>>>
>>> OR
>>>
>>> The specified storage location already contains a data domain. Is this
>>> an additional host setup (Yes, No)[Yes]? No
>>> [ ERROR ] Re-deploying the engine VM over a previously (partially)
>>> deployed system is not supported. Please clean up the storage device or
>>> select a different one and retry.
>>>
>>> NOTES:
>>> 1. This is oVirt v3.6 (legacy install, I know...)
>>> 2. We do have daily engine backups (.bak files) [till the day the power
>>> failed]
>>>
>>> Any advice/assistance appreciated.
>>>
>>> Thanks!
>>>
>>> Roderick
--
Didi