Thanks so much, this worked!
For the record/list benefit, I first put the host into maintenance and then selected
Enroll Certificate - this regenerated the certs.
(VDSM cert can be checked with: certtool -i --infile /etc/pki/vdsm/certs/vdsmcert.pem)
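
For anyone hitting the same thing, here is a minimal sketch of how the Subject CN check could be scripted. It uses openssl rather than certtool (my substitution, only because its one-line subject output is easier to parse). extract_cn is a hypothetical helper name, and comparing against `hostname -f` assumes the engine knows the host by its FQDN, which may not hold on every setup.

```shell
#!/bin/sh
# Sketch only: compare the Subject CN of the VDSM certificate with this host's name.

# extract_cn is a hypothetical helper: it pulls the CN value out of an
# "openssl x509 -noout -subject" line. It accepts both the older
# "/O=example/CN=host" layout and the newer "O = example, CN = host" layout.
extract_cn() {
    printf '%s\n' "$1" | sed -n 's/.*CN *= *\([^,/]*\).*/\1/p'
}

CERT=${CERT:-/etc/pki/vdsm/certs/vdsmcert.pem}

if [ -r "$CERT" ] && command -v openssl >/dev/null 2>&1; then
    subject=$(openssl x509 -noout -subject -in "$CERT")
    cn=$(extract_cn "$subject")
    # Assumption: the engine knows this host by its fully qualified name.
    me=$(hostname -f)
    if [ "$cn" = "$me" ]; then
        echo "OK: certificate CN matches $me"
    else
        echo "MISMATCH: certificate CN is '$cn' but this host is '$me'"
    fi
else
    echo "cannot read $CERT (or openssl is missing); nothing checked"
fi
```

A mismatch here is what copying /etc/pki/vdsm from another host would produce.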
I then took these steps on the affected (incorrectly reported) host to correct the
output of hosted-engine --vm-status:
1. hosted-engine --set-maintenance --mode=global
2. systemctl stop ovirt-ha-agent.service
3. hosted-engine --clean-metadata
4. systemctl start ovirt-ha-agent.service
5. hosted-engine --vm-status (after a minute or two - to verify that the host details are
now correct)
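
The steps above can be sketched as a small script. This is an illustration, not a tested tool: run, reset_ha_metadata, DRY_RUN and WAIT_SECS are names made up here, and the 120-second wait is only a guess at the "minute or two" from step 5. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: the five steps above as one script.
# Dry-run by default: each command is printed, not executed.
# Set DRY_RUN=0 to actually run the commands.
DRY_RUN=${DRY_RUN:-1}

# run is a hypothetical helper: echo the command, and execute it only when
# DRY_RUN=0. The script stops at the first command that fails.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

reset_ha_metadata() {
    run hosted-engine --set-maintenance --mode=global
    run systemctl stop ovirt-ha-agent.service
    run hosted-engine --clean-metadata
    run systemctl start ovirt-ha-agent.service
    # Assumption: two minutes is enough for the agent to republish its entry.
    run sleep "${WAIT_SECS:-120}"
    run hosted-engine --vm-status
}

reset_ha_metadata
```

Didi's outline further down the thread also ends with leaving global maintenance (hosted-engine --set-maintenance --mode=none). That is not included here because it was not part of the steps listed above.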
Cheers :)
On 2021/02/04 10:17, Yedidyah Bar David wrote:
> On Thu, Feb 4, 2021 at 10:09 AM Roderick Mooi <roderick(a)sanren.ac.za> wrote:
>>
>> Hi Didi!
>>
>> Ok, I started the clean metadata process and then found the real issue - I had
>> copied the certs (just /etc/pki/vdsm; other pki folders were intact) from a working
>> host (host 2) to host 1 following the re-deploy cleanup as part of the process to
>> get it online again. The problem is the cert contains the hostname (so now the cert
>> on host 1 contains as Subject CN the hostname of host 2).
>
> Right. Sorry I didn't remember that.
>
>> I found some docs on the certs for libvirt but it's not clear what I need to
>> do to correctly re-generate the vdsm certs on host 1. Can you help? PS I presume I
>> need to re-generate client certs for that host as well and copy to the engine?
>
> Easiest is to put the host to maintenance, then "Enroll Certificate" -
> IIRC this should be enough. If you want to make sure, perhaps better
> remove all certs/keys and do 'Reinstall' instead, and make sure you
> choose 'Deploy' for 'Hosted Engine'.
>
> Good luck,
>
>>
>> Appreciated,
>>
>> Roderick
>>
>>
>> On 2021/02/03 16:58, Yedidyah Bar David wrote:
>>> On Wed, Feb 3, 2021 at 4:52 PM Roderick Mooi <roderick(a)sanren.ac.za> wrote:
>>>>
>>>> Thanks,
>>>>
>>>>> I didn't check, but am pretty certain that it's not related to the
>>>>> engine db. Do you see such duplicates there as well (using the web ui
>>>>> or sql against it)? If so, fix these first. If no other means, put the
>>>>> host to maintenance and reinstall with the correct name.
>>>>
>>>> Not seeing duplicates in the web UI, only in the --vm-status. Can you
>>>> please assist me with the sql commands or reference to the database schema + where
>>>> to check? I'd like to check that first before doing anything too drastic.
>>>
>>> /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c 'select * from vds'
>>>
>>>>
>>>> Note: it only duplicated the hostname after I changed the host_id, before
>>>> that it had the correct hostname but duplicate host_id.
>>>>
>>>> PS I have a recent backup of the database (from just before), which I could
>>>> restore if you think that'll do the trick without breaking anything?
>>>>
>>>>
>>>> On 2021/02/03 16:33, Yedidyah Bar David wrote:
>>>>> On Wed, Feb 3, 2021 at 4:21 PM Roderick Mooi <roderick(a)sanren.ac.za> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> Any idea how this happened?
>>>>>>
>>>>>> Somehow related to the power being "pulled" at the wrong time?
>>>>>>
>>>>>>> Perhaps this is a backup done by emacs?
>>>>>>
>>>>>> Not sure what does it but I'm glad it did ;)
>>>>>>
>>>>>>> Please compare it to your other hosts. It should be (mostly?)
>>>>>>> identical, but make sure that host_id= is unique per host. It should
>>>>>>> match the spm host id for this host in the engine database.
>>>>>>
>>>>>> I had to restore one of my hosts (host 1) manually due to a cleanup
>>>>>> during my re-deploy attempts. I managed to do this successfully by copying the
>>>>>> missing files from another host (host 2) but the first time the host ID matched one
>>>>>> of the other hosts (which made at least hosted-engine --vm-status unhappy) [I hadn't
>>>>>> seen your email yet :(]. I subsequently corrected the host_id and rebooted the guilty
>>>>>> host. Things mostly seem to be working now except that in hosted-engine --vm-status
>>>>>> my first two hosts (the one I copied the .conf from as well as the one I copied it
>>>>>> to [without changing the ID :O]) now have the same hostname :-/ I'm assuming
>>>>>> there's a mismatch in the engine database - where/how do I fix that?
>>>>>>
>>>>>
>>>>> I didn't check, but am pretty certain that it's not related to the
>>>>> engine db. Do you see such duplicates there as well (using the web ui
>>>>> or sql against it)? If so, fix these first. If no other means, put the
>>>>> host to maintenance and reinstall with the correct name.
>>>>>
>>>>> If it's just the shared storage, you can try the following. Carefully.
>>>>> Didn't try myself. Try on a test system first.
>>>>>
>>>>> 1. Set global maintenance
>>>>>
>>>>> 2. Stop ovirt-ha-agent, ovirt-ha-broker, perhaps also vdsmd, supervdsmd
>>>>>
>>>>> 3. hosted-engine --clean-metadata --host-id=1
>>>>>
>>>>> - Perhaps even pass --force-cleanup, not sure when it's needed
>>>>>
>>>>> - Repeat for other IDs as needed
>>>>>
>>>>> 4. Start ovirt-ha-agent (I think this should start all the others, but
>>>>> make sure)
>>>>>
>>>>> 5. Wait a bit. I am pretty certain that they should recreate their
>>>>> entries in the shared storage and eventually --vm-status should look
>>>>> ok.
>>>>>
>>>>> 6. Exit global maintenance
>>>>>
>>>>> Good luck,
>>>>>
>>>>>> Appreciated! (and happy cos our cluster is almost back to normal :) )
>>>>>>
>>>>>> On 2021/02/03 11:30, Yedidyah Bar David wrote:
>>>>>>> On Wed, Feb 3, 2021 at 11:12 AM Roderick Mooi <roderick(a)sanren.ac.za> wrote:
>>>>>>>>
>>>>>>>> Hello and thanks for assisting!
>>>>>>>>
>>>>>>>> I think I may have found the problem :)
>>>>>>>>
>>>>>>>> /etc/ovirt-hosted-engine/hosted-engine.conf
>>>>>>>>
>>>>>>>> is blank.
>>>>>>>>
>>>>>>>> But I do have hosted-engine.conf~
>>>>>>>
>>>>>>> Any idea how this happened?
>>>>>>>
>>>>>>> Perhaps this is a backup done by emacs?
>>>>>>>
>>>>>>>>
>>>>>>>> Can I cp this to restore the original?
>>>>>>>
>>>>>>> Please compare it to your other hosts. It should be (mostly?)
>>>>>>> identical, but make sure that host_id= is unique per host. It should
>>>>>>> match the spm host id for this host in the engine database.
>>>>>>>
>>>>>>>>
>>>>>>>> Anything else I need to do?
>>>>>>>
>>>>>>> Not sure, but better find the root cause to make sure no other damage
>>>>>>> was done.
>>>>>>>
>>>>>>> Good luck,
>>>>>>>
>>>>>>>>
>>>>>>>> Appreciated
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2021/02/02 11:37, Strahil Nikolov wrote:
>>>>>>>>> Usually,
>>>>>>>>>
>>>>>>>>> I would start with checking the output of the
>>>>>>>>> /var/log/ovirt-hosted-engine-ha/{broker,agent}.log
>>>>>>>>>
>>>>>>>>> I'm typing it on my phone, so the path could have a typo.
>>>>>>>>>
>>>>>>>>> Check if the following services (also typed by memory, might have
>>>>>>>>> to remove the 'd') are running:
>>>>>>>>> - sanlock
>>>>>>>>> - supervdsmd
>>>>>>>>> - vdsmd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sometimes, some of my VGs (gluster) are not activated, so if you
>>>>>>>>> run hyperconverged -> you can 'vgchange -ay'.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Strahil Nikolov
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 2, 2021 at 11:28, Roderick Mooi
>>>>>>>>> <roderick(a)sanren.ac.za> wrote:
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> We had a power outage and all our servers (oVirt hosts) went down.
>>>>>>>>> When they started up neither the hosted-engine nor VMs were started.
>>>>>>>>>
>>>>>>>>> hosted-engine --vm-status
>>>>>>>>> says:
>>>>>>>>> You must run deploy first
>>>>>>>>>
>>>>>>>>> I tried running deploy with various options but ultimately get stuck at:
>>>>>>>>>
>>>>>>>>> The Host ID is already known. Is this a re-deployment on an additional host that was previously set up (Yes, No)[Yes]?
>>>>>>>>> ...
>>>>>>>>> [ ERROR ] Failed to execute stage 'Closing up': <urlopen error [Errno 113] No route to host>
>>>>>>>>>
>>>>>>>>> OR
>>>>>>>>>
>>>>>>>>> The specified storage location already contains a data domain. Is this an additional host setup (Yes, No)[Yes]? No
>>>>>>>>> [ ERROR ] Re-deploying the engine VM over a previously (partially) deployed system is not supported. Please clean up the storage device or select a different one and retry.
>>>>>>>>>
>>>>>>>>> NOTES:
>>>>>>>>> 1. This is oVirt v3.6 (legacy install, I know...)
>>>>>>>>> 2. We do have daily engine backups (.bak files) [till the day the power failed]
>>>>>>>>>
>>>>>>>>> Any advice/assistance appreciated.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Roderick
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list -- users(a)ovirt.org
>>>>>>>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>>>>>>>> Privacy Statement: https://www.ovirt.org/privacy-policy.html
>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>> List Archives:
>>>>>>>>> https://lists.ovirt.org/archives/list/users@ovirt.org/message/73VDY7KLYBK...
>>>>>>>>>