On Tue, Oct 25, 2022 at 6:27 AM Matthew J Black <matthew(a)peregrineit.net> wrote:
OK, so, with all the to-ing and fro-ing, things stand as follows (@03:15 UTC 25-Oct-2022):
- I managed to solve the "DNF Timeout" issue (see my post "Local
(Deployment) VM Can't Reach "centos-ceph-pacific" Repo") and so
simplified the deployment command to `hosted-engine --deploy`. Unfortunately this still
results in a "Host is not up" error, with the logs as per before.
- As mentioned elsewhere in this thread I uploaded the (previous) logs to Dropbox along
with a couple of other relevant(?) files:
https://www.dropbox.com/sh/eymwdy8hzn3sa7z/AACscSP2eaFfoiN-QzyeEVfaa?dl=0
- I followed the suggestion of ajude.pereira (see post in this thread) but this did not
resolve the issue.
- As per one of my other posts in this thread, digging into the logs further revealed
this issue: "Failed to authenticate session
with host 'ovirt_node_1.mynet.local': SSH authentication to
'root@ovirt_node_1.mynet.local' failed. Please verify provided credentials. Make
sure key is authorized at host"
- I also did a `hosted-engine --deploy --ansible-extra-vars=he_pause_host=true` (as per
the suggestion of Konstantin - see post in this thread) and tried to work out why ssh
wasn't working. I ssh'd into the deployment VM and then attempted to ssh back into
the deployment host (i.e. `ssh root@ovirt_node_1.mynet.local`). While I could connect, I was
asked for root's password.
Good.
I was under the impression that this was supposed to be a
"password-less" operation.
It should be.
At this point, the operation that is attempted and that fails
with the error you see in engine.log ("Failed to authenticate
session") is done in Java code, using the Java library
apache-sshd rather than the command-line ssh client. Some of the
relevant code is here:
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules...
I do not know this code well, sorry, nor the specifics of apache-sshd
vs openssh (and there are such "specifics", as can easily be seen by
looking at the engine git log).
As I do not provide the root@ovirt_node_1.mynet.local password
anywhere in the deployment script, I suspect that this is why I'm getting the
"Host is not up" error.
- To reiterate: the host's sshd_config file is configured as per the oVirt
documentation.
So am I wrong in my understanding of the password-less ssh-nature of the situation and
how the deployment script is supposed to work?
I think this should work more or less like this:
After running engine-setup, and when the engine is already up, we
fetch the public key of the engine from it, and store it in your
authorized_keys file. This is done here:
https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/hoste...
- name: Set Engine public key as authorized key without validating the TLS/SSL certificates
I do see this in your log in dropbox.
Do you see /root/.ssh/authorized_keys on the host (with a timestamp
similar to the log line)?
If so, you can try this, from the engine VM:
ssh -v -i /etc/pki/ovirt-engine/keys/engine_id_rsa ovirt_node_1.mynet.local
If this does not work, keep debugging on the openssh side until you
understand and fix the cause, for example by checking the sshd config
on the host.
If it does work, it means the issue might be due to incompatibility
between apache-sshd and openssh and/or the configuration.
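
Before suspecting apache-sshd, it may also help to confirm that the key the engine offers is really the one that ended up in authorized_keys on the host. The following is only a rough sketch: the function name key_is_authorized and the /tmp/engine.pub path are made up for illustration, and it assumes the private key path from the command above and the default /root/.ssh/authorized_keys location.

```shell
#!/bin/sh
# key_is_authorized PUBKEY_FILE AUTHORIZED_KEYS_FILE  (hypothetical helper)
# Succeeds if the base64 key blob of PUBKEY_FILE (second field of its first
# line) appears as a whole field on any line of AUTHORIZED_KEYS_FILE.
# Comparing the blob only means the key type, options and comment may differ.
key_is_authorized() {
    blob=$(awk 'NR == 1 { print $2 }' "$1")
    [ -n "$blob" ] || return 2   # no key found in PUBKEY_FILE
    awk -v b="$blob" '
        { for (i = 1; i <= NF; i++) if ($i == b) found = 1 }
        END { exit !found }
    ' "$2"
}

# Intended use (not executed here).
# On the engine VM, derive the public key from the engine private key:
#   ssh-keygen -y -f /etc/pki/ovirt-engine/keys/engine_id_rsa > /tmp/engine.pub
# Copy /tmp/engine.pub to the host, then on the host:
#   key_is_authorized /tmp/engine.pub /root/.ssh/authorized_keys && echo "key present"
#   ls -ld /root /root/.ssh /root/.ssh/authorized_keys
```

If the key is present but login still asks for a password, the ownership and permissions shown by ls are worth a look, since sshd with StrictModes enabled ignores an authorized_keys file whose path is writable by others.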
Also, does *anyone* have any pointers, suggestions, or can otherwise help me out -
thanks.
At this point, you should be able to log into the admin UI (the pause
message provides a link) and try to add the host manually. It seems
the deployment did not pause for you. This is because "host_result_up_check"
is "failed", and we pause only if it succeeded and the host was
returned with status "non_operational". Feel free to create an issue
asking that the code also pause when "host_result_up_check" is "failed";
I am not sure why it does not, perhaps we had a reason. Anyway, you can
force the code to pause after trying to add the host, but before
checking whether this worked, by passing
"--ansible-extra-vars=he_pause_host=true".
You can also check/share more of engine.log - there might be more
information prior to the failure (but as I said, I do not know this
code well).
You can try running sshd (the server) with debug info and check its
own log - the issue might be due to incompatible keys on one or both
of the sides, or something like that.
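
One way to do the sshd side of this could look roughly like the sketch below. The helper name relevant_sshd_settings and port 2222 are arbitrary choices for illustration, and the port may need to be opened in the host firewall. The sshd flags used are -T (dump the effective configuration) and -d (debug mode: stay in the foreground, log to stderr, serve a single connection).

```shell
#!/bin/sh
# relevant_sshd_settings (hypothetical helper): read `sshd -T` output on
# stdin and keep only the settings that usually decide whether root can
# log in with a key. The last keyword is spelled "...algorithms" on newer
# OpenSSH releases and "...keytypes" on older ones, so both are matched.
relevant_sshd_settings() {
    grep -Ei '^(permitrootlogin|pubkeyauthentication|authorizedkeysfile|strictmodes|pubkeyaccepted(algorithms|keytypes)) '
}

# Intended use on the host, as root (not executed here):
#   /usr/sbin/sshd -T | relevant_sshd_settings
#   /usr/sbin/sshd -d -p 2222     # one-shot debug server on a spare port
# Then, from the engine VM, connect to that debug server:
#   ssh -v -p 2222 -i /etc/pki/ovirt-engine/keys/engine_id_rsa root@ovirt_node_1.mynet.local
```

The debug server prints the reason for rejecting a key (wrong permissions, key type not accepted, key not found) on its own terminal, which the client side never sees. Note that this still tests openssh against openssh, so a clean result here points back at the engine's apache-sshd side.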
Sorry that I do not remember if you wrote this before - is this your
first attempt to install oVirt? If so, perhaps try first to start with
a clean host, without any custom configuration (e.g. of sshd), and see
if this works for you. If you do have access to a successful setup,
you can more easily compare.
Good luck and best regards,
--
Didi