oVirt 4.4: Self-hosted engine deployment fails with backup restore from 4.3 engine

Hi there, I'm a bit puzzled about an possible upgrade paths from a 4.3 cluster to version 4.4 in a self-hosted engine environment. My idea was: Set up a new host with a clean ovirt node 4.4 installation, then deploy the hosted engine on this with a restored backup from the production cluster and go from there. This however fails with the following error: 2020-05-27 00:17:08,886+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 {'msg': 'non-zero return code', 'cmd': ['engine-setup', '--accept-defaults', '--config-append=/root/ovirt-engine-answers'], 'stdout': "[ INFO ] Stage: Initializing\n[ INFO ] Stage: Environment setup\n C onfiguration files: /etc/ovirt-engine-setup.conf.d/10-packaging-jboss.conf, /etc/ovirt-engine-setup.conf.d/10-packaging.conf, /etc/ovirt-engine-setup.conf.d/20-setup-ovirt-post.conf, /root/ovirt-engine-answers\n Log file: /var/log/ovirt-engine/setup/ovirt-engine-setup-20200527001657-fyeueu.log\n Version: otop i-1.9.1 (otopi-1.9.1-1.el8)\n[ INFO ] DNF Downloading 1 files, 0.00KB\n[ INFO ] DNF Downloaded CentOS-8 - AppStream\n[ INFO ] DNF Downloading 1 files, 0.00KB\n[ INFO ] DNF Downloaded CentOS-8 - Base\n[ INFO ] DNF Downloading 1 files, 0.00KB\n [...] ... anwsers from backup config follow .... [...] 2020-05-27 00:17:12,396+0200 DEBUG otopi.context context._executeMethod:145 method exception Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/otopi/context.py", line 132, in _executeMethod method['method']() File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-ansiblesetup/core/misc.py", line 403, in _closeup r = ah.run() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/ansible_utils.py", line 229, in run raise RuntimeError(_('Failed executing ansible-playbook')) Is this approach (restoring from 4.3) generally supposed to work? If not, what is the appropriate upgrade path? Thank you! Regards Oli

On Wed, May 27, 2020 at 9:09 AM Oliver Leinfelder <oliver.leinfelder@gmail.com> wrote:
Hi there,
I'm a bit puzzled about an possible upgrade paths from a 4.3 cluster to version 4.4 in a self-hosted engine environment.
My idea was:
Set up a new host with a clean ovirt node 4.4 installation, then deploy the hosted engine on this with a restored backup from the production cluster and go from there.
Generally speaking, this is the intended approach.
This however fails with the following error:
2020-05-27 00:17:08,886+0200 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:103 {'msg': 'non-zero return code', 'cmd': ['engine-setup', '--accept-defaults', '--config-append=/root/ovirt-engine-answers'], 'stdout': "[ INFO ] Stage: Initializing\n[ INFO ] Stage: Environment setup\n C onfiguration files: /etc/ovirt-engine-setup.conf.d/10-packaging-jboss.conf, /etc/ovirt-engine-setup.conf.d/10-packaging.conf, /etc/ovirt-engine-setup.conf.d/20-setup-ovirt-post.conf, /root/ovirt-engine-answers\n Log file: /var/log/ovirt-engine/setup/ovirt-engine-setup-20200527001657-fyeueu.log\n Version: otop i-1.9.1 (otopi-1.9.1-1.el8)\n[ INFO ] DNF Downloading 1 files, 0.00KB\n[ INFO ] DNF Downloaded CentOS-8 - AppStream\n[ INFO ] DNF Downloading 1 files, 0.00KB\n[ INFO ] DNF Downloaded CentOS-8 - Base\n[ INFO ] DNF Downloading 1 files, 0.00KB\n [...] ... anwsers from backup config follow .... [...]
You should also see one or more ERROR messages, can you check/post them? Also, if possible, please try to check/share the engine-setup log. If you can access the engine VM, it's there, in: /var/log/ovirt-engine/setup Otherwise, you might find it in the host doing the deployment, in: /var/log/ovirt-hosted-engine-setup/engine-logs*
2020-05-27 00:17:12,396+0200 DEBUG otopi.context context._executeMethod:145 method exception Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/otopi/context.py", line 132, in _executeMethod method['method']() File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/gr-he-ansiblesetup/core/misc.py", line 403, in _closeup r = ah.run() File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/ansible_utils.py", line 229, in run raise RuntimeError(_('Failed executing ansible-playbook'))
Is this approach (restoring from 4.3) generally supposed to work? If not, what is the appropriate upgrade path?
Yes. Thanks and best regards, -- Didi

Hi there,
You should also see one or more ERROR messages, can you check/post them?
There is one error message that immediately follows, if that helps: 2020-05-27 00:17:12,397+0200 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed executing ansible-playbook 2020-05-27 00:17:12,397+0200 DEBUG otopi.context context.dumpEnvironment:765 ENVIRONMENT DUMP - BEGIN 2020-05-27 00:17:12,398+0200 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/error=bool:'True' 2020-05-27 00:17:12,398+0200 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/exceptionInfo=list:'[(<class 'RuntimeError'>, RuntimeError('Failed executing ansible-playbook',), <traceback object at 0x7f9bbdea5cc8>)]' 2020-05-27 00:17:12,398+0200 DEBUG otopi.context context.dumpEnvironment:779 ENVIRONMENT DUMP - END 2020-05-27 00:17:12,398+0200 INFO otopi.context context.runSequence:616 Stage: Clean up 2020-05-27 00:17:12,399+0200 DEBUG otopi.context context.runSequence:620 STAGE cleanup Other than that, there is nothing that looks like an error (or contains the word "error").
Also, if possible, please try to check/share the engine-setup log. If you can access the engine VM, it's there, in:
/var/log/ovirt-engine/setup
The engine is not running after that, "hosted-engine --vm-status" gives me the following error: It seems like a previous attempt to deploy hosted-engine failed or it's still in progress. Please clean it up before trying again
Otherwise, you might find it in the host doing the deployment, in:
/var/log/ovirt-hosted-engine-setup/engine-logs* I have 4 directories like this (from my failed deployment attempts ;-)), but all of them are empty.
The last attempt was with a new backup, just in case. The production oVirt is 4.3.9, the host I'm installing from is a clean install from the ovirt node 4.4.0 release ISO with the last available package upgrades. Best regards Oliver

On Wed, May 27, 2020 at 12:37 PM Oliver Leinfelder <oliver.leinfelder@gmail.com> wrote:
Hi there,
You should also see one or more ERROR messages, can you check/post them?
There is one error message that immediately follows, if that helps:
2020-05-27 00:17:12,397+0200 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed executing ansible-playbook 2020-05-27 00:17:12,397+0200 DEBUG otopi.context context.dumpEnvironment:765 ENVIRONMENT DUMP - BEGIN 2020-05-27 00:17:12,398+0200 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/error=bool:'True' 2020-05-27 00:17:12,398+0200 DEBUG otopi.context context.dumpEnvironment:775 ENV BASE/exceptionInfo=list:'[(<class 'RuntimeError'>, RuntimeError('Failed executing ansible-playbook',), <traceback object at 0x7f9bbdea5cc8>)]' 2020-05-27 00:17:12,398+0200 DEBUG otopi.context context.dumpEnvironment:779 ENVIRONMENT DUMP - END 2020-05-27 00:17:12,398+0200 INFO otopi.context context.runSequence:616 Stage: Clean up 2020-05-27 00:17:12,399+0200 DEBUG otopi.context context.runSequence:620 STAGE cleanup
Other than that, there is nothing that looks like an error (or contains the word "error").
Perhaps it's not engine-setup that failed, but something later on in ansible code. Can you check/share the full ansible logs? Search also for 'failed', or 'fatal'.
Also, if possible, please try to check/share the engine-setup log. If you can access the engine VM, it's there, in:
/var/log/ovirt-engine/setup
The engine is not running after that, "hosted-engine --vm-status" gives me the following error:
It seems like a previous attempt to deploy hosted-engine failed or it's still in progress. Please clean it up before trying again
Otherwise, you might find it in the host doing the deployment, in:
/var/log/ovirt-hosted-engine-setup/engine-logs* I have 4 directories like this (from my failed deployment attempts ;-)), but all of them are empty.
I happened to see that myself, but still not sure why it happens. Perhaps open a bug to track this? In any case (perhaps not relevant to you right now, if indeed engine-setup succeeded), usually the engine vm is left running at the end of a failed deploy. If it's still the local vm, you can find its IP address by searching the ansible logs for local_vm_ip, then you can ssh to it from the host. For fixing the "empty engine-logs dirs", now pushed this: https://github.com/oVirt/ovirt-ansible-hosted-engine-setup/pull/325 Didn't test yet, it's just a guess.
The last attempt was with a new backup, just in case.
The production oVirt is 4.3.9, the host I'm installing from is a clean install from the ovirt node 4.4.0 release ISO with the last available package upgrades.
Sounds ok to me. Best regards, -- Didi

Hi, Yedidyah Bar David wrote:
In any case (perhaps not relevant to you right now, if indeed engine-setup succeeded), usually the engine vm is left running at the end of a failed deploy. If it's still the local vm, you can find its IP address by searching the ansible logs for local_vm_ip, then you can ssh to it from the host.
For fixing the "empty engine-logs dirs", now pushed this:
https://github.com/oVirt/ovirt-ansible-hosted-engine-setup/pull/325
Didn't test yet, it's just a guess.
I followed up on your remark that the engine may indeed be running. And it is, sorry for not seeing this earlier. Anyway, I was thus able to take a look in /var/log/ovirt-engine/setup in the HE VM and found the following error (I found a couple of more "suspicious" lines, but this one sticks out). 2020-05-27 00:17:09,660+0200 DEBUG otopi.context context._executeMethod:145 method exception Traceback (most recent call last): File "/usr/lib64/python3.6/site-packages/M2Crypto/BIO.py", line 279, in openfile f = open(filename, mode) FileNotFoundError: [Errno 2] No such file or directory: '/etc/pki/ovirt-engine/qemu-ca.pem' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/otopi/context.py", line 132, in _executeMethod method['method']() File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/pki/ca.py", line 699, in _miscUpgrade if self._expired(self._x509_load_cert(ca_file)): File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/pki/ca.py", line 94, in _x509_load_cert res = X509.load_cert(f) File "/usr/lib64/python3.6/site-packages/M2Crypto/X509.py", line 802, in load_cert with BIO.openfile(file) as bio: File "/usr/lib64/python3.6/site-packages/M2Crypto/BIO.py", line 281, in openfile raise BIOError(ex.args) M2Crypto.BIO.BIOError: (2, 'No such file or directory') 2020-05-27 00:17:09,663+0200 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Misc configuration': (2, 'No such file or directory') Best regards Oli

On Wed, May 27, 2020 at 2:33 PM Oliver Leinfelder <oliver.leinfelder@gmail.com> wrote:
Hi,
Yedidyah Bar David wrote:
In any case (perhaps not relevant to you right now, if indeed engine-setup succeeded), usually the engine vm is left running at the end of a failed deploy. If it's still the local vm, you can find its IP address by searching the ansible logs for local_vm_ip, then you can ssh to it from the host.
For fixing the "empty engine-logs dirs", now pushed this:
https://github.com/oVirt/ovirt-ansible-hosted-engine-setup/pull/325
Didn't test yet, it's just a guess.
I followed up on your remark that the engine may indeed be running. And it is, sorry for not seeing this earlier.
Anyway, I was thus able to take a look in /var/log/ovirt-engine/setup in the HE VM and found the following error (I found a couple of more "suspicious" lines, but this one sticks out).
2020-05-27 00:17:09,660+0200 DEBUG otopi.context context._executeMethod:145 method exception Traceback (most recent call last): File "/usr/lib64/python3.6/site-packages/M2Crypto/BIO.py", line 279, in openfile f = open(filename, mode) FileNotFoundError: [Errno 2] No such file or directory: '/etc/pki/ovirt-engine/qemu-ca.pem'
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/otopi/context.py", line 132, in _executeMethod method['method']() File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/pki/ca.py", line 699, in _miscUpgrade if self._expired(self._x509_load_cert(ca_file)): File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/pki/ca.py", line 94, in _x509_load_cert res = X509.load_cert(f) File "/usr/lib64/python3.6/site-packages/M2Crypto/X509.py", line 802, in load_cert with BIO.openfile(file) as bio: File "/usr/lib64/python3.6/site-packages/M2Crypto/BIO.py", line 281, in openfile raise BIOError(ex.args) M2Crypto.BIO.BIOError: (2, 'No such file or directory') 2020-05-27 00:17:09,663+0200 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Misc configuration': (2, 'No such file or directory')
I think I know (it's hard to tell without more logs, but anyway): It's because your PKI was expired and thus renewed. If you used the command line to restore/deploy, you were also asked: 'Renew engine CA on restore if needed? Please notice ' 'that if you choose Yes, all hosts will have to be ' 'later manually reinstalled from the engine. ' '(@VALUES@)[@DEFAULT@]: ' and probably replied Yes. You have two options: 1. Try again, and reply No. 2. Run first engine-setup (can add --offline to prevent it from upgrading) on your old engine. You should be prompted there, and reply Yes, and then take a backup after it finishes and try again to restore with that backup. In any case, that's a bug. Can you please open one? Preferably after verifying my guesses? Thanks and best regards, -- Didi

Hi,
I think I know (it's hard to tell without more logs, but anyway):
It's because your PKI was expired and thus renewed. If you used the command line to restore/deploy, you were also asked:
'Renew engine CA on restore if needed? Please notice ' 'that if you choose Yes, all hosts will have to be ' 'later manually reinstalled from the engine. ' '(@VALUES@)[@DEFAULT@]: '
and probably replied Yes.
You have two options:
1. Try again, and reply No. 2. Run first engine-setup (can add --offline to prevent it from upgrading) on your old engine. You should be prompted there, and reply Yes, and then take a backup after it finishes and try again to restore with that backup.
In any case, that's a b
your guess was right, I think (btw: I check the ca.pem in the old HE - this one is valid til 2028). I took the easy way and replied with "No". I will open a bug (my first one :-)). Is this one for the ovirt-hosted-engine-setup category? Anyway, setup ran much farther but still did not complete. It fails after this error now: [ ERROR ] ovirtsdk4.AuthError: Error during SSO authentication server_error : PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target [ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 50, "changed": false, "msg": "Error during SSO authentication server_error : PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target"} I can see this in /var/log/engine.log in the HE. 2020-05-27 16:10:43,695+02 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-8) [] OAuthException server_error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target 2020-05-27 16:10:53,962+02 INFO [org.ovirt.engine.extension.aaa.jdbc.core.Authentication] (default task-8) [] locking user: admin due to interval failures 2020-05-27 16:10:58,956+02 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-8) [] OAuthException server_error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target 2020-05-27 16:11:09,222+02 INFO [org.ovirt.engine.extension.aaa.jdbc.core.Authentication] (default task-8) [] locking user: admin due to interval failures 2020-05-27 16:11:14,217+02 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-8) [] OAuthException server_error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target 2020-05-27 16:11:24,484+02 INFO [org.ovirt.engine.extension.aaa.jdbc.core.Authentication] (default task-8) [] locking user: admin due to interval failures 2020-05-27 16:11:29,480+02 ERROR [org.ovirt.engine.core.sso.utils.SsoUtils] (default task-8) [] OAuthException server_error: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target I tried to dig around a bit more in /var/log of the HE to get more details but can't seem to find anything there. Thanks in advance! Best regards Oli

Hi Oliver How did you defeat this problem? I have a similar problem https://lists.ovirt.org/archives/list/users@ovirt.org/message/NTJHPQ5EBUEJ6U...
participants (3)
-
Ilya Fedotov
-
Oliver Leinfelder
-
Yedidyah Bar David