Publish master more often
by Yedidyah Bar David
Hi all,
Right now, when we merge a patch e.g. to the engine (and many other
projects), it can take up to several days until it is used by the
hosted-engine ovirt-system-tests suite. Something similar will happen
soon if/when we introduce suites that use ovirt-node.
If I got it right:
- Merge causes CI to build the engine - immediately, takes ~ 1 hour (say)
- A publisher job [1] publishes it to resources.ovirt.org (daily,
midnight (UTC))
- The next run of an appliance build [2] includes it (daily, afternoon)
- The next run of the publisher [1] publishes the appliance (daily, midnight)
- The next run of ost-images [3] includes the appliance (daily,
midnight, 2 hours after the publisher) (and publishes it immediately)
- The next run of OST (e.g. [4]) will use it (daily, slightly *before*
ost-images, but I guess we can change that. This does not affect
manual runs of OST, so it can probably be ignored in the calculation, at
least to some extent).
So if I got it right, a patch merged to the engine on some morning
will be used by the nightly run of OST HE only after almost 3 days,
and will be available for manual runs after 2 days. IMO that's too much time.
I might be somewhat wrong, but not by much, I think.
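The arithmetic behind "almost 3 days" can be sketched as follows, assuming a patch merged at 09:00 on day 0 and the (approximate) daily schedules from the list above: publisher at midnight, appliance build at 14:00, ost-images at 02:00. The merge time and helper are mine; hours are counted from day-0 midnight.

```python
def next_run(t, hour):
    """First daily run scheduled at `hour` (0-23) strictly after time t (in hours)."""
    run = (t // 24) * 24 + hour
    while run <= t:
        run += 24
    return run

merge = 9                                           # day 0, 09:00
build_done = merge + 1                              # engine CI build, ~1 hour
rpms_published = next_run(build_done, 0)            # day 1, midnight
appliance_built = next_run(rpms_published, 14)      # day 1, 14:00
appliance_published = next_run(appliance_built, 0)  # day 2, midnight
ost_image_ready = next_run(appliance_published, 2)  # day 2, 02:00

# Available for manual OST runs ~41 hours after the merge; the next
# nightly OST (which runs just before ost-images) only picks it up
# roughly a day later still.
print(ost_image_ready - merge)  # 41
```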
One partial solution is to add automation .repos lines to relevant
projects that link to the lastSuccessfulBuild (let's call it lastSB)
of the more important projects they consume - e.g. the appliance would
use lastSB of engine+dwh+a few others, node would use lastSB of vdsm, etc.
This would require more maintenance (adding/removing/fixing projects as
needed) and put some more load on CI (as packages would then be
downloaded from it instead of from resources.ovirt.org).
Another solution is to run the relevant jobs (publisher/appliance/node)
far more often - say, once every two hours. This would also add load,
and might cause "perceived" instability, as things would likely
fluctuate between green and red more often.
I think I prefer the latter. What do you think?
Thanks and best regards,
[1] https://jenkins.ovirt.org/job/ovirt_master_publish-rpms_nightly/
[2] https://jenkins.ovirt.org/job/ovirt-appliance_master_build-artifacts-el8-...
[3] https://jenkins.ovirt.org/job/ost-images_master_standard-poll-upstream-so...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
--
Didi
test_verify_engine_certs (was: [oVirt Jenkins] ovirt-system-tests_basic-suite-master_nightly - Build # 894 - Failure!)
by Yedidyah Bar David
On Mon, Feb 22, 2021 at 3:12 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
>
> Project: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/
> Build: https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night...
> Build Number: 894
> Build Status: Failure
> Triggered By: Started by timer
>
> -------------------------------------
> Changes Since Last Success:
> -------------------------------------
> Changes for Build #894
> [Andrej Cernek] ost_utils: Remove explicit object inheritance
>
>
>
>
> -----------------
> Failed Tests:
> -----------------
> 1 tests failed.
> FAILED: basic-suite-master.test-scenarios.test_002_bootstrap.test_verify_engine_certs[CA certificate]
>
> Error Message:
> ost_utils.shell.ShellError: Command failed with rc=1. Stdout: Stderr: unable to load certificate 139734854465344:error:0909006C:PEM routines:get_name:no start line:crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
>
> Stack Trace:
> key_format = 'X509-PEM-CA'
> verification_fn = <function <lambda> at 0x7f6aab2add90>, engine_fqdn = 'engine'
> engine_download = <function engine_download.<locals>.download at 0x7f6aa98d5ea0>
>
> @pytest.mark.parametrize("key_format, verification_fn", [
> pytest.param(
> 'X509-PEM-CA',
> lambda path: shell.shell(["openssl", "x509", "-in", path, "-text", "-noout"]),
> id="CA certificate"
> ),
> pytest.param(
> 'OPENSSH-PUBKEY',
> lambda path: shell.shell(["ssh-keygen", "-l", "-f", path]),
> id="ssh pubkey"
> ),
> ])
> @order_by(_TEST_LIST)
> def test_verify_engine_certs(key_format, verification_fn, engine_fqdn,
> engine_download):
> url = 'http://{}/ovirt-engine/services/pki-resource?resource=ca-certificate&format={}'
I guess (I didn't check, only looked at the engine git log) that this
is a result of [1].
Anyone looking at this?
This is trying to download the engine CA cert via http, and then do
some verification on it.
Generally speaking, this is a chicken-and-egg problem: you can't
securely download a CA cert if you need that cert to securely download it.
For OST, it might be easy to fix by s/http/https/ and perhaps passing
some param to make it not check certs over https. But I find it quite
reasonable that others are doing similar things and will now be broken
by this change [1]. If so, we might decide that this is "by design" -
that whoever gets broken should fix their stuff one way or another
(like OST above, or via safer means if possible/relevant, such as
using ssh to securely connect to the engine machine and then get the
cert from there somehow (do we have an API for this?)). Or we can
decide that it's an engine bug - that [1] should have allowed this
specific URL to bypass HSTS.
[1] https://gerrit.ovirt.org/c/ovirt-engine/+/113508
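The s/http/https/ idea above could look roughly like this: fetch the CA cert over https while skipping certificate verification, since we cannot verify yet (we are downloading the CA cert itself). The URL shape is the one from the test; the helper names are mine, a sketch only.

```python
import ssl
import urllib.request

def ca_cert_url(engine_fqdn, key_format="X509-PEM-CA"):
    # Same pki-resource URL as in the test, but over https.
    return ("https://{}/ovirt-engine/services/pki-resource"
            "?resource=ca-certificate&format={}").format(engine_fqdn, key_format)

def insecure_context():
    # Chicken-and-egg: we have no CA cert yet, so disable verification.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch_ca_cert(engine_fqdn):
    with urllib.request.urlopen(ca_cert_url(engine_fqdn),
                                context=insecure_context()) as resp:
        return resp.read().decode("utf-8")
```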
>
> with http_proxy_disabled(), tempfile.NamedTemporaryFile() as tmp:
> engine_download(url.format(engine_fqdn, key_format), tmp.name)
> try:
> > verification_fn(tmp.name)
>
> ../basic-suite-master/test-scenarios/test_002_bootstrap.py:292:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ../basic-suite-master/test-scenarios/test_002_bootstrap.py:275: in <lambda>
> lambda path: shell.shell(["openssl", "x509", "-in", path, "-text", "-noout"]),
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>
> args = ['openssl', 'x509', '-in', '/tmp/tmpnj42cxm2', '-text', '-noout']
> bytes_output = False, kwargs = {}
> process = <subprocess.Popen object at 0x7f6aa98143c8>, out = ''
> err = 'unable to load certificate\n139734854465344:error:0909006C:PEM routines:get_name:no start line:crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE\n'
>
> def shell(args, bytes_output=False, **kwargs):
> process = subprocess.Popen(args,
> stdout=subprocess.PIPE,
> stderr=subprocess.PIPE,
> **kwargs)
> out, err = process.communicate()
>
> if not bytes_output:
> out = out.decode("utf-8")
> err = err.decode("utf-8")
>
> if process.returncode:
> > raise ShellError(process.returncode, out, err)
> E ost_utils.shell.ShellError: Command failed with rc=1. Stdout:
> E
> E Stderr:
> E unable to load certificate
> E 139734854465344:error:0909006C:PEM routines:get_name:no start line:crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
(As I said, I didn't check myself - I suppose that HSTS causes httpd to
return some kind of redirect, and this is the way openssl fails when
we feed it this redirect instead of a cert.)
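One way to check this hypothesis without openssl: fetch the URL and look at what actually comes back. A PEM certificate must begin with a "-----BEGIN ...-----" line (the "no start line" in the openssl error means exactly that marker is missing), while an HSTS-induced redirect would show up as an HTTP 3xx status. Sketch only; the probe helper and its names are mine.

```python
import http.client

def looks_like_pem(data):
    """True if `data` begins with the PEM 'start line' openssl expects."""
    return data.lstrip().startswith("-----BEGIN")

def probe(engine_fqdn, path):
    # Plain http, and http.client does not follow redirects, so a 3xx
    # from HSTS-enforcing httpd would be visible directly.
    conn = http.client.HTTPConnection(engine_fqdn)
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read().decode("utf-8", errors="replace")
    if 300 <= resp.status < 400:
        return "redirect to {}".format(resp.getheader("Location"))
    return "PEM cert" if looks_like_pem(body) else "not a cert"
```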
Best regards,
--
Didi
test_incremental_backup_vm2 failed (was: Change in ovirt-system-tests[master]: WIP: Unite basic-suite-master 002_ and 004_ with HE)
by Yedidyah Bar David
Hi all,
On Thu, Feb 18, 2021 at 11:17 AM Code Review <gerrit(a)ovirt.org> wrote:
>
> From Jenkins CI <jenkins(a)ovirt.org>:
>
> Jenkins CI has posted comments on this change. ( https://gerrit.ovirt.org/c/ovirt-system-tests/+/113452 )
This patch ^^^ makes the 002_ and 004_ test modules identical between
basic-suite and he-basic-suite.
>
> Change subject: WIP: Unite basic-suite-master 002_ and 004_ with HE
> ......................................................................
>
>
> Patch Set 29: Continuous-Integration-1
>
> Build Failed
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/15615/ : FAILURE
CI ran 3 suites on it. basic-suite and ansible-suite passed;
he-basic-suite failed [1] with this in engine.log [2]:
========================================================================
2021-02-18 10:06:33,389+01 INFO
[org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-81)
[test_incremental_backup] Exception in invoking callback of command
RedefineVmCheckpoint (20051278-ed39-40e5-9350-c47bd51fd6c0):
ClassCastException: class
org.ovirt.engine.core.bll.RedefineVmCheckpointCommand cannot be cast
to class org.ovirt.engine.core.bll.SerialChildExecutingCommand
(org.ovirt.engine.core.bll.RedefineVmCheckpointCommand and
org.ovirt.engine.core.bll.SerialChildExecutingCommand are in unnamed
module of loader 'deployment.engine.ear.bll.jar' @176ed0f7)
2021-02-18 10:06:33,389+01 ERROR
[org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-81)
[test_incremental_backup] Error invoking callback method 'doPolling'
for 'ACTIVE' command '20051278-ed39-40e5-9350-c47bd51fd6c0'
2021-02-18 10:06:33,390+01 ERROR
[org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-81)
[test_incremental_backup] Exception: java.lang.ClassCastException:
class org.ovirt.engine.core.bll.RedefineVmCheckpointCommand cannot be
cast to class org.ovirt.engine.core.bll.SerialChildExecutingCommand
(org.ovirt.engine.core.bll.RedefineVmCheckpointCommand and
org.ovirt.engine.core.bll.SerialChildExecutingCommand are in unnamed
module of loader 'deployment.engine.ear.bll.jar' @176ed0f7)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:29)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:360)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:511)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at org.glassfish.javax.enterprise.concurrent//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:227)
========================================================================
Any idea?
I also now see that basic-suite fails [3], but on a different test -
test_import_floating_disk. Not sure that's related.
Thanks and best regards,
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/156...
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/156...
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night...
>
>
> --
> To view, visit https://gerrit.ovirt.org/c/ovirt-system-tests/+/113452
> To unsubscribe, or for help writing mail filters, visit https://gerrit.ovirt.org/settings
>
> Gerrit-Project: ovirt-system-tests
> Gerrit-Branch: master
> Gerrit-Change-Id: Ied836cda6b622dbebdb869e0b83fa4c0e0b7ca2c
> Gerrit-Change-Number: 113452
> Gerrit-PatchSet: 29
> Gerrit-Owner: Yedidyah Bar David <didi(a)redhat.com>
> Gerrit-Reviewer: Anton Marchukov <amarchuk(a)redhat.com>
> Gerrit-Reviewer: Dafna Ron <dron(a)redhat.com>
> Gerrit-Reviewer: Dusan Fodor <dfodor(a)redhat.com>
> Gerrit-Reviewer: Gal Ben Haim <galbh2(a)gmail.com>
> Gerrit-Reviewer: Galit Rosenthal <grosenth(a)redhat.com>
> Gerrit-Reviewer: Jenkins CI <jenkins(a)ovirt.org>
> Gerrit-Reviewer: Name of user not set #1001916
> Gerrit-Reviewer: Yedidyah Bar David <didi(a)redhat.com>
> Gerrit-Reviewer: Zuul CI <zuul(a)ovirt.org>
> Gerrit-Comment-Date: Thu, 18 Feb 2021 09:17:18 +0000
> Gerrit-HasComments: No
> Gerrit-Has-Labels: Yes
> Gerrit-MessageType: comment
>
--
Didi
Support for SSH keys other than RSA
by Artur Socha
Hi,
I have been recently working on adding support for SSH keys other than
RSA (communication between ovirt-engine and hosts(VDS-es)).
The entire effort is tracked in Bugzilla [1].
There are a couple of important changes I would like to share with you.
The first and most important is a change in the way the connection is
verified. Previously, fingerprints (by default SHA-256, unless changed
via configuration) were used to verify whether the connection between
the engine and the host could be established. Now public keys are
compared instead (with one exception for backward compatibility).
For backward compatibility, i.e. for previously added (legacy) hosts
with a fingerprint calculated from the RSA public key (where the key is
not stored in the DB), the verification is done as before, that is, we
compare fingerprints only. After an upgrade, the whole setup is
expected to work without any manual intervention.
However, there are a couple of options to 'migrate' a legacy
fingerprint to whatever the SSH server on the host considers strongest:
1) In the database, remove the sshkeyfingerprint value, i.e.:
update vds_static set sshkeyfingerprint='' where vds_id = 'PUT_HERE_HOST_ID'
2) REST: prepare a request with a blank fingerprint for 'legacy' hosts.
Please see the documentation [2]. The fingerprint and public key will
be re-entered.
3) Reinstall the host / install a new host.
4) Manually deploy the key and update the host's
vds_static.sshkeyfingerprint and vds_static.public_key.
On the engine's UI side there is still a way to fetch fingerprints (on
the 'New Host' panel), but we anticipate that soon there will be a
public key (OpenSSH format) there instead.
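For reference, the kind of SHA-256 fingerprint being compared here can be derived from an OpenSSH public key line as below. This is a generic sketch, not engine code; it matches the format `ssh-keygen -lf` prints ("SHA256:" plus a base64 digest with padding stripped).

```python
import base64
import hashlib

def sha256_fingerprint(pubkey_line):
    """'ssh-ed25519 AAAA... comment' -> 'SHA256:<base64 of sha256(key blob)>'."""
    # The second field of an OpenSSH public key line is the base64 key blob.
    blob = base64.b64decode(pubkey_line.split()[1])
    digest = base64.b64encode(hashlib.sha256(blob).digest()).decode()
    return "SHA256:" + digest.rstrip("=")
```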
Please let me know if you have any questions or doubts, or if you
encounter any issues in this area.
The patches (referenced in the BZ [1]) have been merged into master,
and this feature is expected to ship with the 4.4.5 upstream release.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1837221
[2]
https://jenkins.ovirt.org/job/ovirt-engine-api-model_standard-check-patch...
best,
Artur
ansible: Unsupported parameters found in auth
by Yedidyah Bar David
Hi all,
he-basic-suite fails [1] with:
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Ensure that the
target datacenter is present]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg":
"Unsupported parameters for (ovirt_datacenter) module: compress,
timeout found in auth. Supported parameters include: ca_file, headers,
hostname, insecure, kerberos, password, token, url, username"}
Any clue?
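The error message itself lists which keys the module's `auth` parameter accepts, and `compress` and `timeout` are not among them. One generic way to work around such errors (a sketch, using the supported-key list from the error above) is to drop everything the module does not accept before passing `auth` on:

```python
# Keys accepted by the module's `auth` parameter, per the error message above.
SUPPORTED_AUTH_KEYS = {
    "ca_file", "headers", "hostname", "insecure",
    "kerberos", "password", "token", "url", "username",
}

def filter_auth(auth):
    """Return `auth` with unsupported keys (e.g. compress, timeout) removed."""
    return {k: v for k, v in auth.items() if k in SUPPORTED_AUTH_KEYS}
```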
Thanks,
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1918/
--
Didi
"Too many open files" in vdsm.log after 380 migrations
by Yedidyah Bar David
Hi all,
I ran a loop of [1] (from [2]). The loop succeeded for ~ 380
iterations, then failed with 'Too many open files'. First failure was:
2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not
connect to storageServer (hsm:2446)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line
2443, in connectStorageServer
conObj.connect()
File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py",
line 449, in connect
return self._mountCon.connect()
File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py",
line 171, in connect
self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line
210, in mount
cgroup=cgroup)
File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py",
line 56, in __call__
return callMethod()
File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py",
line 54, in <lambda>
**kwargs)
File "<string>", line 2, in mount
File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772,
in _callmethod
raise convert_to_error(kind, result)
OSError: [Errno 24] Too many open files
But obviously, once it did, it continued failing for this reason on
many later operations.
Is this considered a bug? Do we actively try to prevent such cases? If
so, should I open one and attach logs? Or can it be considered a
"corner case"?
Using vdsm-4.40.50.3-37.git7883b3b43.el8.x86_64 from
ost-images-el8-he-installed-1-202102021144.x86_64 .
I can also give access to the machine(s) if needed, for now.
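A quick, generic way to watch for a descriptor leak like this on the host (a sketch; it assumes the Linux /proc layout) is to track the open-fd count of the suspect process against its soft limit across iterations of the migration loop:

```python
import os
import resource

def fd_usage(pid="self"):
    """Return (open_fds, soft_limit) for the given pid (default: this process)."""
    open_fds = len(os.listdir("/proc/{}/fd".format(pid)))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft
```

Sampling this (with the vdsm or supervdsm pid) once per migration would show whether the count creeps up by a few fds per iteration, which would point at a leak rather than a limit that is simply too low.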
Thanks and best regards,
[1] https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=blob;f=he-basi...
[2] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113300
--
Didi