[ovirt-devel] Re: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-master - Build # 1974 - Still Failing!

7 Apr 2021

On 4/6/21 10:37 AM, Marcin Sobczyk wrote:
...
...
On Tue, Apr 6, 2021 at 9:24 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:
...
Hi,
On 4/6/21 7:23 AM, Yedidyah Bar David wrote:
...
...
On Mon, Apr 5, 2021 at 5:53 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1974/
FYI: This failed twice in a row (1973 and 1974), for the same reason.
I reproduced locally, looked a bit, failed to find the root cause.
When I connected to host-1's console, it was stuck in emergency after
reboot. I checked a bit, there was some error about kdump failing to
read the kernel image ( /boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64 ),
when I tried manually as root I did manage to read it. I rebooted, and
the VM came up fine. I decided to try OST again, cleaned up and ran it,
and opened a 'lago console' on the vm after it was up, but OST passed.
Tried again, passed again. Then I manually ran in CI 1975 and it passed,
and also the nightly 1976 passed. So I am going to ignore for now.
I think we need a patch to make lago/OST log consoles of all the VMs.
I might try to work on this.
Also stumbled upon this. Please take a look at
https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/114050/
On 4/6/21 9:55 AM, Yedidyah Bar David wrote:
Yes, I did notice this change and wondered if it's related...
But it's not merged yet, and still HE passed at least 4 times (two locally,
two on CI). Obviously this does not prove that the issue is fixed.
Anyway, in addition to merely fixing it (which perhaps your patch does),
I also wanted to emphasize the importance of making it easier to fix
future such cases. How did you manage to find the root cause?

My case was similar - HE suite was failing for me constantly. I noticed
host-1 drops to emergency shell, so I just 'virsh console'd inside
and went through the logs. That's when I spotted the problem with
the additional '/var/tmp' disk. I tried the fix on my machine and HE
suite started working again. Moments later I tried running HE suite
without the patch and it was successful again.
I couldn't figure out what's the real cause behind these problems,
but removing the unnecessary additional disk from host-1 seemed
to do the trick.
+1 for logging consoles of the VMs - that should help with these kinds
of problems in the future.
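
A rough sketch of how console logging might be approached, assuming the
OST VMs are plain libvirt domains: libvirt's domain XML accepts a
<log file=... append=.../> element on character devices, so a small
helper could add one to every serial device before a domain is defined.
The helper name, the log directory and the file naming below are
hypothetical, and this is not how lago or OST currently does it.

```python
import xml.etree.ElementTree as ET


def add_console_log(domain_xml: str, log_dir: str = "/var/log/ost-consoles") -> str:
    """Return the domain XML with a <log> child on every serial device.

    Hypothetical helper: log_dir and the file naming scheme are assumptions.
    """
    root = ET.fromstring(domain_xml.strip())
    name = root.findtext("name", default="unknown")
    for index, serial in enumerate(root.iterfind("./devices/serial")):
        # Skip devices that already have a log target, so the call is idempotent.
        if serial.find("log") is None:
            log = ET.SubElement(serial, "log")
            log.set("file", f"{log_dir}/{name}-serial{index}.log")
            log.set("append", "on")
    return ET.tostring(root, encoding="unicode")


# Minimal illustrative domain; the name "host-1" is only an example.
EXAMPLE = """
<domain type='kvm'>
  <name>host-1</name>
  <devices>
    <serial type='pty'>
      <target port='0'/>
    </serial>
  </devices>
</domain>
"""

if __name__ == "__main__":
    print(add_console_log(EXAMPLE))
```

With something like this in place, the console output of a host that
drops to the emergency shell would still be on disk after the run,
without anyone having to attach with 'virsh console' while it happens.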
Yesterday we hit this problem at least 2 times:

https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16183
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16184

Didi, please review the patch mentioned above. If you don't have
any objections let's merge it and work on improving logging later.

Regards, Marcin