On Tue, Apr 6, 2021 at 9:24 AM Marcin Sobczyk
<msobczyk(a)redhat.com> wrote:
> Hi,
>
> On 4/6/21 7:23 AM, Yedidyah Bar David wrote:
>> On Mon, Apr 5, 2021 at 5:53 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
>>> Project:
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
>>> Build:
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1974/
>> FYI: This failed twice in a row (1973 and 1974), for the same reason.
>> I reproduced it locally and looked a bit, but failed to find the root
>> cause. When I connected to host-1's console, it was stuck in emergency
>> mode after reboot. I checked a bit, and there was an error about kdump
>> failing to read the kernel image
>> ( /boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64 ), but when I tried manually
>> as root I did manage to read it. I rebooted, and the VM came up fine.
>> I decided to try OST again, cleaned up and ran it, and opened a
>> 'lago console' on the VM after it was up, but OST passed. Tried again,
>> passed again. Then I manually ran 1975 in CI and it passed, and the
>> nightly 1976 also passed. So I am going to ignore this for now.
>>
>> I think we need a patch to make lago/OST log the consoles of all the VMs.
>> I might try to work on this.
> Also stumbled upon this. Please take a look at
>
> https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/114050/
Yes, I did notice this change and wondered if it's related...
But it's not merged yet, and still HE passed at least 4 times (two locally,
two on CI). Obviously this does not prove that the issue is fixed.
Anyway, in addition to merely fixing it (which perhaps your patch does),
I also wanted to emphasize the importance of making such cases easier
to fix in the future. How did you manage to find the root cause?
My case was similar: the HE suite was failing for me constantly. I noticed
host-1 drops to emergency shell, so I just 'virsh console'd inside
and went through the logs. That's when I spotted the problem with
the additional '/var/tmp' disk. I tried the fix on my machine and HE
suite started working again. Moments later I tried running HE suite
without the patch and it was successful again.
I couldn't figure out the real cause behind these problems,
but removing the unnecessary additional disk from host-1 seemed
to do the trick.
+1 for logging the consoles of the VMs - that should help with this kind
of problem in the future.
Regards, Marcin