On 4/6/21 9:55 AM, Yedidyah Bar David wrote:
> On Tue, Apr 6, 2021 at 9:24 AM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>> Hi,
>>
>> On 4/6/21 7:23 AM, Yedidyah Bar David wrote:
>>> On Mon, Apr 5, 2021 at 5:53 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
>>>> Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
>>>> Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1974/
>>> FYI: This failed twice in a row (1973 and 1974), for the same reason.
>>> I reproduced locally, looked a bit, failed to find the root cause.
>>> When I connected to host-1's console, it was stuck in emergency after
>>> reboot. I checked a bit, there was some error about kdump failing to
>>> read the kernel image ( /boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64 ),
>>> when I tried manually as root I did manage to read it. I rebooted, and
>>> the VM came up fine. I decided to try OST again, cleaned up and ran it,
>>> and opened a 'lago console' on the vm after it was up, but OST passed.
>>> Tried again, passed again. Then I manually ran in CI 1975 and it passed,
>>> and also the nightly 1976 passed. So I am going to ignore for now.
>>>
>>> I think we need a patch to make lago/OST log consoles of all the VMs.
>>> I might try to work on this.
>> Also stumbled upon this. Please take a look at
>> https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/114050/
> Yes, I did notice this change and wondered if it's related...
>
> But it's not merged yet, and still HE passed at least 4 times (two locally,
> two on CI). Obviously this does not prove that the issue is fixed.
>
> Anyway, in addition to merely fixing it (which perhaps your patch does),
> I also wanted to emphasize the importance of making it easier to fix
> future such cases. How did you manage to find the root cause?
My case was similar: the HE suite was failing for me constantly. I noticed
that host-1 dropped to the emergency shell, so I connected with
'virsh console' and went through the logs. That's when I spotted the
problem with the additional '/var/tmp' disk. I tried the fix on my machine
and the HE suite started working again. Moments later I tried running the
HE suite without the patch and it was successful again.

I couldn't figure out the real cause behind these problems, but removing
the unnecessary additional disk from host-1 seemed to do the trick.

+1 for logging consoles of the VMs - that should help with these kinds
of problems in the future.

Didi, please review the patch mentioned above. If you don't have
any objections, let's merge it and work on improving logging later.

Regards,
Marcin