On Wed, Apr 7, 2021 at 12:36 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
On 4/6/21 10:37 AM, Marcin Sobczyk wrote:
>
> On 4/6/21 9:55 AM, Yedidyah Bar David wrote:
>> On Tue, Apr 6, 2021 at 9:24 AM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>>> Hi,
>>>
>>> On 4/6/21 7:23 AM, Yedidyah Bar David wrote:
>>>> On Mon, Apr 5, 2021 at 5:53 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
>>>>> Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
>>>>> Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1974/
>>>> FYI: This failed twice in a row (1973 and 1974), for the same reason.
>>>> I reproduced locally, looked a bit, failed to find the root cause.
>>>> When I connected to host-1's console, it was stuck in emergency after
>>>> reboot. I checked a bit, there was some error about kdump failing to
>>>> read the kernel image ( /boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64 ),
>>>> when I tried manually as root I did manage to read it. I rebooted, and
>>>> the VM came up fine. I decided to try OST again, cleaned up and ran it,
>>>> and opened a 'lago console' on the vm after it was up, but OST passed.
>>>> Tried again, passed again. Then I manually ran in CI 1975 and it
>>>> passed, and also the nightly 1976 passed. So I am going to ignore for
>>>> now.
>>>>
>>>> I think we need a patch to make lago/OST log consoles of all the VMs.
>>>> I might try to work on this.
>>> Also stumbled upon this. Please take a look at
>>>
>>> https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/114050/
>> Yes, I did notice this change and wondered if it's related...
>>
>> But it's not merged yet, and still HE passed at least 4 times (two locally,
>> two on CI). Obviously this does not prove that the issue is fixed.
>>
>> Anyway, in addition to merely fixing it (which perhaps your patch does),
>> I also wanted to emphasize the importance of making it easier to fix
>> future such cases. How did you manage to find the root cause?
> My case was similar - HE suite was failing for me constantly. I noticed
> host-1 drops to emergency shell, so I just 'virsh console'd inside
> and went through the logs. That's when I spotted the problem with
> the additional '/var/tmp' disk. I tried the fix on my machine and HE
> suite started working again. Moments later I tried running HE suite
> without the patch and it was successful again.
>
> I couldn't figure out what's the real cause behind these problems,
> but removing the unnecessary additional disk from host-1 seemed
> to do the trick.
>
> +1 for logging consoles of the VMs - that should help with this kind
> of problem in the future.
Yesterday we hit this problem at least 2 times:
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16183
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16184
Didi, please review the patch mentioned above. If you don't have
any objections, let's merge it and work on improving logging later.

+1 from me.

I also pushed this to log consoles, but it's not as easy as hoped:
When you have time, please see my comment there and reply...
Thanks and best regards,
--
Didi