Re: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-master - Build # 1974 - Still Failing!

On Mon, Apr 5, 2021 at 5:53 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1974/

FYI: This failed twice in a row (1973 and 1974), for the same reason. I reproduced it locally and looked into it a bit, but failed to find the root cause.

When I connected to host-1's console, it was stuck in emergency mode after reboot. There was an error about kdump failing to read the kernel image (/boot/vmlinuz-4.18.0-240.15.1.el8_3.x86_64), but when I tried manually as root I did manage to read it. I rebooted, and the VM came up fine.

I decided to try OST again, cleaned up and ran it, and opened a 'lago console' on the VM after it was up, but OST passed. Tried again, passed again. Then I manually ran 1975 in CI and it passed, and the nightly 1976 passed as well. So I am going to ignore this for now.

I think we need a patch to make lago/OST log the consoles of all the VMs. I might try to work on this.

Best regards,
--
Didi

Hi, On 4/6/21 7:23 AM, Yedidyah Bar David wrote:
I think we need a patch to make lago/OST log consoles of all the VMs. I might try to work on this.

Also stumbled upon this. Please take a look at https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/114050/

Regards, Marcin

On Tue, Apr 6, 2021 at 9:24 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Also stumbled upon this. Please take a look at https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/114050/

Yes, I did notice this change and wondered if it's related... But it's not merged yet, and still HE passed at least 4 times (two locally, two on CI). Obviously this does not prove that the issue is fixed.

Anyway, in addition to merely fixing it (which perhaps your patch does), I also wanted to emphasize the importance of making it easier to fix future such cases. How did you manage to find the root cause?

Best regards,
--
Didi

On 4/6/21 9:55 AM, Yedidyah Bar David wrote:
Anyway, in addition to merely fixing it (which perhaps your patch does), I also wanted to emphasize the importance of making it easier to fix future such cases. How did you manage to find the root cause?

My case was similar - HE suite was failing for me constantly. I noticed host-1 drops to emergency shell, so I just 'virsh console'd inside and went through the logs. That's when I spotted the problem with the additional '/var/tmp' disk. I tried the fix on my machine and HE suite started working again. Moments later I tried running HE suite without the patch and it was successful again.

I couldn't figure out what's the real cause behind these problems, but removing the unnecessary additional disk from host-1 seemed to do the trick.

+1 for logging consoles of the VMs - that should help with these kinds of problems in the future.

Regards, Marcin

On 4/6/21 10:37 AM, Marcin Sobczyk wrote:
I couldn't figure out what's the real cause behind these problems, but removing the unnecessary additional disk from host-1 seemed to do the trick.
+1 for logging consoles of the VMs - that should help with these kinds of problems in the future.

Yesterday we hit this problem at least 2 times:
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16183
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16184

Didi, please review the patch mentioned above. If you don't have any objections let's merge it and work on improving logging later.

Regards, Marcin

On Wed, Apr 7, 2021 at 12:36 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Yesterday we hit this problem at least 2 times:
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16183 https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/16184
Didi, please review the patch mentioned above. If you don't have any objections let's merge it and work on improving logging later.

+1 from me.

I also pushed this to log consoles, but it's not as easy as hoped:

https://gerrit.ovirt.org/c/lago-ost/+/114150

When you have time, please see my comment there and reply...

Thanks and best regards,
--
Didi
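
For readers following the console-logging idea: one possible approach (not necessarily what the lago-ost change above does) is to have libvirt itself write each VM's serial console to a file, using the `<log>` sub-element that libvirt's domain XML supports on character devices. The sketch below is only an illustration under that assumption. The helper name `add_console_logs`, the log directory and the sample domain XML are all hypothetical and are not taken from lago or OST.

```python
import xml.etree.ElementTree as ET


def add_console_logs(domain_xml: str, log_dir: str) -> str:
    """Return the domain XML with a <log> element on every serial device.

    Hypothetical helper for illustration; not part of lago or OST.
    """
    root = ET.fromstring(domain_xml)
    name = root.findtext("name", default="unknown")
    for index, serial in enumerate(root.findall("./devices/serial")):
        # Drop any existing <log> so calling this twice gives the same result.
        for old in serial.findall("log"):
            serial.remove(old)
        log = ET.Element(
            "log",
            {"file": f"{log_dir}/{name}-serial{index}.log", "append": "on"},
        )
        # Put <log> first, matching the order shown in libvirt's examples.
        serial.insert(0, log)
    return ET.tostring(root, encoding="unicode")


# Minimal made-up domain definition, just enough to exercise the helper.
EXAMPLE_XML = """
<domain type='kvm'>
  <name>host-1</name>
  <devices>
    <serial type='pty'>
      <target port='0'/>
    </serial>
  </devices>
</domain>
"""

if __name__ == "__main__":
    print(add_console_logs(EXAMPLE_XML, "/var/log/ost-consoles"))
```

Whether this fits the way lago defines its domains, and whether the chosen log directory is writable by libvirt's logging daemon on the CI hosts, would still need to be checked.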
participants (2)
- Marcin Sobczyk
- Yedidyah Bar David