After investigating, it looks like the issues started when this patch was merged.

Marcin, can you help debug it?

https://gerrit.ovirt.org/#/c/107399/

Thanks
Galit

On Mon, Mar 30, 2020 at 6:42 PM Martin Perina <mperina@redhat.com> wrote:


On Mon, Mar 30, 2020 at 5:38 PM Galit Rosenthal <grosenth@redhat.com> wrote:
It looks like the local repo stops running.
When I run curl before the failure, just to check the status, I can see it isn't accessible.

I'm trying to see where it fails or what causes it to fail.

I managed to reproduce it on bare metal (BM).

I thought that moving setup_storage would mitigate the issue: https://gerrit.ovirt.org/#/c/107989/
But it just postponed the error to a later phase; now adding a host fails with the same issue: Failed to download metadata for repo 'alocalsync'

https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/6710
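
For reference, the curl check described above can be scripted to run between phases and pinpoint when the repo dies. A minimal sketch in Python - the repo URL is a placeholder, as the real address depends on the Lago/OST network setup:

import urllib.request

# Placeholder URL for the local reposync repo served to the VMs;
# the real host/port depend on the Lago/OST network setup.
REPO_URL = 'http://192.168.200.1:8585/el8/repodata/repomd.xml'

def repo_is_accessible(url=REPO_URL, timeout=5):
    """Return True if the repo metadata index can be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode() == 200
    except OSError:  # URLError (and timeouts) are subclasses of OSError
        return False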

So Galit, please take a look - oVirt CQ has been suffering from this issue for more than a week now.

On Mon, Mar 30, 2020 at 6:23 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
Hi Galit

I can see the issue again - now in manual OST runs:

https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/6711/consoleFull#L2,856

Regards, Marcin

On 3/23/20 10:09 PM, Marcin Sobczyk wrote:


On 3/23/20 8:51 PM, Galit Rosenthal wrote:
I ran it locally just now, using the extra sources as it runs in the CQ, and it didn't fail for me.

I will continue to investigate tomorrow.

Marcin, did you see this issue also in check-patch or only in CQ?
I wasn't aware of the issue until Nir raised it - I was working with the patch previously
and both check-patch and manual runs were fine. I think it concerns only CQ then.

Regards,
Galit

On Mon, Mar 23, 2020 at 4:29 PM Galit Rosenthal <grosenth@redhat.com> wrote:
I will look at it.

On Mon, Mar 23, 2020 at 4:18 PM Martin Perina <mperina@redhat.com> wrote:


On Mon, Mar 23, 2020 at 3:16 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:


On 3/23/20 3:10 PM, Marcin Sobczyk wrote:
>
>
> On 3/23/20 2:53 PM, Nir Soffer wrote:
>> On Mon, Mar 23, 2020 at 3:26 PM Marcin Sobczyk <msobczyk@redhat.com>
>> wrote:
>>>
>>>
>>> On 3/23/20 2:17 PM, Nir Soffer wrote:
>>>> On Mon, Mar 23, 2020 at 1:25 PM Marcin Sobczyk
>>>> <msobczyk@redhat.com> wrote:
>>>>>
>>>>> On 3/21/20 1:18 AM, Nir Soffer wrote:
>>>>>
>>>>> On Fri, Mar 20, 2020 at 9:35 PM Nir Soffer <nsoffer@redhat.com>
>>>>> wrote:
>>>>>> Looks like an infrastructure issue setting up storage on the engine host.
>>>>>>
>>>>>> Here are 2 failing builds with unrelated changes:
>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6677/
>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6678/
>>>>> Rebuilding still fails in setup_storage:
>>>>>
>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6679/testReport/
>>>>>
>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6680/testReport/
>>>>>
>>>>>
>>>>>> Is this a known issue?
>>>>>>
>>>>>> Error Message
>>>>>>
>>>>>> AssertionError: setup_storage.sh failed. Exit code is 1
>>>>>> assert 1 == 0
>>>>>>   -1
>>>>>>   +0
>>>>>>
>>>>>> Stacktrace
>>>>>>
>>>>>> prefix = <ovirtlago.prefix.OvirtPrefix object at 0x7f6fd2b998d0>
>>>>>>
>>>>>>       @pytest.mark.run(order=14)
>>>>>>       def test_configure_storage(prefix):
>>>>>>           engine = prefix.virt_env.engine_vm()
>>>>>>           result = engine.ssh(
>>>>>>               [
>>>>>>                   '/tmp/setup_storage.sh',
>>>>>>               ],
>>>>>>           )
>>>>>>>         assert result.code == 0, 'setup_storage.sh failed. Exit code is %s' % result.code
>>>>>> E       AssertionError: setup_storage.sh failed. Exit code is 1
>>>>>> E       assert 1 == 0
>>>>>> E         -1
>>>>>> E         +0
>>>>>>
>>>>>>
>>>>>> The pytest traceback is nice, but in this case it does not
>>>>>> show any useful info.
>>>>>>
>>>>>> Since we run a script using ssh, the error message should include
>>>>>> the process stdout and stderr, which can probably explain the failure.
>>>>> I posted https://gerrit.ovirt.org/#/c/107830/ to improve logging
>>>>> during storage setup.
>>>>> Unfortunately AFAICS it didn't fail, so I guess we'll have to
>>>>> merge it and wait for a failed job to get some helpful logs.
>>>> Thanks.
>>>>
>>>> It still fails for me with current code:
>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_manual/6689/testReport/
>>>>
>>>>
>>>> Same when using current vdsm master.
>>> Updated the patch according to your suggestions and currently trying out
>>> OST for the 4th time - all previous runs succeeded. I guess I'm out of luck :)
>> It succeeds on your local OST setup but fails on Jenkins?
> No, I mean Jenkins - both check-patch runs didn't fail on this script.
> I also tried running OST manually twice and the same thing happened.
> Anyway - the patch has been merged now, so if any failure occurs in CQ
> we should know what's going on.
Ok, finally caught a failure in CQ [1]:

[2020-03-23T14:14:09.836Z]         if result.code != 0:
[2020-03-23T14:14:09.836Z]             msg = (
[2020-03-23T14:14:09.836Z]                 'setup_storage.sh failed with exit code: {}.\n'
[2020-03-23T14:14:09.836Z]                 'stdout:\n{}'
[2020-03-23T14:14:09.836Z]                 'stderr:\n{}'
[2020-03-23T14:14:09.836Z]             ).format(result.code, result.out, result.err)
[2020-03-23T14:14:09.836Z] >           raise RuntimeError(msg)
[2020-03-23T14:14:09.836Z] E           RuntimeError: setup_storage.sh failed with exit code: 1.
[2020-03-23T14:14:09.836Z] E           stdout:
[2020-03-23T14:14:09.836Z] E           Reposync & Extra Sources Content                0.0  B/s |   0  B     00:00
[2020-03-23T14:14:09.836Z] E           stderr:
[2020-03-23T14:14:09.836Z] E           + set -xe
[2020-03-23T14:14:09.836Z] E           + MAIN_NFS_DEV=disk/by-id/scsi-0QEMU_QEMU_HARDDISK_2
[2020-03-23T14:14:09.836Z] E           + ISCSI_DEV=disk/by-id/scsi-0QEMU_QEMU_HARDDISK_3
[2020-03-23T14:14:09.836Z] E           + NUM_LUNS=5
[2020-03-23T14:14:09.836Z] E           ++ uname -r
[2020-03-23T14:14:09.836Z] E           ++ awk -F. '{print $(NF-1)}'
[2020-03-23T14:14:09.836Z] E           + DIST=el8_1
[2020-03-23T14:14:09.836Z] E           + main
[2020-03-23T14:14:09.836Z] E           ++ hostname
[2020-03-23T14:14:09.836Z] E           + [[ lago-basic-suite-master-engine == *\i\p\v\6* ]]
[2020-03-23T14:14:09.836Z] E           + install_deps
[2020-03-23T14:14:09.836Z] E           + systemctl disable --now kdump.service
[2020-03-23T14:14:09.836Z] E           Removed /etc/systemd/system/multi-user.target.wants/kdump.service.
[2020-03-23T14:14:09.836Z] E           + yum install --nogpgcheck -y nfs-utils rpcbind lvm2 targetcli sg3_utils iscsi-initiator-utils lsscsi policycoreutils-python-utils
[2020-03-23T14:14:09.836Z] E           Failed to download metadata for repo 'alocalsync'
[2020-03-23T14:14:09.836Z] E           Error: Failed to download metadata for repo 'alocalsync'


[1] https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-master_change-queue-tester/detail/ovirt-master_change-queue-tester/21420/pipeline

Galit, could you please take a look?
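
For anyone trying to reproduce this on the engine VM, the failing step can be narrowed down by refreshing only that repo's metadata, mirroring what yum does right before it fails. A sketch, assuming dnf's standard makecache/--repo/--refresh options:

import subprocess

# Refresh metadata for the 'alocalsync' repo only, mirroring the step
# where 'yum install' fails in the log above.
result = subprocess.run(
    ['dnf', 'makecache', '--repo', 'alocalsync', '--refresh'],
    capture_output=True,
    text=True,
)
print('exit code:', result.returncode)
print(result.stdout)
print(result.stderr)

If this fails while the repo URL still answers curl, the problem is more likely stale or partial metadata than the web server itself.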


>
>>
>>>>>> Also I wonder why this code is called as a test
>>>>>> (test_configure_storage). This looks like a setup
>>>>>> step, so it should run as a fixture.
>>>>> That's true, but the pytest porting effort was about providing a
>>>>> bare minimum to move away from nose.
>>>>> Organizing the tests into proper setup/fixtures is a huge task and
>>>>> will probably be implemented incrementally in the near future.
>>>> Understood
>>>>
>
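
As a side note on the fixture discussion above, a minimal sketch of what the fixture-based variant could look like, assuming the existing prefix fixture is (or is made) session-scoped:

import pytest

@pytest.fixture(scope='session')
def storage(prefix):
    # Run the storage setup script once per session instead of as a test.
    engine = prefix.virt_env.engine_vm()
    result = engine.ssh(['/tmp/setup_storage.sh'])
    if result.code != 0:
        raise RuntimeError(
            'setup_storage.sh failed with exit code: {}.\n'
            'stdout:\n{}'
            'stderr:\n{}'.format(result.code, result.out, result.err)
        )

Tests that need storage would then declare the storage fixture as an argument, and pytest would order the setup automatically instead of relying on pytest.mark.run(order=14).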



--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.


--
GALIT ROSENTHAL
SOFTWARE ENGINEER
Red Hat
galit@redhat.com    T: 972-9-7692230