On Sun, Dec 18, 2016 at 7:17 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Dec 18, 2016 at 6:08 PM, Barak Korren <bkorren@redhat.com> wrote:
> On 18 December 2016 at 17:26, Nir Soffer <nsoffer@redhat.com> wrote:
>> On Sun, Dec 18, 2016 at 4:17 PM, Barak Korren <bkorren@redhat.com> wrote:
>
>> We see a lot of these errors in the rest of the log. This means something
>> is wrong with this VG.
>>
>> Needs deeper investigation from a storage developer on both the engine
>> and vdsm sides, but I would start by making sure we use clean LUNs. We
>> are not trying to test esoteric negative flows in the system tests.
>
> Here is the storage setup script:
> https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=blob;f=common/deploy-scripts/setup_storage_unified_he_extra_iscsi_el7.sh;hb=refs/heads/master

25     iscsiadm -m discovery -t sendtargets -p 127.0.0.1
26     iscsiadm -m node -L all

This is alarming. Before we serve these LUNs, we should log out
of these nodes and remove the node records.

This is non-up-to-date code (or I have to update the link). In the updated
code, where the issue also happens, we do the following as well:
    iscsiadm -m node -U all          # log out of all iSCSI nodes
    iscsiadm -m node -o delete       # delete the node records
    systemctl stop iscsi.service     # stop the initiator service
    systemctl disable iscsi.service  # keep it from starting on boot
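
If we also want to make sure the LUNs themselves are clean, something along
these lines could go into the setup script before the targets are created
(a sketch; /dev/sde is the backing device Barak mentions below, and wipefs
availability on the engine VM is an assumption):

    # wipe any stale partition/LVM signatures from the backing device
    wipefs -a /dev/sde
    # zero the first MiB as well, in case older metadata copies remain
    dd if=/dev/zero of=/dev/sde bs=1M count=1 conv=fsync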


> All storage used in the system tests comes from the engine VM itself,
> and is placed on a newly allocated QCOW2 file (exposed as /dev/sde to
> the engine VM), so it's unlikely the LUNs are not clean.

We did not change code related to getDeviceList lately; these getPV errors
tell us that there is an issue in a lower-level component or in the storage
server.
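
For context, as far as I recall vdsm's getPV boils down to running lvm's
pvs under a device filter, so the lookup can be reproduced by hand with
something like this (a sketch; the <wwid> placeholder must come from the
actual multipath device in the logs):

    pvs --config 'devices { filter = ["a|/dev/mapper/<wwid>|", "r|.*|"] }' \
        -o pv_name,vg_name,pv_uuid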

Does this test pass with an older version of vdsm? Of engine?

We did not test that. It's not easy to do in ovirt-system-tests, though I
reckon it is possible with some additional work.
Note that I suspect cold and live merge were not actually tested in
ovirt-system-tests for ages, if ever.
 

>> Did we change something in the system tests project or lago while we
>> were not looking?

Mainly the CentOS 7.2 -> CentOS 7.3 change.
 
>
> Not likely as well:
> https://gerrit.ovirt.org/gitweb?p=ovirt-system-tests.git;a=shortlog
>
> The ovirt-system-tests project has its own CI, testing against the
> last nightly (we will move it to the last build that passed the tests
> soon), so we are unlikely to merge breaking code there.

It depends on the tests.

Do you have a test that logs in to the target and creates a VG using
the LUNs?
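
I.e., something along these lines (a sketch; the portal, IQN, and device
below are placeholders, not taken from the actual suite):

    LUN=/dev/sdX   # placeholder for the device the login exposes
    iscsiadm -m discovery -t sendtargets -p 192.0.2.1
    iscsiadm -m node -T iqn.2016-12.example:target0 -p 192.0.2.1 -l
    pvcreate "$LUN"          # create a PV on the exposed LUN
    vgcreate test-vg "$LUN"  # and a VG on top of it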

> Then again
> we're not gating the OS packages so some breakage may have gone in via
> CentOS repos...

Are these failures with CentOS 7.2 or 7.3? Both?

Unsure.
 

>> Can we reproduce this issue manually with same engine and vdsm versions?
>
> You have several options:
> 1: Get engine+vdsm builds from Jenkins:
>    http://jenkins.ovirt.org/job/ovirt-engine_master_build-artifacts-fc24-x86_64/
>    http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/
>    (Getting the exact builds that went into a given OST run takes tracing
>     back the job invocation links from that run)
>
> 2: Use the latest experimental repo:
>    http://resources.ovirt.org/repos/ovirt/experimental/master/latest/rpm/el7/
>
> 3: Run lago and OST locally:
>    (as documented here:
>     http://ovirt-system-tests.readthedocs.io/en/latest/
>     you'd need to pass in the vdsm and engine packages to use)

That's what I do, on a daily basis.
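
Roughly (a sketch from memory; the clone URL and suite name should be
checked against the docs linked above):

    git clone https://gerrit.ovirt.org/ovirt-system-tests
    cd ovirt-system-tests
    # point the suite at the engine/vdsm builds under test, then:
    ./run_suite.sh basic-suite-master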
 

Do you know how to set up the system so it runs all the setup code up to
the code that causes the getPV errors?

Yes, that should be fairly easy to do.
 

We need to inspect the system at this point.
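
For example, something like this once the run is paused there (a sketch;
the VM name depends on the lago prefix):

    lago shell <host-vm-name>    # open a shell on the host VM
    # then, inside the host:
    iscsiadm -m session -P 3     # verify iSCSI sessions and attached disks
    multipath -ll                # check the multipath maps vdsm sees
    pvs; vgs                     # compare with the getPV errors in vdsm.log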

Let me know and I'll set up a live system quickly tomorrow.
Y.
 

Nir