Is the following log snippet, taken from [1], relevant to this thread:

 [36m  # resize_and_refresh_storage_domain:  [32mSuccess [0m (in 0:00:04) [0m
 [36m  # add_vm2_lease:  [0m [0m [0m
 [36m    * Collect artifacts:  [0m [0m [0m
 [36m        ~ [Thread-620] Copy from lago-basic-suite-4-3-engine:/tmp/otopi* to /dev/shm/ost/deployment-basic-suite-4.3/default/test_logs/002_bootstrap.add_vm2_lease-20190308151804/lago-basic-suite-4-3-engine/_tmp_otopi*:  [31mERROR [0m (in 0:00:00) [0m
 [36m      - [Thread-620] lago-basic-suite-4-3-engine:  [31mERROR [0m (in 0:00:00) [0m
 [36m    * Collect artifacts:  [31mERROR [0m (in 0:00:01) [0m
 [36m  # add_vm2_lease:  [31mERROR [0m (in 0:00:03) [0m
 [36m  # Results located at /dev/shm/ost/deployment-basic-suite-4.3/002_bootstrap.py.junit.xml [0m
 [36m@ Run test: 002_bootstrap.py:  [31mERROR [0m (in 0:05:17) [0m
 [31mError occured, aborting
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirtlago/cmd.py", line 383, in do_run
    self.cli_plugins[args.ovirtverb].do_run(args)
  File "/usr/lib/python2.7/site-packages/lago/plugins/cli.py", line 184, in do_run
    self._do_run(**vars(args))
  File "/usr/lib/python2.7/site-packages/lago/utils.py", line 549, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/lago/utils.py", line 560, in wrapper
    return func(*args, prefix=prefix, **kwargs)
  File "/usr/lib/python2.7/site-packages/ovirtlago/cmd.py", line 107, in do_ovirt_runtest
    raise RuntimeError('Some tests failed')
RuntimeError: Some tests failed [0m

[1] https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-4.3_change-queue-tester/detail/ovirt-4.3_change-queue-tester/165/pipeline

On Fri, Mar 8, 2019 at 12:21 AM Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Mar 7, 2019 at 8:50 PM Dafna Ron <dron@redhat.com> wrote:
I report this issue to the list and the test owner is more then welcome to respond.
Galit, can you please review the skiptest https://gerrit.ovirt.org/#/c/98191/

I looked at engine and vdsm logs, and I don't think the add_vm2_lease test is related
so there is no point to disable it.

It seems that just after finishing resize_and_refresh_storage_domain test:

2019-02-27 13:37:21,036-0500 INFO (jsonrpc/5) [vdsm.api] START resizePV(sdUUID=u'34311cc1-c4d2-4cfe-88b5-dd5ad72261d3', spUUID=u'717575a9-7818-460d-ba3a-d5bdd8ef9ed3', guid=u'360014054e5952dc174c4a12b971ea45c', options=None) from=::ffff:192.168.201.4,43920, flow_id=f204cfe2-ef48-429d-ab33-e1175d0530a0, task_id=5908a867-5e11-474f-8119-49cfe219d4c8 (api:48)
2019-02-27 13:37:21,499-0500 INFO (jsonrpc/5) [vdsm.api] FINISH resizePV return={'size': '24293408768'} from=::ffff:192.168.201.4,43920, flow_id=f204cfe2-ef48-429d-ab33-e1175d0530a0, task_id=5908a867-5e11-474f-8119-49cfe219d4c8 (api:54)

Engine try to lookup the vm lease. The lookup fails (expected since there is no such lease):

2019-02-27 13:37:21,968-0500 INFO  (jsonrpc/1) [vdsm.api] START lease_info(lease={u'sd_id': u'b370b91d-00b5-4f62-9270-ac0acd47d075', u'lease_id': u'3500eb82-e5e2-4e24-b41c-ea02d9f6adee'}) from=::ffff:192.168.201.4,43920, flow_id=117dec74-ad59-4b12-8148-b2c130337c10, task_id=9c297d41-0aa7-4c74-b268-b710e666bc6c (api:48)

2019-02-27 13:37:21,987-0500 INFO  (jsonrpc/1) [vdsm.api] FINISH lease_info error=No such lease 3500eb82-e5e2-4e24-b41c-ea02d9f6adee from=::ffff:192.168.201.4,43920, flow_id=117dec74-ad59-4b12-8148-b2c130337c10, task_id=9c297d41-0aa7-4c74-b268-b710e666bc6c (api:52)

On engine side, we see immediately after that:

2019-02-27 13:37:22,078-05 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [117dec74-ad59-4b12-8148-b2c130337c10] method: getListForSpmSelection, params: [717575a9-7818-460d-ba3a-d5bdd8ef9ed3], timeElapsed: 12ms

Which means nothing to me, but it seems that engine treat the expected error to lookup the lease
as fatal error in the spm.

Soon engine try to stop the spm on host-0, which make no sense:

2019-02-27 13:37:22,125-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default task-1) [117dec74-ad59-4b12-8148-b2c130337c10] SpmStopVDSCommand::Stopping SPM on vds 'lago-basic-suite-master-host-0', pool id '717575a9-7818-460d-ba3a-d5bdd8ef9ed3'

At the same time, host1 has trouble to see the LUNs, and fail to connect to the pool:

2019-02-27 13:37:20,014-0500 INFO  (jsonrpc/4) [vdsm.api] START getDeviceList(storageType=3, guids=[u'360014055ad799e3a968444abdaefa323', u'360014054e5952dc174c4a12b971ea45c'], checkStatus=False, options={}) from=::ffff:192.168.201.4,33922, flow_id=f204cfe2-ef48-429d-ab33-e1175d0530a0, task_id=78fb7d20-8bec-4561-9867-da4174d181a6 (api:48)

2019-02-27 13:37:20,340-0500 INFO  (jsonrpc/4) [vdsm.api] FINISH getDeviceList return={'devList': []} from=::ffff:192.168.201.4,33922, flow_id=f204cfe2-ef48-429d-ab33-e1175d
0530a0, task_id=78fb7d20-8bec-4561-9867-da4174d181a6 (api:54)

The host does not see any LUNs.

So of course it cannot see master storage domain.

2019-02-27 13:37:22,718-0500 INFO  (jsonrpc/6) [vdsm.api] FINISH connectStoragePool error=Cannot find master domain: u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3, msdUUID=3

So we probably go into SPM recovery flow which take some time, and adding lease is just a victim of that,
since adding leases requires SPM.

We need to answer these questions:

- why host 1 does not see storage? we have seen this many time in the past
- why engine try to stop the spm on a host 0 that has no issue?
  (maybe related to incorrect handling of "no such lease" error?)
- why the previous tests did not detect that host 1 was not connected to storage?
- why the previous test did not wait until both hosts are connected to storage?

We can try to increase the timeout of the add lease test, so we can recover if the SPM is lost
for some reason during this test, but this only hides the real storage issues.

Since this is not reproducible, and we don't understand the storage failures, we should run OST
again after failures, and keep failing logs so we can investigate them later.

Strangely, we don't experience this issue on master - I did 20-30 OST runs in the last 2 weeks,
and the only issue I have seen is random host_devices failure. Why the change queue has
move failures? What is the difference between OST master runs and the change queue?

Nir
 

Thanks,
Dafna


On Tue, Mar 5, 2019 at 2:43 PM Nir Soffer <nsoffer@redhat.com> wrote:


On Tue, Mar 5, 2019, 13:27 Eyal Shenitzky <eshenitz@redhat.com wrote:


On Tue, Mar 5, 2019 at 12:58 PM Dafna Ron <dron@redhat.com> wrote:
Tal,

I see the bug is in post but the patch was not merged yet: https://gerrit.ovirt.org/#/c/98191/

can you tell me when will we merge the patch? as I rather not add SkipTest if this will be merged soon,

thanks,
Dafna


On Mon, Mar 4, 2019 at 10:42 AM Dafna Ron <dron@redhat.com> wrote:
As I had another failure of this today I will be disabling this test until issue is resolved (https://bugzilla.redhat.com/1684267)

Thanks,
Dafna


On Thu, Feb 28, 2019 at 8:48 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Feb 28, 2019 at 11:52 AM Dafna Ron <dron@redhat.com> wrote:
Hi,

We have a failure on the project in basic suite, master branch. The recent failure was in patch:
https://gerrit.ovirt.org/#/c/98087/1 - Add pre-seed for ovirt-web-ui

CQ is pointing at the below as the root cause (which was merged a while back):
https://gerrit.ovirt.org/#/c/97491/ - Add pre-seed for ovirt-web-ui

Can you please check the issue as it seems both patches are changing the same thing and the project seem to be broken since https://gerrit.ovirt.org/#/c/97491/3

Latest failure:

Logs:

errors from logs:
Engine:

2019-02-27 13:37:28,479-05 ERROR [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-1) [74283e25] Transaction rolled-back for command 'org.ovirt.engine.core.bll.UpdateVmCommand'.
2019-02-27 13:37:28,483-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [74283e25] EVENT_ID: USER_FAILED_UPDATE_VM(58), Failed to update VM vm2 (User: admin@inter
nal-authz).
2019-02-27 13:37:28,485-05 INFO  [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-1) [74283e25] Lock freed to object 'EngineLock:{exclusiveLocks='[vm2=VM_NAME]', sharedLocks='[3500eb82-e5e2-4e24-b41c-ea
02d9f6adee=VM]'}'
2019-02-27 13:37:28,485-05 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (default task-1) [74283e25] method: runAction, params: [UpdateVm, VmManagementParametersBase:{commandId='340
59769-05b9-429e-8356-f6b9b9953f55', user='admin', commandType='UpdateVm', vmId='3500eb82-e5e2-4e24-b41c-ea02d9f6adee'}], timeElapsed: 6618ms
2019-02-27 13:37:28,486-05 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-1) [] Operation Failed: []
2019-02-27 13:37:28,579-05 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] method: get, params: [e29c0ba1-464c-4eb4-a8f2-c6933d9
9969a], timeElapsed: 3ms


vdsm:

2019-02-27 13:37:21,987-0500 INFO  (jsonrpc/1) [vdsm.api] FINISH lease_info error=No such lease 3500eb82-e5e2-4e24-b41c-ea02d9f6adee from=::ffff:192.168.201.4,43920, flow_id=117dec74-ad59-4b12-8148-b2c130337c10,
 task_id=9c297d41-0aa7-4c74-b268-b710e666bc6c (api:52)
2019-02-27 13:37:21,988-0500 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='9c297d41-0aa7-4c74-b268-b710e666bc6c') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in lease_info
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3702, in lease_info
    info = dom.lease_info(lease.lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 674, in lease_info
    return vol.lookup(lease_id)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py", line 553, in lookup
    raise NoSuchLease(lease_id)
NoSuchLease: No such lease 3500eb82-e5e2-4e24-b41c-ea02d9f6adee

This is not an error of vdsm. Someone asked for information about a non-existing lease,
and vdsm fail the request with the expected error:

    NoSuchLease: No such lease 3500eb82-e5e2-4e24-b41c-ea02d9f6adee

But since the error is not a public error, vdsm logger handle it as unexpected error and
create lot of noise in vdsm log.

On engine side, this creates even worse noise, as we can see here:
(look for 2019-02-27 13:37:21,996-05)

All these noise could be avoided if vdsm was using a proper public error.

But, this error is not related to the actual failing test, this is just noise hiding the real problem.

  
Hi,

As Nir already said, This log is just noise, this is not the real problem.

I think that we already talk about this issue but I can't find the thread.

This lease failure in the log is expected and this is not an error please ignore this in the next failure analysis.

From the log I can see the following error:
{"jsonrpc": "2.0", "id": "6a42a9cf-e220-4916-ba0e-db2182978a5d", "error": {"message": "Cannot find master domain: u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3, msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3'", "code": 304}}�
2019-02-27 13:35:14,175-05 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "6a42a9cf-e220-4916-ba0e-db2182978a5d", "error": {"message": "Cannot find master domain: u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3, msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3'", "code": 304}}
2019-02-27 13:35:14,182-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-23) [73da8d8b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM lago-basic-suite-master-host-1 command ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3, msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3'
2019-02-27 13:35:14,182-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-23) [73da8d8b] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand' return value 'StatusOnlyReturn [status=Status [code=304, message=Cannot find master domain: u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3, msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3']]'
2019-02-27 13:35:14,182-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engine-Thread-23) [73da8d8b] HostName = lago-basic-suite-master-host-1
This failure occured before the lease info was requested (and failed as expected).

Therefore the fix for -  https://bugzilla.redhat.com/1684267 will not solve anything that related to the failure.

I agree with Eyal.

Please find the owner of the failing test. Every test should have an owner. If there is no owner the test should be deleted.


T hanks for reporting this.

Nir
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/PLQMYYARPAYW7KWSCHC4DWOMU4U63BGN/


--
Regards,
Eyal Shenitzky
_______________________________________________
Infra mailing list -- infra@ovirt.org
To unsubscribe send an email to infra-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/KMIRPDW5NX27G2BK47JOCTRH5YTUMVUW/


--
Liora Milbaum
Senior Principal Software Engineer
RHV/CNV DevOps
EMEA VIRTUALIZATION R&D