On Thu, Mar 7, 2019 at 8:50 PM Dafna Ron <dron(a)redhat.com> wrote:
I report this issue to the list and the test owner is more then
welcome to
respond.
Galit, can you please review the skiptest
https://gerrit.ovirt.org/#/c/98191/
I looked at engine and vdsm logs, and I don't think the add_vm2_lease test
is related
so there is no point to disable it.
It seems that just after finishing resize_and_refresh_storage_domain test:
2019-02-27 13:37:21,036-0500 INFO (jsonrpc/5) [vdsm.api] START
resizePV(sdUUID=u'34311cc1-c4d2-4cfe-88b5-dd5ad72261d3',
spUUID=u'717575a9-7818-460d-ba3a-d5bdd8ef9ed3',
guid=u'360014054e5952dc174c4a12b971ea45c', options=None)
from=::ffff:192.168.201.4,43920,
flow_id=f204cfe2-ef48-429d-ab33-e1175d0530a0,
task_id=5908a867-5e11-474f-8119-49cfe219d4c8 (api:48)
2019-02-27 13:37:21,499-0500 INFO (jsonrpc/5) [vdsm.api] FINISH resizePV
return={'size': '24293408768'} from=::ffff:192.168.201.4,43920,
flow_id=f204cfe2-ef48-429d-ab33-e1175d0530a0,
task_id=5908a867-5e11-474f-8119-49cfe219d4c8 (api:54)
Engine try to lookup the vm lease. The lookup fails (expected since there
is no such lease):
2019-02-27 13:37:21,968-0500 INFO (jsonrpc/1) [vdsm.api] START
lease_info(lease={u'sd_id': u'b370b91d-00b5-4f62-9270-ac0acd47d075',
u'lease_id': u'3500eb82-e5e2-4e24-b41c-ea02d9f6adee'})
from=::ffff:192.168.201.4,43920,
flow_id=117dec74-ad59-4b12-8148-b2c130337c10,
task_id=9c297d41-0aa7-4c74-b268-b710e666bc6c (api:48)
2019-02-27 13:37:21,987-0500 INFO (jsonrpc/1) [vdsm.api] FINISH lease_info
error=No such lease 3500eb82-e5e2-4e24-b41c-ea02d9f6adee
from=::ffff:192.168.201.4,43920,
flow_id=117dec74-ad59-4b12-8148-b2c130337c10,
task_id=9c297d41-0aa7-4c74-b268-b710e666bc6c (api:52)
On engine side, we see immediately after that:
2019-02-27 13:37:22,078-05 DEBUG
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor]
(default task-1) [117dec74-ad59-4b12-8148-b2c130337c10] method:
getListForSpmSelection, params: [717575a9-7818-460d-ba3a-d5bdd8ef9ed3],
timeElapsed: 12ms
Which means nothing to me, but it seems that engine treat the expected
error to lookup the lease
as fatal error in the spm.
Soon engine try to stop the spm on host-0, which make no sense:
2019-02-27 13:37:22,125-05 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (default
task-1) [117dec74-ad59-4b12-8148-b2c130337c10] SpmStopVDSCommand::Stopping
SPM on vds 'lago-basic-suite-master-host-0', pool id
'717575a9-7818-460d-ba3a-d5bdd8ef9ed3'
At the same time, host1 has trouble to see the LUNs, and fail to connect to
the pool:
2019-02-27 13:37:20,014-0500 INFO (jsonrpc/4) [vdsm.api] START
getDeviceList(storageType=3, guids=[u'360014055ad799e3a968444abdaefa323',
u'360014054e5952dc174c4a12b971ea45c'], checkStatus=False, options={})
from=::ffff:192.168.201.4,33922,
flow_id=f204cfe2-ef48-429d-ab33-e1175d0530a0,
task_id=78fb7d20-8bec-4561-9867-da4174d181a6 (api:48)
2019-02-27 13:37:20,340-0500 INFO (jsonrpc/4) [vdsm.api] FINISH
getDeviceList return={'devList': []} from=::ffff:192.168.201.4,33922,
flow_id=f204cfe2-ef48-429d-ab33-e1175d
0530a0, task_id=78fb7d20-8bec-4561-9867-da4174d181a6 (api:54)
The host does not see any LUNs.
So of course it cannot see master storage domain.
2019-02-27 13:37:22,718-0500 INFO (jsonrpc/6) [vdsm.api] FINISH
connectStoragePool error=Cannot find master domain:
u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3, msdUUID=3
So we probably go into SPM recovery flow which take some time, and adding
lease is just a victim of that,
since adding leases requires SPM.
We need to answer these questions:
- why host 1 does not see storage? we have seen this many time in the past
- why engine try to stop the spm on a host 0 that has no issue?
(maybe related to incorrect handling of "no such lease" error?)
- why the previous tests did not detect that host 1 was not connected to
storage?
- why the previous test did not wait until both hosts are connected to
storage?
We can try to increase the timeout of the add lease test, so we can recover
if the SPM is lost
for some reason during this test, but this only hides the real storage
issues.
Since this is not reproducible, and we don't understand the storage
failures, we should run OST
again after failures, and keep failing logs so we can investigate them
later.
Strangely, we don't experience this issue on master - I did 20-30 OST runs
in the last 2 weeks,
and the only issue I have seen is random host_devices failure. Why the
change queue has
move failures? What is the difference between OST master runs and the
change queue?
Nir
Thanks,
Dafna
On Tue, Mar 5, 2019 at 2:43 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
>
>
> On Tue, Mar 5, 2019, 13:27 Eyal Shenitzky <eshenitz(a)redhat.com wrote:
>
>>
>>
>> On Tue, Mar 5, 2019 at 12:58 PM Dafna Ron <dron(a)redhat.com> wrote:
>>
>>> Tal,
>>>
>>> I see the bug is in post but the patch was not merged yet:
>>>
https://gerrit.ovirt.org/#/c/98191/
>>>
>>> can you tell me when will we merge the patch? as I rather not add
>>> SkipTest if this will be merged soon,
>>>
>>> thanks,
>>> Dafna
>>>
>>>
>>> On Mon, Mar 4, 2019 at 10:42 AM Dafna Ron <dron(a)redhat.com> wrote:
>>>
>>>> As I had another failure of this today I will be disabling this test
>>>> until issue is resolved (
https://bugzilla.redhat.com/1684267)
>>>>
>>>> Thanks,
>>>> Dafna
>>>>
>>>>
>>>> On Thu, Feb 28, 2019 at 8:48 PM Nir Soffer <nsoffer(a)redhat.com>
wrote:
>>>>
>>>>> On Thu, Feb 28, 2019 at 11:52 AM Dafna Ron <dron(a)redhat.com>
wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have a failure on the project in basic suite, master branch.
The
>>>>>> recent failure was in patch:
>>>>>>
https://gerrit.ovirt.org/#/c/98087/1 - Add pre-seed for
ovirt-web-ui
>>>>>>
>>>>>> CQ is pointing at the below as the root cause (which was merged
a
>>>>>> while back):
>>>>>>
https://gerrit.ovirt.org/#/c/97491/ - Add pre-seed for
ovirt-web-ui
>>>>>>
>>>>>> Can you please check the issue as it seems both patches are
changing
>>>>>> the same thing and the project seem to be broken since
>>>>>>
https://gerrit.ovirt.org/#/c/97491/3
>>>>>>
>>>>>> Latest failure:
>>>>>>
https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/13171/
>>>>>>
>>>>>> Logs:
>>>>>>
>>>>>>
https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/13171/arti...
>>>>>>
>>>>>> errors from logs:
>>>>>> Engine:
>>>>>>
>>>>>> 2019-02-27 13:37:28,479-05 ERROR
>>>>>> [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-1)
[74283e25]
>>>>>> Transaction rolled-back for command
>>>>>> 'org.ovirt.engine.core.bll.UpdateVmCommand'.
>>>>>> 2019-02-27 13:37:28,483-05 ERROR
>>>>>>
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>>>>> (default task-1) [74283e25] EVENT_ID: USER_FAILED_UPDATE_VM(58),
Failed to
>>>>>> update VM vm2 (User: admin@inter
>>>>>> nal-authz).
>>>>>> 2019-02-27 13:37:28,485-05 INFO
>>>>>> [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-1)
[74283e25]
>>>>>> Lock freed to object
'EngineLock:{exclusiveLocks='[vm2=VM_NAME]',
>>>>>> sharedLocks='[3500eb82-e5e2-4e24-b41c-ea
>>>>>> 02d9f6adee=VM]'}'
>>>>>> 2019-02-27 13:37:28,485-05 DEBUG
>>>>>>
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor]
>>>>>> (default task-1) [74283e25] method: runAction, params:
[UpdateVm,
>>>>>> VmManagementParametersBase:{commandId='340
>>>>>> 59769-05b9-429e-8356-f6b9b9953f55', user='admin',
>>>>>> commandType='UpdateVm',
vmId='3500eb82-e5e2-4e24-b41c-ea02d9f6adee'}],
>>>>>> timeElapsed: 6618ms
>>>>>> 2019-02-27 13:37:28,486-05 ERROR
>>>>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
(default
>>>>>> task-1) [] Operation Failed: []
>>>>>> 2019-02-27 13:37:28,579-05 DEBUG
>>>>>>
[org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor]
>>>>>> (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] method:
get, params:
>>>>>> [e29c0ba1-464c-4eb4-a8f2-c6933d9
>>>>>> 9969a], timeElapsed: 3ms
>>>>>>
>>>>>>
>>>>>> vdsm:
>>>>>>
>>>>>> 2019-02-27 13:37:21,987-0500 INFO (jsonrpc/1) [vdsm.api] FINISH
>>>>>> lease_info error=No such lease
3500eb82-e5e2-4e24-b41c-ea02d9f6adee
>>>>>> from=::ffff:192.168.201.4,43920,
>>>>>> flow_id=117dec74-ad59-4b12-8148-b2c130337c10,
>>>>>> task_id=9c297d41-0aa7-4c74-b268-b710e666bc6c (api:52)
>>>>>> 2019-02-27 13:37:21,988-0500 ERROR (jsonrpc/1)
>>>>>> [storage.TaskManager.Task]
(Task='9c297d41-0aa7-4c74-b268-b710e666bc6c')
>>>>>> Unexpected error (task:875)
>>>>>> Traceback (most recent call last):
>>>>>> File
"/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line
>>>>>> 882, in _run
>>>>>> return fn(*args, **kargs)
>>>>>> File "<string>", line 2, in lease_info
>>>>>> File
"/usr/lib/python2.7/site-packages/vdsm/common/api.py", line
>>>>>> 50, in method
>>>>>> ret = func(*args, **kwargs)
>>>>>> File
"/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line
>>>>>> 3702, in lease_info
>>>>>> info = dom.lease_info(lease.lease_id)
>>>>>> File
"/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line
>>>>>> 674, in lease_info
>>>>>> return vol.lookup(lease_id)
>>>>>> File
"/usr/lib/python2.7/site-packages/vdsm/storage/xlease.py",
>>>>>> line 553, in lookup
>>>>>> raise NoSuchLease(lease_id)
>>>>>> NoSuchLease: No such lease 3500eb82-e5e2-4e24-b41c-ea02d9f6adee
>>>>>>
>>>>>
>>>>> This is not an error of vdsm. Someone asked for information about a
>>>>> non-existing lease,
>>>>> and vdsm fail the request with the expected error:
>>>>>
>>>>> NoSuchLease: No such lease 3500eb82-e5e2-4e24-b41c-ea02d9f6adee
>>>>>
>>>>> But since the error is not a public error, vdsm logger handle it as
>>>>> unexpected error and
>>>>> create lot of noise in vdsm log.
>>>>>
>>>>> On engine side, this creates even worse noise, as we can see here:
>>>>>
>>>>>
https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/13171/arti...
>>>>> (look for 2019-02-27 13:37:21,996-05)
>>>>>
>>>>> All these noise could be avoided if vdsm was using a proper public
>>>>> error.
>>>>> I filed
https://bugzilla.redhat.com/1684267
>>>>>
>>>>> But, this error is not related to the actual failing test, this is
>>>>> just noise hiding the real problem.
>>>>>
>>>>>
>> Hi,
>>
>> As Nir already said, This log is just noise, this is not the real
>> problem.
>>
>> I think that we already talk about this issue but I can't find the
>> thread.
>>
>> This lease failure in the log is expected and this is not an error
>> please ignore this in the next failure analysis.
>>
>> From the log I can see the following error:
>>
>> {"jsonrpc": "2.0", "id":
"6a42a9cf-e220-4916-ba0e-db2182978a5d", "error": {"message":
"Cannot find master domain: u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3,
msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3'", "code": 304}}�
>> 2019-02-27 13:35:14,175-05 DEBUG
[org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message
received: {"jsonrpc": "2.0", "id":
"6a42a9cf-e220-4916-ba0e-db2182978a5d", "error": {"message":
"Cannot find master domain: u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3,
msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3'", "code": 304}}
>> 2019-02-27 13:35:14,182-05 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engine-Thread-23) [73da8d8b] EVENT_ID:
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM lago-basic-suite-master-host-1 command
ConnectStoragePoolVDS failed: Cannot find master domain:
u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3,
msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3'
>> 2019-02-27 13:35:14,182-05 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
(EE-ManagedThreadFactory-engine-Thread-23) [73da8d8b] Command
'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand' return
value 'StatusOnlyReturn [status=Status [code=304, message=Cannot find master domain:
u'spUUID=717575a9-7818-460d-ba3a-d5bdd8ef9ed3,
msdUUID=34311cc1-c4d2-4cfe-88b5-dd5ad72261d3']]'
>> 2019-02-27 13:35:14,182-05 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
(EE-ManagedThreadFactory-engine-Thread-23) [73da8d8b] HostName =
lago-basic-suite-master-host-1
>>
>> This failure occured before the lease info was requested (and failed as
>> expected).
>>
>> Therefore the fix for -
https://bugzilla.redhat.com/1684267 will not
>> solve anything that related to the failure.
>>
>
> I agree with Eyal.
>
> Please find the owner of the failing test. Every test should have an
> owner. If there is no owner the test should be deleted.
>
>
>> T hanks for reporting this.
>>>>>
>>>>> Nir
>>>>>
>>>> _______________________________________________
>>> Devel mailing list -- devel(a)ovirt.org
>>> To unsubscribe send an email to devel-leave(a)ovirt.org
>>> Privacy Statement:
https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>>
https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>>
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/PLQMYYARPAY...
>>>
>>
>>
>> --
>> Regards,
>> Eyal Shenitzky
>>
>