[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 20.4.17 ] [ add_secondary_storage_domains ]

Yaniv Kaul ykaul at redhat.com
Thu Apr 20 12:46:16 UTC 2017


On Thu, Apr 20, 2017 at 2:42 PM, Nir Soffer <nsoffer at redhat.com> wrote:

>
>
> On Thu, Apr 20, 2017 at 1:51 PM, Yaniv Kaul <ykaul at redhat.com> wrote:
>
>> On Thu, Apr 20, 2017 at 1:32 PM, Nir Soffer <nsoffer at redhat.com> wrote:
>>
>>>
>>>
>>> On Thu, Apr 20, 2017 at 1:05 PM, Piotr Kliczewski <piotr.kliczewski at gmail.com> wrote:
>>>
>>>> On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <ykaul at redhat.com> wrote:
>>>> >
>>>> >
>>>> > On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski
>>>> > <piotr.kliczewski at gmail.com> wrote:
>>>> >>
>>>> >> On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <ykaul at redhat.com> wrote:
>>>> >> > No, that's not the issue.
>>>> >> > I've seen it happening a few times.
>>>> >> >
>>>> >> > 1. It's always with the ISO domain (which we don't use anyway in o-s-t).
>>>> >> > 2. Apparently, only one host is asking for a mount:
>>>> >> >  authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
>>>> >> >
>>>> >> > (/var/log/messages of the NFS server)
>>>> >> >
>>>> >> > And indeed, you can see in [1] that host1 made the request and all is well on it.
>>>> >> >
>>>> >> > However, there are connection issues with host0 which cause a timeout in connectStorageServer(). From [2]:
>>>> >> >
>>>> >> > 2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received:
>>>> >> >
>>>> >> > {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
>>>> >> >
>>>> >> > 2017-04-19 18:58:58,475-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
>>>> >> > 2017-04-19 18:58:58,475-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-37) [755b908a] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand' return value '
>>>> >> > ServerConnectionStatusReturn:{status='Status [code=5022, message=Message timeout which can be caused by communication issues]'}
>>>> >> >
>>>> >> >
>>>> >> > I wonder why, but in /var/log/messages [3] I'm seeing:
>>>> >> > Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=60 at 0x2f44310> task#=9 at 0x2ac11d0>
>>>> >> > ...
>>>> >> >
>>>> >>
>>>> >> I see the following sequence:
>>>> >>
>>>> >> The message is sent:
>>>> >>
>>>> >> 2017-04-19 18:55:58,020-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
>>>> >> destination:jms.topic.vdsm_requests
>>>> >> content-length:381
>>>> >> ovirtCorrelationId:755b908a
>>>> >> reply-to:jms.topic.vdsm_responses
>>>> >>
>>>> >> <JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method: StoragePool.connectStorageServer, params: {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1, connectionParams=[{password=, protocol_version=4.2, port=, iqn=, connection=192.168.201.3:/exports/nfs/share1, id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
>>>> >>
>>>> >> There is no response for the specified amount of time and we time out:
>>>> >>
>>>> >> 2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received:
>>>> >>
>>>> >> {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
>>>> >>
>>>> >> As Yaniv pointed out, here is why we never get the response:
>>>> >>
>>>> >> Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>
>>>>
>>>
>>> This means the connection attempt was stuck for 180 seconds. We need to
>>> check whether the mount itself was stuck, or maybe there is some issue in
>>> supervdsm running it.
>>>
>>> This is a new warning, introduced recently; before, a stuck worker was
>>> hidden.
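
Not vdsm's actual executor code, but a minimal sketch of the kind of watchdog that produces such a "Worker blocked" warning; the class and function names here are made up:

# Illustrative sketch only -- not vdsm's executor implementation.
# A watchdog periodically checks how long each worker's current task has
# been running and logs "Worker blocked" once the task exceeds its timeout,
# instead of letting the stuck worker go unnoticed.
import logging
import time

log = logging.getLogger("Executor")

class Worker(object):
    def __init__(self, name, timeout=60):
        self.name = name
        self.timeout = timeout   # seconds allowed per task
        self.task = None         # description of the current task
        self.started = None      # when the current task started

    def run_task(self, task, func):
        self.task, self.started = task, time.time()
        try:
            func()
        finally:
            self.task, self.started = None, None

def check_workers(workers):
    # Called periodically, e.g. from a monitoring thread.
    now = time.time()
    for w in workers:
        if w.task is not None and now - w.started > w.timeout:
            log.warning(
                "Worker blocked: <Worker name=%s running %r timeout=%d, "
                "duration=%d>", w.name, w.task, w.timeout, int(now - w.started))
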
>>>
>>> Do we check that the nfs server is up before we start the tests?
>>>
>>
>> We do not check, but it is up - it's installed before anything else, and
>> happens to be on the Engine. It has several minutes to be up and ready.
>> Moreover, a command to the same NFS server, with the same params and only a
>> different mount point, succeeded a second earlier:
>>
>> 18:55:56,300::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share2', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share2') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=1', 'timeout': None, 'cgroup': None}
>>
>> ...
>>
>> 18:55:56,338::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) return mount with None
>>
>>
>> But:
>>
>> 18:55:58,076::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share1', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2', 'timeout': None, 'cgroup': None}
>>
>>
>> Got stuck.
>>
>>
> The stuck mount is using nfs 4.2, the successful one is 4.1.
>
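
A hypothetical way to check whether the NFS 4.2 mount itself hangs, independent of vdsm, would be to issue the same mount (options copied from the supervdsm log above) under a timeout; the mount point and the 180-second limit in this sketch are made up (Python 3):

# Hypothetical reproduction sketch -- export and mount options are copied
# from the supervdsm log above; the mount point and timeout are illustrative.
import subprocess

SERVER_EXPORT = "192.168.201.3:/exports/nfs/share1"
MOUNT_POINT = "/tmp/nfs42-test"   # illustrative local directory
OPTS = "soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2"

subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
try:
    # If this hangs like the supervdsm call did, the problem is in the
    # NFS 4.2 mount itself rather than in vdsm/supervdsm.
    subprocess.run(["mount", "-t", "nfs", "-o", OPTS,
                    SERVER_EXPORT, MOUNT_POINT],
                   timeout=180, check=True)
    print("mount completed")
except subprocess.TimeoutExpired:
    print("mount did not complete within 180 seconds")
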

On the other host it succeeded [1], but the order was different (it runs in
parallel threads).
Perhaps we should add some random sleep between them?
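
A minimal sketch of the random-sleep idea, assuming the connections are kicked off from parallel threads; connect_storage_server() and the host/connection values below are placeholders, not the actual OST or engine API:

# Minimal sketch of staggering parallel connection attempts with jitter.
import random
import threading
import time

def connect_storage_server(host, connection):
    # Placeholder for the real call that asks a host to connect to storage.
    print("%s connecting to %s" % (host, connection))

def connect_with_jitter(host, connection, max_jitter=5.0):
    # Sleep a random amount so both hosts do not hit the NFS server
    # at exactly the same moment.
    time.sleep(random.uniform(0, max_jitter))
    connect_storage_server(host, connection)

targets = [("host0", "192.168.201.3:/exports/nfs/share1"),
           ("host1", "192.168.201.3:/exports/nfs/share2")]
threads = [threading.Thread(target=connect_with_jitter, args=t) for t in targets]
for t in threads:
    t.start()
for t in threads:
    t.join()
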

Y.

[1]
http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log

>
>> See https://bugzilla.redhat.com/show_bug.cgi?id=1443913
>>
>> Y.
>>
>>
>>
>>
>>>
>>> >>
>>>> >> >
>>>> >> > 3. Also, there are still the infamous "unable to update response" issues.
>>>> >> >
>>>> >>
>>>> >> When we see a timeout on a call, our default behavior is to reconnect,
>>>> >> and at that point we clean the pending messages.
>>>> >> As a result, when we reconnect and then receive a response to a message
>>>> >> sent before the disconnect, we say it is unknown to us.
>>>> >
>>>> >
>>>> > But the example I've given was earlier than the storage issue?
>>>>
>>>> The specific message that you refer to was a ping command, but it timed
>>>> out (3 secs) before the response arrived, so it was removed from tracking.
>>>> When the response finally arrived, it was not tracked anymore.
>>>>
>>>> We may want to increase the timeout to give it more time to arrive.
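
Not the actual jsonrpc client code, but a minimal sketch of why a late response can no longer be matched: the request id is dropped from the tracking map when the call times out (or when pending messages are cleaned on reconnect), so a response arriving afterwards has nothing to update:

# Illustrative sketch only -- not the ovirt jsonrpc client implementation.
import logging
import time

log = logging.getLogger("JsonRpcClient")

class ResponseTracker(object):
    def __init__(self):
        self._pending = {}   # request id -> deadline (monotonic seconds)

    def track(self, request_id, timeout):
        self._pending[request_id] = time.monotonic() + timeout

    def clean_expired(self):
        # Called on timeout/reconnect; expired ids are simply forgotten.
        now = time.monotonic()
        for rid in [r for r, d in self._pending.items() if now > d]:
            del self._pending[rid]

    def on_response(self, request_id):
        if request_id not in self._pending:
            # Timed out (or cleaned on reconnect) before the response arrived.
            log.error('Not able to update response for "%s"', request_id)
            return
        del self._pending[request_id]
        # ... deliver the result to the waiting caller here ...

# With a 3-second ping timeout, a response that takes ~4 seconds is orphaned:
tracker = ResponseTracker()
tracker.track("7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", timeout=3)
time.sleep(4)
tracker.clean_expired()
tracker.on_response("7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0")
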
>>>>
>>>> > Y.
>>>> >
>>>> >>
>>>> >>
>>>> >> >
>>>> >> > {"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"
>>>> 7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"}�
>>>> >> > 2017-04-19 18:54:27,843-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
>>>> >> > destination:jms.topic.vdsm_requests
>>>> >> > content-length:94
>>>> >> > ovirtCorrelationId:62d198cc
>>>> >> > reply-to:jms.topic.vdsm_responses
>>>> >> >
>>>> >> > <JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method: Host.ping, params: {}>
>>>> >> > 2017-04-19 18:54:27,885-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
>>>> >> > ovirtCorrelationId:1f9aac13
>>>> >> > destination:jms.topic.vdsm_requests
>>>> >> > reply-to:jms.topic.vdsm_responses
>>>> >> > content-length:94
>>>> >> >
>>>> >> > ...
>>>> >> >
>>>> >> > {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0",
>>>> >> > "result":
>>>> >> > true}�
>>>> >> > 2017-04-19 18:54:32,132-04 DEBUG
>>>> >> > [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
>>>> (ResponseWorker)
>>>> >> > []
>>>> >> > Message received: {"jsonrpc": "2.0", "id":
>>>> >> > "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
>>>> >> > 2017-04-19 18:54:32,133-04 ERROR
>>>> >> > [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) []
>>>> Not
>>>> >> > able
>>>> >> > to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
>>>> >> >
>>>> >> >
>>>> >> > It would be nice to understand why.
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > 4. Lastly, MOM is not running. Why?
>>>> >> >
>>>> >> >
>>>> >> > Please open a bug with the details from item #2 above.
>>>> >> > Y.
>>>> >> >
>>>> >> >
>>>> >> > [1]
>>>> >> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
>>>> >> > [2]
>>>> >> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>> >> > [3]
>>>> >> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <gshinar at redhat.com> wrote:
>>>> >> >>
>>>> >> >> Test failed: add_secondary_storage_domains
>>>> >> >> Link to suspected patches:
>>>> >> >> Link to Job:
>>>> >> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
>>>> >> >> Link to all logs:
>>>> >> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
>>>> >> >>
>>>> >> >>
>>>> >> >> Error seems to be:
>>>> >> >> 2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error (task:871)
>>>> >> >> Traceback (most recent call last):
>>>> >> >>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
>>>> >> >>     return fn(*args, **kargs)
>>>> >> >>   File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
>>>> >> >>     res = f(*args, **kwargs)
>>>> >> >>   File "/usr/share/vdsm/storage/hsm.py", line 2709, in getStorageDomainInfo
>>>> >> >>     dom = self.validateSdUUID(sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
>>>> >> >>     sdDom = sdCache.produce(sdUUID=sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
>>>> >> >>     domain.getRealDomain()
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
>>>> >> >>     return self._cache._realProduce(self._sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
>>>> >> >>     domain = self._findDomain(sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
>>>> >> >>     return findMethod(sdUUID)
>>>> >> >>   File "/usr/share/vdsm/storage/sdc.py", line 178, in _findUnfetchedDomain
>>>> >> >>     raise se.StorageDomainDoesNotExist(sdUUID)
>>>> >> >> StorageDomainDoesNotExist: Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
>>>> >> >> 2017-04-19 18:58:58,777-0400 INFO  (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is aborted: 'Storage domain does not exist' - code 358 (task:1176)
>>>> >> >> 2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher] {'status': {'message': "Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}} (dispatcher:78)
>>>> >> >>