[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 20.4.17 ] [ add_secondary_storage_domains ]
Piotr Kliczewski
piotr.kliczewski at gmail.com
Thu Apr 20 10:04:21 UTC 2017
On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <ykaul at redhat.com> wrote:
>
>
> On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski
> <piotr.kliczewski at gmail.com> wrote:
>>
>> On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <ykaul at redhat.com> wrote:
>> > No, that's not the issue.
>> > I've seen it happening few times.
>> >
>> > 1. It always with the ISO domain (which we don't use anyway in o-s-t)
>> > 2. Apparently, only one host is asking for a mount:
>> > authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso
>> > (/exports/nfs/iso)
>> >
>> > (/var/log/messages of the NFS server)
>> >
>> > And indeed, you can see in[1] that host1 made the request and all is
>> > well on
>> > it.
>> >
>> > However, there are connection issues with host0 which cause a timeout to
>> > connectStorageServer():
>> > From[2]:
>> >
>> > 2017-04-19 18:58:58,465-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker)
>> > []
>> > Message received:
>> >
>> > {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds
>> > timeout occured"},"id":null}
>> >
>> > 2017-04-19 18:58:58,475-04 ERROR
>> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> > (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID:
>> > VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack:
>> > null,
>> > Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command
>> > ConnectStorageServerVDS failed: Message timeout which can be caused by
>> > communication issues
>> > 2017-04-19 18:58:58,475-04 INFO
>> >
>> > [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]
>> > (org.ovirt.thread.pool-7-thread-37) [755b908a] Command
>> >
>> > 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand'
>> > return value '
>> > ServerConnectionStatusReturn:{status='Status [code=5022, message=Message
>> > timeout which can be caused by communication issues]'}
>> >
>> >
>> > I wonder why, but on /var/log/messages[3], I'm seeing:
>> > Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor
>> > WARN
>> > Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask
>> > {'params':
>> > {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1',
>> > u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'',
>> > u'user':
>> > u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password':
>> > '********',
>> > u'port': u''}], u'storagepoolID':
>> > u'00000000-0000-0000-0000-000000000000',
>> > u'domainType': 1}, 'jsonrpc': '2.0', 'method':
>> > u'StoragePool.connectStorageServer', 'id':
>> > u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60,
>> > duration=60 at 0x2f44310> task#=9 at 0x2ac11d0>
>> > ...
>> >
>>
>> I see following sequence:
>>
>> The message is sent:
>>
>> 2017-04-19 18:55:58,020-04 DEBUG
>> [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
>> (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
>> destination:jms.topic.vdsm_requests
>> content-length:381
>> ovirtCorrelationId:755b908a
>> reply-to:jms.topic.vdsm_responses
>>
>> <JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method:
>> StoragePool.connectStorageServer, params:
>> {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1,
>> connectionParams=[{password=, protocol_version=4.2, port=, iqn=,
>> connection=192.168.201.3:/exports/nfs/share1,
>> id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
>>
>> There is no response for specified amount of time and we timeout:
>>
>> 2017-04-19 18:58:58,465-04 DEBUG
>> [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
>> (ResponseWorker) [] Message received:
>>
>> {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds
>> timeout occured"},"id":null}
>>
>> As Yaniv pointed here is why we never get the response:
>>
>> Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor
>> WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask
>> {'params': {u'connectionParams': [{u'id':
>> u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection':
>> u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'',
>> u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********',
>> u'port': u''}], u'storagepoolID':
>> u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc':
>> '2.0', 'method': u'StoragePool.connectStorageServer', 'id':
>> u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60,
>> duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>
>>
>> >
>> > 3. Also, there is still the infamous unable to update response issues.
>> >
>>
>> When we see timeout on a call our default behavior is to reconnect
>> when we clean pending messages.
>> As a result when we reconnect and receive a response from the message
>> sent before disconnect
>> we say it is unknown to us.
>
>
> But the example I've given was earlier than the storage issue?
The specific message that you refer to was a ping command but it
timeout (3 secs)
before it arrived and it was removed from tracking. When it finally
arrived it was not tracked anymore.
We may want to increase the timeout to give it more time for arrival.
> Y.
>
>>
>>
>> >
>> > {"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"}�
>> > 2017-04-19 18:54:27,843-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
>> > (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
>> > destination:jms.topic.vdsm_requests
>> > content-length:94
>> > ovirtCorrelationId:62d198cc
>> > reply-to:jms.topic.vdsm_responses
>> >
>> > <JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method:
>> > Host.ping, params: {}>
>> > 2017-04-19 18:54:27,885-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
>> > (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
>> > ovirtCorrelationId:1f9aac13
>> > destination:jms.topic.vdsm_requests
>> > reply-to:jms.topic.vdsm_responses
>> > content-length:94
>> >
>> > ...
>> >
>> > {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0",
>> > "result":
>> > true}�
>> > 2017-04-19 18:54:32,132-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker)
>> > []
>> > Message received: {"jsonrpc": "2.0", "id":
>> > "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
>> > 2017-04-19 18:54:32,133-04 ERROR
>> > [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not
>> > able
>> > to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
>> >
>> >
>> > Would be nice to understand why.
>> >
>> >
>> >
>> > 4. Lastly, MOM is not running. Why?
>> >
>> >
>> > Please open a bug with the details from item #2 above.
>> > Y.
>> >
>> >
>> > [1]
>> >
>> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
>> > [2]
>> >
>> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>> > [3]
>> >
>> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <gshinar at redhat.com> wrote:
>> >>
>> >> Test failed: add_secondary_storage_domains
>> >> Link to suspected patches:
>> >> Link to Job:
>> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
>> >> Link to all logs:
>> >>
>> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
>> >>
>> >>
>> >> Error seems to be:
>> >> 2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2)
>> >> [storage.TaskManager.Task]
>> >> (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error
>> >> (task:871)
>> >> Traceback (most recent call last):
>> >> File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line
>> >> 878,
>> >> in _run
>> >> return fn(*args, **kargs)
>> >> File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in
>> >> wrapper
>> >> res = f(*args, **kwargs)
>> >> File "/usr/share/vdsm/storage/hsm.py", line 2709, in
>> >> getStorageDomainInfo
>> >> dom = self.validateSdUUID(sdUUID)
>> >> File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
>> >> sdDom = sdCache.produce(sdUUID=sdUUID)
>> >> File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
>> >> domain.getRealDomain()
>> >> File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
>> >> return self._cache._realProduce(self._sdUUID)
>> >> File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
>> >> domain = self._findDomain(sdUUID)
>> >> File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
>> >> return findMethod(sdUUID)
>> >> File "/usr/share/vdsm/storage/sdc.py", line 178, in
>> >> _findUnfetchedDomain
>> >> raise se.StorageDomainDoesNotExist(sdUUID)
>> >> StorageDomainDoesNotExist: Storage domain does not exist:
>> >> (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
>> >> 2017-04-19 18:58:58,777-0400 INFO (jsonrpc/2)
>> >> [storage.TaskManager.Task]
>> >> (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is
>> >> aborted:
>> >> 'Storage domain does not exist' - code 358 (task:1176)
>> >> 2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher]
>> >> {'status': {'message': "Storage domain does not exist:
>> >> (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}}
>> >> (dispatcher:78)
>> >>
>> >> _______________________________________________
>> >> Devel mailing list
>> >> Devel at ovirt.org
>> >> http://lists.ovirt.org/mailman/listinfo/devel
>> >
>> >
>> >
>> > _______________________________________________
>> > Devel mailing list
>> > Devel at ovirt.org
>> > http://lists.ovirt.org/mailman/listinfo/devel
>
>
More information about the Devel
mailing list