On Thu, Apr 20, 2017 at 1:51 PM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Thu, Apr 20, 2017 at 1:32 PM, Nir Soffer <nsoffer@redhat.com> wrote:


On Thu, Apr 20, 2017 at 1:05 PM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
>
>
> On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski
> <piotr.kliczewski@gmail.com> wrote:
>>
>> On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
>> > No, that's not the issue.
>> > I've seen it happening a few times.
>> >
>> > 1. It's always with the ISO domain (which we don't use anyway in o-s-t)
>> > 2. Apparently, only one host is asking for a mount:
>> >  authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso
>> > (/exports/nfs/iso)
>> >
>> > (/var/log/messages of the NFS server)
>> >
>> > And indeed, you can see in [1] that host1 made the request and all is
>> > well on it.
>> >
>> > However, there are connection issues with host0 which cause a timeout in
>> > connectStorageServer():
>> > From [2]:
>> >
>> > 2017-04-19 18:58:58,465-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker)
>> > []
>> > Message received:
>> >
>> > {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds
>> > timeout occured"},"id":null}
>> >
>> > 2017-04-19 18:58:58,475-04 ERROR
>> > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> > (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID:
>> > VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack:
>> > null,
>> > Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command
>> > ConnectStorageServerVDS failed: Message timeout which can be caused by
>> > communication issues
>> > 2017-04-19 18:58:58,475-04 INFO
>> >
>> > [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand]
>> > (org.ovirt.thread.pool-7-thread-37) [755b908a] Command
>> >
>> > 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand'
>> > return value '
>> > ServerConnectionStatusReturn:{status='Status [code=5022, message=Message
>> > timeout which can be caused by communication issues]'}
>> >
>> >
>> > I wonder why, but in /var/log/messages [3] I'm seeing:
>> > Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor
>> > WARN
>> > Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask
>> > {'params':
>> > {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1',
>> > u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'',
>> > u'user':
>> > u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password':
>> > '********',
>> > u'port': u''}], u'storagepoolID':
>> > u'00000000-0000-0000-0000-000000000000',
>> > u'domainType': 1}, 'jsonrpc': '2.0', 'method':
>> > u'StoragePool.connectStorageServer', 'id':
>> > u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60,
>> > duration=60 at 0x2f44310> task#=9 at 0x2ac11d0>
>> > ...
>> >
>>
>> I see the following sequence:
>>
>> The message is sent:
>>
>> 2017-04-19 18:55:58,020-04 DEBUG
>> [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
>> (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
>> destination:jms.topic.vdsm_requests
>> content-length:381
>> ovirtCorrelationId:755b908a
>> reply-to:jms.topic.vdsm_responses
>>
>> <JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method:
>> StoragePool.connectStorageServer, params:
>> {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1,
>> connectionParams=[{password=, protocol_version=4.2, port=, iqn=,
>> connection=192.168.201.3:/exports/nfs/share1,
>> id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
>>
>> There is no response within the specified amount of time and we time out:
>>
>> 2017-04-19 18:58:58,465-04 DEBUG
>> [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker]
>> (ResponseWorker) [] Message received:
>>
>> {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds
>> timeout occured"},"id":null}
>>
>> As Yaniv pointed out, here is why we never get the response:
>>
>> Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor
>> WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask
>> {'params': {u'connectionParams': [{u'id':
>> u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection':
>> u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'',
>> u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********',
>> u'port': u''}], u'storagepoolID':
>> u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc':
>> '2.0', 'method': u'StoragePool.connectStorageServer', 'id':
>> u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60,
>> duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>

This means the connection attempt was stuck for 180 seconds. We need to check whether the mount itself was stuck, or whether there is some issue in supervdsm running it.

This is a new warning that was introduced recently; previously, a stuck worker was simply hidden.
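
Roughly, the new warning works like this (a simplified sketch of the idea only, not the actual vdsm executor code): a monitor periodically checks each worker's running task and, once the task's duration exceeds its timeout, keeps logging a "Worker blocked" warning instead of hiding the stuck worker.

# Simplified sketch of the "Worker blocked" watchdog idea - illustration
# only, not the actual vdsm executor implementation.
import logging
import time

log = logging.getLogger("Executor")

class Worker(object):
    def __init__(self, name, timeout=60):
        self.name = name
        self.timeout = timeout
        self.task = None      # description of the task currently running
        self.started = None   # wall-clock time when the task started

    def run(self, task, func):
        self.task = task
        self.started = time.time()
        try:
            func()
        finally:
            self.task = None
            self.started = None

def watchdog(workers, interval=10):
    # Keep warning as long as a worker is stuck, instead of hiding it -
    # the reported duration grows (60, 120, 180...) on each check.
    while True:
        time.sleep(interval)
        for w in workers:
            started, task = w.started, w.task
            if started is None:
                continue
            duration = time.time() - started
            if duration > w.timeout:
                log.warning("Worker blocked: <Worker name=%s running %r "
                            "timeout=%d, duration=%d>",
                            w.name, task, w.timeout, int(duration))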

Do we check that the nfs server is up before we start the tests?

We do not check, but it is up - it's installed before anything else, and happens to be on the Engine. It has several minutes to be up and ready.
Moreover, a command to the same NFS server, with the same params and only a different mount point, succeeded a second earlier:
18:55:56,300::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share2', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share2') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=1', 'timeout': None, 'cgroup': None}
...
18:55:56,338::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) return mount with None

But:
18:55:58,076::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share1', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2', 'timeout': None, 'cgroup': None}

Got stuck.

The stuck mount is using NFS 4.2, while the successful one is using NFS 4.1.
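
For reference, reproducing those two calls by hand should boil down to something like the following (a sketch based only on the params logged above, not the exact command line supervdsm builds), which makes it easy to compare 4.1 and 4.2 behaviour against the same server:

# Rough manual reproduction of the two mounts from the supervdsm log
# above - a sketch based on the logged params, not the exact command
# supervdsm builds.
import subprocess

def mount_nfs(remote, local, minorversion):
    opts = ("soft,nosharecache,timeo=600,retrans=6,"
            "nfsvers=4,minorversion=%d" % minorversion)
    subprocess.check_call(["mount", "-t", "nfs", "-o", opts, remote, local])

# Succeeded (NFS 4.1):
# mount_nfs("192.168.201.3:/exports/nfs/share2",
#           "/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share2", 1)
# Got stuck (NFS 4.2):
# mount_nfs("192.168.201.3:/exports/nfs/share1",
#           "/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1", 2)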
Y.




>>
>> >
>> > 3. Also, there are still the infamous "unable to update response" issues.
>> >
>>
>> When we see a timeout on a call, our default behavior is to reconnect,
>> at which point we clean pending messages.
>> As a result, when we reconnect and then receive a response to a message
>> sent before the disconnect,
>> we say it is unknown to us.
>
>
> But the example I've given was earlier than the storage issue?

The specific message that you refer to was a ping command, but it timed
out (3 secs) before the response arrived, so it was removed from tracking.
When the response finally arrived, it was no longer tracked.

We may want to increase the timeout to give the response more time to arrive.
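
To make that concrete, the tracking on the client side behaves roughly like this (a sketch of the idea only, not the actual vdsm-jsonrpc-java implementation): each call id is tracked with a deadline, expired ids are dropped, and a response arriving for an id that is no longer tracked produces exactly the "Not able to update response" error seen below.

# Sketch of client-side call tracking - an illustration of why
# "Not able to update response" shows up, not the actual
# vdsm-jsonrpc-java code.
import logging
import threading
import time

log = logging.getLogger("JsonRpcClient")

class CallTracker(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}   # request id -> deadline (absolute time)

    def track(self, request_id, timeout=3):
        # Host.ping is tracked with a short timeout; heavier verbs like
        # connectStorageServer get a longer one.
        with self._lock:
            self._pending[request_id] = time.time() + timeout

    def expire(self):
        # Called periodically: drop calls whose deadline has passed. The
        # caller gets a timeout error and the id stops being tracked.
        now = time.time()
        with self._lock:
            for request_id, deadline in list(self._pending.items()):
                if deadline <= now:
                    del self._pending[request_id]

    def on_response(self, request_id, result):
        with self._lock:
            tracked = self._pending.pop(request_id, None)
        if tracked is None:
            # The response arrived after the call expired (or after a
            # reconnect cleaned pending messages), so there is nothing
            # to deliver it to.
            log.error('Not able to update response for "%s"', request_id)
            return
        # ... deliver result to the waiting caller ...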

> Y.
>
>>
>>
>> >
>> > {"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"}�
>> > 2017-04-19 18:54:27,843-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient]
>> > (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
>> > destination:jms.topic.vdsm_requests
>> > content-length:94
>> > ovirtCorrelationId:62d198cc
>> > reply-to:jms.topic.vdsm_responses
>> >
>> > <JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method:
>> > Host.ping, params: {}>
>> > 2017-04-19 18:54:27,885-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message]
>> > (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
>> > ovirtCorrelationId:1f9aac13
>> > destination:jms.topic.vdsm_requests
>> > reply-to:jms.topic.vdsm_responses
>> > content-length:94
>> >
>> > ...
>> >
>> > {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0",
>> > "result":
>> > true}
>> > 2017-04-19 18:54:32,132-04 DEBUG
>> > [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker)
>> > []
>> > Message received: {"jsonrpc": "2.0", "id":
>> > "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
>> > 2017-04-19 18:54:32,133-04 ERROR
>> > [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not
>> > able
>> > to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
>> >
>> >
>> > It would be nice to understand why.
>> >
>> >
>> >
>> > 4. Lastly, MOM is not running. Why?
>> >
>> >
>> > Please open a bug with the details from item #2 above.
>> > Y.
>> >
>> >
>> > [1]
>> >
>> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
>> > [2]
>> >
>> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>> > [3]
>> >
>> > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <gshinar@redhat.com> wrote:
>> >>
>> >> Test failed: add_secondary_storage_domains
>> >> Link to suspected patches:
>> >> Link to Job:
>> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
>> >> Link to all logs:
>> >>
>> >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
>> >>
>> >>
>> >> Error seems to be:
>> >> 2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2)
>> >> [storage.TaskManager.Task]
>> >> (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error
>> >> (task:871)
>> >> Traceback (most recent call last):
>> >>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line
>> >> 878,
>> >> in _run
>> >>     return fn(*args, **kargs)
>> >>   File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in
>> >> wrapper
>> >>     res = f(*args, **kwargs)
>> >>   File "/usr/share/vdsm/storage/hsm.py", line 2709, in
>> >> getStorageDomainInfo
>> >>     dom = self.validateSdUUID(sdUUID)
>> >>   File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
>> >>     sdDom = sdCache.produce(sdUUID=sdUUID)
>> >>   File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
>> >>     domain.getRealDomain()
>> >>   File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
>> >>     return self._cache._realProduce(self._sdUUID)
>> >>   File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
>> >>     domain = self._findDomain(sdUUID)
>> >>   File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
>> >>     return findMethod(sdUUID)
>> >>   File "/usr/share/vdsm/storage/sdc.py", line 178, in
>> >> _findUnfetchedDomain
>> >>     raise se.StorageDomainDoesNotExist(sdUUID)
>> >> StorageDomainDoesNotExist: Storage domain does not exist:
>> >> (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
>> >> 2017-04-19 18:58:58,777-0400 INFO  (jsonrpc/2)
>> >> [storage.TaskManager.Task]
>> >> (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is
>> >> aborted:
>> >> 'Storage domain does not exist' - code 358 (task:1176)
>> >> 2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher]
>> >> {'status': {'message': "Storage domain does not exist:
>> >> (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}}
>> >> (dispatcher:78)
>> >>
>> >
>> >
>> >
>
>
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel