
On Thu, Apr 20, 2017 at 1:32 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Apr 20, 2017 at 13:05, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
No, that's not the issue. I've seen it happen a few times.
1. It always happens with the ISO domain (which we don't use anyway in o-s-t).
2. Apparently, only one host is asking for a mount:
authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
(from /var/log/messages of the NFS server)
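A quick way to confirm which hosts asked for that export is to scan the NFS server's log for the rpc.mountd lines quoted above. A minimal Python sketch, assuming that line format; the helper name and default paths are illustrative only:

import re

# Matches rpc.mountd lines like:
#   authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
MOUNT_RE = re.compile(
    r"authenticated mount request from (?P<client>[\d.]+):\d+ for (?P<export>\S+)")

def mount_clients(messages_path="/var/log/messages", export="/exports/nfs/iso"):
    """Return the set of client IPs that requested the given export."""
    clients = set()
    with open(messages_path) as f:
        for line in f:
            m = MOUNT_RE.search(line)
            if m and m.group("export") == export:
                clients.add(m.group("client"))
    return clients

# On the NFS server above this returns only {'192.168.201.4'}, i.e. host1;
# host0 never asked for the mount.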
And indeed, you can see in [1] that host1 made the request and all is well on it.
However, there are connection issues with host0 which cause a timeout in connectStorageServer(). From [2]:
2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
2017-04-19 18:58:58,475-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
2017-04-19 18:58:58,475-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-37) [755b908a] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand' return value 'ServerConnectionStatusReturn:{status='Status [code=5022, message=Message timeout which can be caused by communication issues]'}
I wonder why, but in /var/log/messages [3] I'm seeing:
Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=60 at 0x2f44310> task#=9 at 0x2ac11d0> ...
I see the following sequence:
The message is sent:
2017-04-19 18:55:58,020-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
destination:jms.topic.vdsm_requests
content-length:381
ovirtCorrelationId:755b908a
reply-to:jms.topic.vdsm_responses

<JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method: StoragePool.connectStorageServer, params: {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1, connectionParams=[{password=, protocol_version=4.2, port=, iqn=, connection=192.168.201.3:/exports/nfs/share1, id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
There is no response within the specified amount of time and we time out (the request was sent at 18:55:58 and the timeout error arrives at 18:58:58, three minutes later):
2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
As Yaniv pointed out, here is why we never get the response:
Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>
This means the connection attempt was stuck for 180 seconds. We need to check whether the mount was stuck, or whether there is some issue in supervdsm running it.
This is a new warning, introduced recently; before it, a stuck worker was simply hidden.
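For readers unfamiliar with the warning, here is a rough Python sketch of the watchdog behaviour the log lines above suggest (illustrative only, not the actual vdsm Executor code): a periodic check logs a WARN for every worker whose task has run past its timeout, which is why the same task shows up first with duration=60 and later with duration=180.

import logging
import time

log = logging.getLogger("Executor")

class Worker:
    # Illustrative only: tracks what the worker is running and since when.
    def __init__(self, name, timeout=60):
        self.name = name
        self.timeout = timeout
        self.task = None
        self.started = None

    def run(self, task, func):
        self.task, self.started = task, time.time()
        try:
            func()
        finally:
            self.task = None

def watchdog(workers, interval=60):
    # Periodic check: warn about every worker whose task exceeded its timeout.
    while True:
        time.sleep(interval)
        for w in workers:
            if w.task is not None:
                duration = time.time() - w.started
                if duration >= w.timeout:
                    log.warning(
                        "Worker blocked: <Worker name=%s running %r timeout=%d, duration=%d>",
                        w.name, w.task, w.timeout, int(duration))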
Do we check that the NFS server is up before we start the tests?
We do not check, but it is up - it's installed before anything else, and it happens to be on the Engine, so it has several minutes to be up and ready. Moreover, a command to the same NFS server, with the same params, only a different mount point, succeeded a second earlier:

18:55:56,300::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share2', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share2') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=1', 'timeout': None, 'cgroup': None}
...
18:55:56,338::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) return mount with None

But:

18:55:58,076::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share1', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2', 'timeout': None, 'cgroup': None}

got stuck. See https://bugzilla.redhat.com/show_bug.cgi?id=1443913
Y.
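To tell whether the mount itself hangs or supervdsm is at fault, one could retry the same mount by hand with a hard cap on how long to wait. A minimal Python 3 sketch, reusing the exact source, target and options from the supervdsm log above (the timeout value and helper name are arbitrary, for illustration only):

import subprocess

SOURCE = "192.168.201.3:/exports/nfs/share1"
TARGET = "/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1"
OPTS = "soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2"

def try_mount(timeout=60):
    # Run the suspect mount; a hang shows up as TimeoutExpired instead of
    # silently blocking the worker, as happened here for share1.
    try:
        subprocess.check_call(
            ["mount", "-t", "nfs", "-o", OPTS, SOURCE, TARGET],
            timeout=timeout)
        print("mount returned")
    except subprocess.TimeoutExpired:
        print("mount did not return within %d seconds - it is stuck" % timeout)

# try_mount()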
3. Also, there are still the infamous "unable to update response" issues.
When we see a timeout on a call, our default behavior is to reconnect, and as part of reconnecting we clean the pending messages. As a result, when we reconnect and then receive a response to a message sent before the disconnect, we say it is unknown to us.
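To make that behaviour concrete, here is a simplified Python sketch of what is described (illustrative only, not the actual client code, which is Java): pending requests are tracked by id, a reconnect drops the whole map, and a response that arrives afterwards no longer matches anything.

import logging

log = logging.getLogger("JsonRpcClient")

class PendingRequests:
    def __init__(self):
        self._by_id = {}  # request id -> callback expecting the response

    def track(self, request_id, callback):
        self._by_id[request_id] = callback

    def clear_on_reconnect(self):
        # Called when a timeout triggers a reconnect: everything in flight is forgotten.
        self._by_id.clear()

    def on_response(self, request_id, response):
        callback = self._by_id.pop(request_id, None)
        if callback is None:
            # The response belongs to a request sent before the reconnect.
            log.error('Not able to update response for "%s"', request_id)
            return
        callback(response)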
But the example I've given was earlier than the storage issue?
The specific message that you refer to was a ping command, but it timed out (3 secs) before the response arrived and was removed from tracking. When the response finally arrived, it was no longer tracked.
We may want to increase the timeout to give it more time for arrival.
Y.
{"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"
2017-04-19 18:54:27,843-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
destination:jms.topic.vdsm_requests
content-length:94
ovirtCorrelationId:62d198cc
reply-to:jms.topic.vdsm_responses

<JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method: Host.ping, params: {}>

2017-04-19 18:54:27,885-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
ovirtCorrelationId:1f9aac13
destination:jms.topic.vdsm_requests
reply-to:jms.topic.vdsm_responses
content-length:94

...

{"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}

2017-04-19 18:54:32,132-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
2017-04-19 18:54:32,133-04 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
It would be nice to understand why.
4. Lastly, MOM is not running. Why?
Please open a bug with the details from item #2 above. Y.
[1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
[2] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
[3] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <gshinar@redhat.com> wrote:
Test failed: add_secondary_storage_domains
Link to suspected patches:
Link to Job: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
Link to all logs: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
Error seems to be:

2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error (task:871)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2709, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
    sdDom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
    return findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 178, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
2017-04-19 18:58:58,777-0400 INFO (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is aborted: 'Storage domain does not exist' - code 358 (task:1176)
2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher] {'status': {'message': "Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}} (dispatcher:78)
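For context, the traceback walks through vdsm's storage domain cache. Schematically (a simplified sketch, not the real sdc.py code), the lookup looks like the following: produce() returns a lazy proxy, the real lookup only happens on first use, and a domain that no lookup method can fetch (for example if the storage server connection never succeeded on this host, as seen above for host0) surfaces as StorageDomainDoesNotExist at use time.

class StorageDomainDoesNotExist(Exception):
    pass

class DomainProxy:
    # Lazy proxy returned by produce(); the real lookup happens on first use.
    def __init__(self, cache, sd_uuid):
        self._cache = cache
        self._sd_uuid = sd_uuid

    def getRealDomain(self):
        return self._cache._realProduce(self._sd_uuid)

class DomainCache:
    def __init__(self, find_methods):
        self._cache = {}
        self._find_methods = find_methods  # per-storage-type lookup callables

    def produce(self, sd_uuid):
        return DomainProxy(self, sd_uuid)

    def _realProduce(self, sd_uuid):
        if sd_uuid not in self._cache:
            self._cache[sd_uuid] = self._findDomain(sd_uuid)
        return self._cache[sd_uuid]

    def _findDomain(self, sd_uuid):
        for find in self._find_methods:
            domain = find(sd_uuid)
            if domain is not None:
                return domain
        # No lookup method could fetch the domain.
        raise StorageDomainDoesNotExist(sd_uuid)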
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel