
On Thu, Apr 20, 2017 at 1:32 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Apr 20, 2017 at 13:05, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
No, that's not the issue. I've seen it happen a few times.
1. It always happens with the ISO domain (which we don't use anyway in o-s-t).
2. Apparently, only one host is asking for a mount:
authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
(from /var/log/messages of the NFS server)
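A quick way to confirm which hosts asked for that export is to scan the NFS server's log for the rpc.mountd lines quoted above. A minimal Python sketch, assuming that line format; the helper name and default paths are illustrative only:

import re

# Matches rpc.mountd lines like:
#   authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
MOUNT_RE = re.compile(
    r"authenticated mount request from (?P<client>[\d.]+):\d+ for (?P<export>\S+)")

def mount_clients(messages_path="/var/log/messages", export="/exports/nfs/iso"):
    """Return the set of client IPs that requested the given export."""
    clients = set()
    with open(messages_path) as f:
        for line in f:
            m = MOUNT_RE.search(line)
            if m and m.group("export") == export:
                clients.add(m.group("client"))
    return clients

# On the NFS server above this returns only {'192.168.201.4'}, i.e. host1;
# host0 never asked for the mount.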
And indeed, you can see in [1] that host1 made the request and all is well on it.
However, there are connection issues with host0 which cause a timeout in connectStorageServer(). From [2]:
2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
2017-04-19 18:58:58,475-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
2017-04-19 18:58:58,475-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-37) [755b908a] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand' return value 'ServerConnectionStatusReturn:{status='Status [code=5022, message=Message timeout which can be caused by communication issues]'}
I wonder why, but in /var/log/messages [3] I'm seeing:
Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=60 at 0x2f44310> task#=9 at 0x2ac11d0> ...
I see the following sequence:
The message is sent:
2017-04-19 18:55:58,020-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
destination:jms.topic.vdsm_requests
content-length:381
ovirtCorrelationId:755b908a
reply-to:jms.topic.vdsm_responses

<JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method: StoragePool.connectStorageServer, params: {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1, connectionParams=[{password=, protocol_version=4.2, port=, iqn=, connection=192.168.201.3:/exports/nfs/share1, id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
There is no response within the specified amount of time and we time out (the request was sent at 18:55:58 and the timeout error arrives at 18:58:58, three minutes later):
2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
As Yaniv pointed out, here is why we never get the response:
Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>
This means the connection attempt was stuck for 180 seconds. We need to check whether the mount was stuck, or whether there is some issue in supervdsm running it.
This is a new warning, introduced recently; before it, a stuck worker was simply hidden.
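For readers unfamiliar with the warning, here is a rough Python sketch of the watchdog behaviour the log lines above suggest (illustrative only, not the actual vdsm Executor code): a periodic check logs a WARN for every worker whose task has run past its timeout, which is why the same task shows up first with duration=60 and later with duration=180.

import logging
import time

log = logging.getLogger("Executor")

class Worker:
    # Illustrative only: tracks what the worker is running and since when.
    def __init__(self, name, timeout=60):
        self.name = name
        self.timeout = timeout
        self.task = None
        self.started = None

    def run(self, task, func):
        self.task, self.started = task, time.time()
        try:
            func()
        finally:
            self.task = None

def watchdog(workers, interval=60):
    # Periodic check: warn about every worker whose task exceeded its timeout.
    while True:
        time.sleep(interval)
        for w in workers:
            if w.task is not None:
                duration = time.time() - w.started
                if duration >= w.timeout:
                    log.warning(
                        "Worker blocked: <Worker name=%s running %r timeout=%d, duration=%d>",
                        w.name, w.task, w.timeout, int(duration))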
Do we check that the NFS server is up before we start the tests?
We do not check, but it is up - it's installed before anything else, and it happens to be on the Engine, so it has several minutes to be up and ready. Moreover, a command to the same NFS server, with the same params, only a different mount point, succeeded a second earlier:

18:55:56,300::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share2', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share2') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=1', 'timeout': None, 'cgroup': None}
...
18:55:56,338::supervdsm_server::99::SuperVdsm.ServerCallback::(wrapper) return mount with None

But:

18:55:58,076::supervdsm_server::92::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.3:/exports/nfs/share1', u'/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2', 'timeout': None, 'cgroup': None}

got stuck. See https://bugzilla.redhat.com/show_bug.cgi?id=1443913
Y.
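To tell whether the mount itself hangs or supervdsm is at fault, one could retry the same mount by hand with a hard cap on how long to wait. A minimal Python 3 sketch, reusing the exact source, target and options from the supervdsm log above (the timeout value and helper name are arbitrary, for illustration only):

import subprocess

SOURCE = "192.168.201.3:/exports/nfs/share1"
TARGET = "/rhev/data-center/mnt/192.168.201.3:_exports_nfs_share1"
OPTS = "soft,nosharecache,timeo=600,retrans=6,nfsvers=4,minorversion=2"

def try_mount(timeout=60):
    # Run the suspect mount; a hang shows up as TimeoutExpired instead of
    # silently blocking the worker, as happened here for share1.
    try:
        subprocess.check_call(
            ["mount", "-t", "nfs", "-o", OPTS, SOURCE, TARGET],
            timeout=timeout)
        print("mount returned")
    except subprocess.TimeoutExpired:
        print("mount did not return within %d seconds - it is stuck" % timeout)

# try_mount()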
3. Also, there are still the infamous "unable to update response" issues.
When we see a timeout on a call, our default behavior is to reconnect, and as part of reconnecting we clean the pending messages. As a result, when we reconnect and then receive a response to a message sent before the disconnect, we say it is unknown to us.
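To make that behaviour concrete, here is a simplified Python sketch of what is described (illustrative only, not the actual client code, which is Java): pending requests are tracked by id, a reconnect drops the whole map, and a response that arrives afterwards no longer matches anything.

import logging

log = logging.getLogger("JsonRpcClient")

class PendingRequests:
    def __init__(self):
        self._by_id = {}  # request id -> callback expecting the response

    def track(self, request_id, callback):
        self._by_id[request_id] = callback

    def clear_on_reconnect(self):
        # Called when a timeout triggers a reconnect: everything in flight is forgotten.
        self._by_id.clear()

    def on_response(self, request_id, response):
        callback = self._by_id.pop(request_id, None)
        if callback is None:
            # The response belongs to a request sent before the reconnect.
            log.error('Not able to update response for "%s"', request_id)
            return
        callback(response)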
But the example I've given was earlier than the storage issue?
The specific message that you refer to was a ping command, but it timed out (3 secs) before the response arrived and was removed from tracking. When the response finally arrived, it was no longer tracked.
We may want to increase the timeout to give it more time for arrival.
Y.
{"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"
2017-04-19 18:54:27,843-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
destination:jms.topic.vdsm_requests
content-length:94
ovirtCorrelationId:62d198cc
reply-to:jms.topic.vdsm_responses

<JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method: Host.ping, params: {}>

2017-04-19 18:54:27,885-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
ovirtCorrelationId:1f9aac13
destination:jms.topic.vdsm_requests
reply-to:jms.topic.vdsm_responses
content-length:94

...

{"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}

2017-04-19 18:54:32,132-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
2017-04-19 18:54:32,133-04 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
It would be nice to understand why.
4. Lastly, MOM is not running. Why?
Please open a bug with the details from item #2 above. Y.
[1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
[2] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
[3] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <gshinar@redhat.com> wrote:
Test failed: add_secondary_storage_domains
Link to suspected patches:
Link to Job: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
Link to all logs: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
Error seems to be:

2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error (task:871)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2709, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
    sdDom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
    return findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 178, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
2017-04-19 18:58:58,777-0400 INFO (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is aborted: 'Storage domain does not exist' - code 358 (task:1176)
2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher] {'status': {'message': "Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}} (dispatcher:78)
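For context, the traceback walks through vdsm's storage domain cache. Schematically (a simplified sketch, not the real sdc.py code), the lookup looks like the following: produce() returns a lazy proxy, the real lookup only happens on first use, and a domain that no lookup method can fetch (for example if the storage server connection never succeeded on this host, as seen above for host0) surfaces as StorageDomainDoesNotExist at use time.

class StorageDomainDoesNotExist(Exception):
    pass

class DomainProxy:
    # Lazy proxy returned by produce(); the real lookup happens on first use.
    def __init__(self, cache, sd_uuid):
        self._cache = cache
        self._sd_uuid = sd_uuid

    def getRealDomain(self):
        return self._cache._realProduce(self._sd_uuid)

class DomainCache:
    def __init__(self, find_methods):
        self._cache = {}
        self._find_methods = find_methods  # per-storage-type lookup callables

    def produce(self, sd_uuid):
        return DomainProxy(self, sd_uuid)

    def _realProduce(self, sd_uuid):
        if sd_uuid not in self._cache:
            self._cache[sd_uuid] = self._findDomain(sd_uuid)
        return self._cache[sd_uuid]

    def _findDomain(self, sd_uuid):
        for find in self._find_methods:
            domain = find(sd_uuid)
            if domain is not None:
                return domain
        # No lookup method could fetch the domain.
        raise StorageDomainDoesNotExist(sd_uuid)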
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel