On Thu, Apr 20, 2017 at 13:05, Piotr Kliczewski <piotr.kliczewski@gmail.com> wrote:
> On Thu, Apr 20, 2017 at 11:49 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
>>
>>
>> On Thu, Apr 20, 2017 at 11:55 AM, Piotr Kliczewski
>> <piotr.kliczewski@gmail.com> wrote:
>>>
>>> On Thu, Apr 20, 2017 at 10:32 AM, Yaniv Kaul <ykaul@redhat.com> wrote:
>>>> No, that's not the issue.
>>>> I've seen it happen a few times.
>>>>
>>>> 1. It's always with the ISO domain (which we don't use anyway in o-s-t).
>>>> 2. Apparently, only one host is asking for a mount:
>>>> authenticated mount request from 192.168.201.4:713 for /exports/nfs/iso (/exports/nfs/iso)
>>>>
>>>> (/var/log/messages of the NFS server)
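(As an aside, a pre-test sanity check that every host can actually see the export might look like the sketch below. This is a hypothetical helper, not part of o-s-t; it assumes showmount from nfs-utils is installed, and the server address and export paths are simply the ones quoted in this thread.)

    import subprocess

    NFS_SERVER = "192.168.201.3"  # storage server address quoted in this thread
    EXPORTS = ["/exports/nfs/share1", "/exports/nfs/iso"]

    def exported_paths(server):
        # "showmount -e <server>" prints "Export list for <server>:" followed
        # by one "<path> <clients>" line per export.
        out = subprocess.check_output(["showmount", "-e", server], text=True)
        return [line.split()[0] for line in out.splitlines()[1:] if line.strip()]

    def check_exports():
        # Run this on every host before starting the suite.
        available = exported_paths(NFS_SERVER)
        missing = [p for p in EXPORTS if p not in available]
        if missing:
            raise RuntimeError("exports not visible on %s: %s" % (NFS_SERVER, missing))

    if __name__ == "__main__":
        check_exports()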
>>>>
>>>> And indeed, you can see in [1] that host1 made the request and all is
>>>> well on it.
>>>>
>>>> However, there are connection issues with host0 which cause a timeout to
>>>> connectStorageServer():
>>>> From [2]:
>>>>
>>>> 2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received:
>>>> {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
>>>>
>>>> 2017-04-19 18:58:58,475-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-37) [755b908a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM lago-basic-suite-master-host0 command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
>>>> 2017-04-19 18:58:58,475-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-7-thread-37) [755b908a] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand' return value '
>>>> ServerConnectionStatusReturn:{status='Status [code=5022, message=Message timeout which can be caused by communication issues]'}
>>>>
>>>>
>>>> I wonder why, but in /var/log/messages [3] I'm seeing:
>>>> Apr 19 18:56:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=60 at 0x2f44310> task#=9 at 0x2ac11d0>
>>>> ...
>>>>
>>>
>>> I see the following sequence:
>>>
>>> The message is sent:
>>>
>>> 2017-04-19 18:55:58,020-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-37) [755b908a] Message sent: SEND
>>> destination:jms.topic.vdsm_requests
>>> content-length:381
>>> ovirtCorrelationId:755b908a
>>> reply-to:jms.topic.vdsm_responses
>>>
>>> <JsonRpcRequest id: "057da9c2-1e67-4c2f-9511-7d9de250386b", method: StoragePool.connectStorageServer, params: {storagepoolID=00000000-0000-0000-0000-000000000000, domainType=1, connectionParams=[{password=, protocol_version=4.2, port=, iqn=, connection=192.168.201.3:/exports/nfs/share1, id=4ca8fc84-d872-4a7f-907f-9445bda7b6d1, user=, tpgt=1}]}>
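(For readers unfamiliar with the transport: the SEND block above is a STOMP frame carrying a JSON-RPC request, terminated on the wire by a NUL byte. A minimal sketch of how such a frame is assembled — hypothetical code, not vdsm-jsonrpc-java's actual implementation; headers and destinations are taken from the log above:)

    import json

    def build_send_frame(destination, reply_to, correlation_id, payload):
        # STOMP frame: command line, header lines, blank line, body, NUL byte.
        body = json.dumps(payload)
        headers = [
            "destination:%s" % destination,
            "content-length:%d" % len(body.encode("utf-8")),
            "ovirtCorrelationId:%s" % correlation_id,  # oVirt-specific header seen in the log
            "reply-to:%s" % reply_to,
        ]
        return "SEND\n" + "\n".join(headers) + "\n\n" + body + "\x00"

    frame = build_send_frame(
        "jms.topic.vdsm_requests", "jms.topic.vdsm_responses", "755b908a",
        {"jsonrpc": "2.0", "id": "057da9c2-1e67-4c2f-9511-7d9de250386b",
         "method": "StoragePool.connectStorageServer", "params": {}},  # params elided
    )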
>>>
>>> There is no response within the specified amount of time, so we time out:
>>>
>>> 2017-04-19 18:58:58,465-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received:
>>> {"jsonrpc":"2.0","error":{"code":"lago-basic-suite-master-host0:192912448","message":"Vds timeout occured"},"id":null}
>>>
>>> As Yaniv pointed out, here is why we never get the response:
>>>
>>> Apr 19 18:58:58 lago-basic-suite-master-host0 journal: vdsm Executor WARN Worker blocked: <Worker name=jsonrpc/3 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'id': u'4ca8fc84-d872-4a7f-907f-9445bda7b6d1', u'connection': u'192.168.201.3:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.2', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'057da9c2-1e67-4c2f-9511-7d9de250386b'} at 0x2f44110> timeout=60, duration=180 at 0x2f44310> task#=9 at 0x2ac11d0>

This means the connection attempt was stuck for 180 seconds. We need to check whether the mount itself was stuck, or whether there is some issue in supervdsm running it.

This is a new warning, introduced recently; before it, a stuck worker was simply hidden.

Do we check that the NFS server is up before we start the tests?
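(To illustrate the new warning mentioned above: a rough sketch of how an executor watchdog can surface a blocked worker — hypothetical code, not VDSM's actual executor:)

    import logging
    import threading
    import time

    class Worker:
        def __init__(self, name, timeout):
            self.name = name
            self.timeout = timeout      # seconds a task may run before warning
            self.task = None            # description of the task currently running
            self.task_started = None    # time.monotonic() when the task began

    def watchdog(workers, interval=30):
        # Periodically scan all workers and warn about any task that has been
        # running longer than its timeout -- the "Worker blocked" WARN above.
        while True:
            time.sleep(interval)
            now = time.monotonic()
            for w in workers:
                if w.task_started is None:
                    continue
                duration = now - w.task_started
                if duration > w.timeout:
                    logging.warning(
                        "Worker blocked: <Worker name=%s running %r timeout=%d, duration=%d>",
                        w.name, w.task, w.timeout, duration)

    # threading.Thread(target=watchdog, args=(workers,), daemon=True).start()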
>>>
>>>>
>>>> 3. Also, there are still the infamous "unable to update response" issues.
>>>>
>>>
>>> When we see a timeout on a call, our default behavior is to reconnect,
>>> and as part of that we clean the pending messages. As a result, when we
>>> reconnect and then receive a response to a message sent before the
>>> disconnect, we say it is unknown to us.
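(A minimal sketch of that behavior, with hypothetical names rather than vdsm-jsonrpc-java's actual classes: pending calls are tracked by request id, the map is dropped on reconnect, so any response arriving afterwards no longer matches anything and is reported as unknown:)

    import logging

    class ResponseTracker:
        def __init__(self):
            self._pending = {}  # request id -> callback awaiting the response

        def register(self, request_id, callback):
            self._pending[request_id] = callback

        def on_reconnect(self):
            # Default policy: drop everything sent before the disconnect.
            self._pending.clear()

        def on_response(self, request_id, result):
            callback = self._pending.pop(request_id, None)
            if callback is None:
                # Matches the engine log: the response is unknown to us now.
                logging.error('Not able to update response for "%s"', request_id)
                return
            callback(result)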
>>
>>
>> But the example I've given was earlier than the storage issue?
>
> The specific message you refer to was a ping command, but it timed out
> (3 secs) before the response arrived, so it was removed from tracking.
> When the response finally arrived, it was no longer tracked.
>
> We may want to increase the timeout to give the response more time to arrive.
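(The same idea for the per-call timeout, as a hypothetical sketch: each call gets a deadline, a reaper drops expired entries, and a response arriving later finds nothing to update. The timestamps in the log quoted below fit this: the ping was sent at 18:54:27,843 and its response received at 18:54:32,132, roughly 4.3 s against a 3 s timeout:)

    import time

    PING_TIMEOUT = 3.0  # seconds; the value to raise if responses need more time

    pending = {}  # request id -> deadline, in time.monotonic() seconds

    def track(request_id, timeout=PING_TIMEOUT):
        pending[request_id] = time.monotonic() + timeout

    def reap_expired():
        # Called periodically: a call whose deadline passed is failed locally
        # and forgotten.
        now = time.monotonic()
        for request_id, deadline in list(pending.items()):
            if now >= deadline:
                del pending[request_id]

    def on_response(request_id):
        if pending.pop(request_id, None) is None:
            # e.g. the ping response that arrived ~4.3 s after it was sent
            print('Not able to update response for "%s"' % request_id)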
>
>> Y.
>>
>>>
>>>
>>>>
>>>> {"jsonrpc":"2.0","method":"Host.ping","params":{},"id":"7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"}
>>>> 2017-04-19 18:54:27,843-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.StompCommonClient] (org.ovirt.thread.pool-7-thread-15) [62d198cc] Message sent: SEND
>>>> destination:jms.topic.vdsm_requests
>>>> content-length:94
>>>> ovirtCorrelationId:62d198cc
>>>> reply-to:jms.topic.vdsm_responses
>>>>
>>>> <JsonRpcRequest id: "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", method: Host.ping, params: {}>
>>>> 2017-04-19 18:54:27,885-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.reactors.stomp.impl.Message] (org.ovirt.thread.pool-7-thread-16) [1f9aac13] SEND
>>>> ovirtCorrelationId:1f9aac13
>>>> destination:jms.topic.vdsm_requests
>>>> reply-to:jms.topic.vdsm_responses
>>>> content-length:94
>>>>
>>>> ...
>>>>
>>>> {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
>>>> 2017-04-19 18:54:32,132-04 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0", "result": true}
>>>> 2017-04-19 18:54:32,133-04 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "7cb6052f-c732-4f7c-bd2d-e48c2ae1f5e0"
>>>>
>>>>
>>>> It would be nice to understand why.
>>>>
>>>>
>>>>
>>>> 4. Lastly, MOM is not running. Why?
>>>>
>>>>
>>>> Please open a bug with the details from item #2 above.
>>>> Y.
>>>>
>>>>
>>>> [1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host1/_var_log/vdsm/supervdsm.log
>>>> [2] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>> [3] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host0/_var_log/messages
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 20, 2017 at 9:27 AM, Gil Shinar <gshinar@redhat.com> wrote:
>>>>>
>>>>> Test failed: add_secondary_storage_domains
>>>>> Link to suspected patches:
>>>>> Link to Job: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403
>>>>> Link to all logs: http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/6403/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-002_bootstrap.py
>>>>>
>>>>>
>>>>> Error seems to be:
>>>>> 2017-04-19 18:58:58,774-0400 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') Unexpected error (task:871)
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 878, in _run
>>>>>     return fn(*args, **kargs)
>>>>>   File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 52, in wrapper
>>>>>     res = f(*args, **kwargs)
>>>>>   File "/usr/share/vdsm/storage/hsm.py", line 2709, in getStorageDomainInfo
>>>>>     dom = self.validateSdUUID(sdUUID)
>>>>>   File "/usr/share/vdsm/storage/hsm.py", line 298, in validateSdUUID
>>>>>     sdDom = sdCache.produce(sdUUID=sdUUID)
>>>>>   File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
>>>>>     domain.getRealDomain()
>>>>>   File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
>>>>>     return self._cache._realProduce(self._sdUUID)
>>>>>   File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
>>>>>     domain = self._findDomain(sdUUID)
>>>>>   File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
>>>>>     return findMethod(sdUUID)
>>>>>   File "/usr/share/vdsm/storage/sdc.py", line 178, in _findUnfetchedDomain
>>>>>     raise se.StorageDomainDoesNotExist(sdUUID)
>>>>> StorageDomainDoesNotExist: Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)
>>>>> 2017-04-19 18:58:58,777-0400 INFO (jsonrpc/2) [storage.TaskManager.Task] (Task='8f9699ed-cc2f-434b-aa1e-b3c8ff30324a') aborting: Task is aborted: 'Storage domain does not exist' - code 358 (task:1176)
>>>>> 2017-04-19 18:58:58,777-0400 ERROR (jsonrpc/2) [storage.Dispatcher] {'status': {'message': "Storage domain does not exist: (u'ac3bbc93-26ba-4ea8-8e76-c5b761f01931',)", 'code': 358}} (dispatcher:78)
>>>>>
>>
>>
>
> _______________________________________________
> Devel mailing list
> Devel@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel