[ovirt-users] ovirt with glusterfs - big test - unwanted results

Sahina Bose sabose at redhat.com
Mon Apr 11 16:34:46 UTC 2016


So I looked at the vdsm logs, and since there were multiple tests done it 
was difficult to isolate which error to track down. You mentioned the test 
ran between 14:00-14:30 CET - but the gluster logs that were attached end 
at 11:29 UTC.

Tracking down the errors from when the master domain (gluster volume 
1HP12-R3A1P1) went inactive, for the time period where the corresponding 
gluster volume log was available - they all seem to correspond to the 
gluster volume's client-quorum not being met.
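
For a replica 3 volume with an arbiter, client-quorum is, as far as I know, 
enforced as 'auto' - i.e. at least 2 of the 3 bricks (the arbiter counts as 
one) must be reachable from the client for writes to proceed. A rough sketch 
for confirming the quorum settings on the volume, assuming the gluster CLI is 
available on the host and a release that supports 'volume get':

import subprocess

VOLUME = "1HP12-R3A1P1"

# Print the client-quorum related options for the volume.
for option in ("cluster.quorum-type", "cluster.quorum-count"):
    output = subprocess.check_output(["gluster", "volume", "get", VOLUME, option])
    print(output.decode().strip())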

Can you confirm whether this was the test you performed, or provide logs 
from the correct time period? (Both vdsm and gluster mount logs are needed, 
from the hypervisors where the master domain is mounted.)
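
If it helps, here is a minimal sketch for gathering those logs on each 
hypervisor - the paths below assume the default oVirt/gluster log locations, 
so adjust them if your setup differs:

import glob
import tarfile

# Default locations for vdsm logs and the gluster fuse-mount logs
# (mount-point path with '/' replaced by '-'); adjust if needed.
patterns = [
    "/var/log/vdsm/vdsm.log*",
    "/var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log*",
]

with tarfile.open("ovirt-gluster-logs.tar.gz", "w:gz") as archive:
    for pattern in patterns:
        for path in glob.glob(pattern):
            archive.add(path)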

For master domain:
On 1hp1:
vdsm.log
Thread-35::ERROR::2016-03-31 13:21:27,225::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
...
   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 454, in statvfs
     resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 427, in _sendCommand
     raise OSError(errcode, errstr)
OSError: [Errno 107] Transport endpoint is not connected
Thread-35::INFO::2016-03-31 13:21:27,267::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID
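
The error above comes from the domain monitor's statvfs check on the domain 
mount point. Outside vdsm, the same symptom can be reproduced with a minimal 
check like the one below - the mount path is an assumption based on the usual 
/rhev/data-center/mnt/glusterSD layout, so substitute your own:

import errno
import os

# Hypothetical mount path for the master domain's gluster volume.
MOUNT = "/rhev/data-center/mnt/glusterSD/1hp1:_1HP12-R3A1P1"

try:
    os.statvfs(MOUNT)
    print("mount answers statvfs - looks connected")
except OSError as e:
    if e.errno == errno.ENOTCONN:
        # Errno 107, as in the traceback above: the fuse client has
        # lost its connection to the bricks.
        print("Transport endpoint is not connected")
    else:
        raise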

-- And I see a corresponding entry in the gluster mount log:
[2016-03-31 11:21:16.027090] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met

jsonrpc.Executor/0::DEBUG::2016-03-31 13:23:34,110::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'GlusterVolume.status' in bridge with
{'volumeStatus': {
  'bricks': [
    {'status': 'OFFLINE', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'pid': '-1', 'rdma_port': 'N/A', 'brick': '1hp1:/STORAGES/P1/GFS', 'port': 'N/A'},
    {'status': 'OFFLINE', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f', 'pid': '-1', 'rdma_port': 'N/A', 'brick': '1hp2:/STORAGES/P1/GFS', 'port': 'N/A'}],
  'nfs': [
    {'status': 'OFFLINE', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'hostname': '172.16.5.151/24', 'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'},
    {'status': 'OFFLINE', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f', 'hostname': '1hp2', 'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'}],
  'shd': [
    {'status': 'ONLINE', 'hostname': '172.16.5.151/24', 'pid': '2148', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4'},
    {'status': 'ONLINE', 'hostname': '1hp2', 'pid': '2146', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f'}],
  'name': '1HP12-R3A1P1'}}

-- Both data bricks were reported offline. The arbiter brick does not seem 
to be reported in the xml output at all - I think this is a bug.
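
To make the quorum problem obvious from the payload above, a small sketch 
that just counts the bricks reported ONLINE (the dict literal is trimmed to 
the fields used here, but follows the structure of the jsonrpc return):

# Structure follows the 'volumeStatus' return shown above, trimmed down.
volume_status = {
    "name": "1HP12-R3A1P1",
    "bricks": [
        {"brick": "1hp1:/STORAGES/P1/GFS", "status": "OFFLINE"},
        {"brick": "1hp2:/STORAGES/P1/GFS", "status": "OFFLINE"},
        # the arbiter brick is missing from the payload - the bug noted above
    ],
}

online = [b["brick"] for b in volume_status["bricks"] if b["status"] == "ONLINE"]
print("%d of %d reported bricks online: %s"
      % (len(online), len(volume_status["bricks"]), online))
# With both data bricks down, at most 1 of 3 bricks (the arbiter) is
# available, so client-quorum (auto, >= 2 of 3) cannot be met.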

Similarly on 1hp2:
Thread-35::ERROR::2016-03-31 13:21:14,284::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
   ...
     raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
Thread-35::INFO::2016-03-31 13:21:14,285::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID

Corresponding gluster mount log -
[2016-03-31 11:21:16.027640] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met

On 04/05/2016 07:02 PM, paf1 at email.cz wrote:
> Hello Sahina,
> please see the attached logs which you requested
>
> regs.
> Pavel
>
> On 5.4.2016 14:07, Sahina Bose wrote:
>>
>>
>> On 03/31/2016 06:41 PM, paf1 at email.cz wrote:
>>> Hi,
>>> the rest of the logs:
>>> http://www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W
>>>
>>> The test is the last big event in the logs ....
>>> TEST TIME: about 14:00-14:30 CET
>>
>> Thank you Pavel for the interesting test report and sharing the logs.
>>
>> You are right - the master domain should not go down if 2 of 3 bricks 
>> are available from volume A (1HP12-R3A1P1).
>>
>> I notice that host kvmarbiter was not responsive at 2016-03-31 
>> 13:27:19, but the ConnectStorageServerVDSCommand executed on the 
>> kvmarbiter node returned success at 2016-03-31 13:27:26.
>>
>> Could you also share the vdsm logs from 1hp1, 1hp2 and kvmarbiter 
>> nodes during this time ?
>>
>> Ravi, Krutika - could you take a look at the gluster logs?
>>
>>>
>>> regs. Pavel
>>>
>>> On 31.3.2016 14:30, Yaniv Kaul wrote:
>>>> Hi Pavel,
>>>>
>>>> Thanks for the report. Can you begin with a more accurate 
>>>> description of your environment?
>>>> Begin with host, oVirt and Gluster versions. Then continue with the 
>>>> exact setup (what are 'A', 'B', 'C' - domains? Volumes? What is the 
>>>> mapping between domains and volumes?).
>>>>
>>>> Are there any logs you can share with us?
>>>>
>>>> I'm sure with more information, we'd be happy to look at the issue.
>>>> Y.
>>>>
>>>>
>>>> On Thu, Mar 31, 2016 at 3:09 PM, paf1 at email.cz wrote:
>>>>
>>>>     Hello,
>>>>     we tried the following test - with unwanted results
>>>>
>>>>     input:
>>>>     5-node gluster cluster
>>>>     A = replica 3 with arbiter 1 (node1 + node2, arbiter on node 5)
>>>>     B = replica 3 with arbiter 1 (node3 + node4, arbiter on node 5)
>>>>     C = distributed replica 3 arbiter 1 (node1+node2, node3+node4,
>>>>     each arbiter on node 5)
>>>>     node 5 hosts only the arbiter bricks (4x)
>>>>
>>>>     TEST:
>>>>     1)  directly reboot one node - OK (it does not matter whether it
>>>>     is a data node or the arbiter node)
>>>>     2)  directly reboot two nodes - OK (as long as the nodes are not
>>>>     from the same replica)
>>>>     3)  directly reboot three nodes - yes, this is the main problem
>>>>     and the question ....
>>>>         - rebooted all three nodes of replica "B" (not very likely,
>>>>     but who knows ...)
>>>>         - all VMs with data on this replica were paused (no data
>>>>     access) - OK
>>>>         - all VMs running on the replica "B" nodes were lost (started
>>>>     manually later; their data is on other replicas) - acceptable
>>>>     BUT
>>>>         - !!! all oVirt domains went down !!! - the master domain is on
>>>>     replica "A", which lost only one member of three !!!
>>>>         so we were not expecting all domains to go down,
>>>>     especially the master with 2 live members.
>>>>
>>>>     Results:
>>>>         - the whole cluster was unreachable until all domains came up
>>>>     again - which depended on all nodes being up !!!
>>>>         - all paused VMs started back - OK
>>>>         - the rest of the VMs were rebooted and are running - OK
>>>>
>>>>     Questions:
>>>>         1) why did all domains go down when the master domain (on
>>>>     replica "A") had two running members (2 of 3)?
>>>>         2) how can we recover from such a collapse without waiting for
>>>>     all nodes to come up? (e.g. in the worst case a node has a HW error)
>>>>         3) which oVirt cluster policy can prevent this situation?
>>>>     (if any)
>>>>
>>>>     regs.
>>>>     Pavel
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
