[ovirt-users] ovirt with glusterfs - big test - unwanted results
Sahina Bose
sabose at redhat.com
Mon Apr 11 16:34:46 UTC 2016
So I looked at the vdsm logs, and since there were multiple tests done it
was difficult to isolate which error to track down. You mentioned a test
between 14:00-14:30 CET, but the gluster logs that were attached end
at 11:29 UTC.
Tracking down the errors from when the master domain (gluster volume
1HP12-R3A1P1) went inactive, for the time period where the corresponding
gluster volume log was available, they all seem to correspond to an issue
where gluster volume quorum was not met.
Can you confirm whether this was the test performed, or provide logs
from the correct time period? (Both vdsm and gluster mount logs are
required, from the hypervisors where the master domain is mounted.)
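For context on the quorum errors below: on a replica 3 (arbiter 1) volume,
client quorum - which the "Client-quorum is not met" messages in the mount
logs show is enabled here - needs a strict majority of the bricks in the
replica set, with the arbiter counting as a full brick. A minimal sketch of
the arithmetic (illustrative only, not actual gluster code):

    # Illustrative sketch of AFR client-quorum for an odd replica count.
    # Not gluster code - just the arithmetic: quorum needs a strict
    # majority of bricks in the replica set, arbiter included.
    def client_quorum_met(bricks_online, replica_count=3):
        return bricks_online > replica_count // 2

    print(client_quorum_met(2))  # True  - replica "A" with 2 of 3 up keeps quorum
    print(client_quorum_met(0))  # False - replica "B" fully rebooted loses quorum

So replica A losing one of three bricks should still have had quorum, which
is why the logs below (showing quorum lost on the master volume) point to
more than one of its bricks being down at that time.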
For master domain:
On 1hp1:
vdsm.log
Thread-35::ERROR::2016-03-31 13:21:27,225::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
...
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line
454, in statvfs
resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line
427, in _sendCommand
raise OSError(errcode, errstr)
OSError: [Errno 107] Transport endpoint is not connected
Thread-35::INFO::2016-03-31 13:21:27,267::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID
-- And I see a corresponding:
[2016-03-31 11:21:16.027090] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met
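The monitor thread dies in statvfs() on the gluster mount: once the FUSE
client loses its connection, the syscall returns ENOTCONN. A minimal sketch
of the kind of check ioprocess performs here, assuming a hypothetical mount
path for the master domain:

    import errno
    import os

    # Hypothetical FUSE mount path for the master storage domain.
    MOUNT = "/rhev/data-center/mnt/glusterSD/1hp1:1HP12-R3A1P1"

    try:
        st = os.statvfs(MOUNT)          # same syscall the ioprocess traceback shows
        print("domain reachable, free blocks:", st.f_bfree)
    except OSError as e:
        if e.errno == errno.ENOTCONN:   # Errno 107, as in the 1hp1 log above
            print("FUSE client disconnected - domain will go INVALID")
        else:
            raise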
jsonrpc.Executor/0::DEBUG::2016-03-31 13:23:34,110::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) Return 'GlusterVolume.status' in bridge with
{'volumeStatus':
 {'bricks': [{'status': 'OFFLINE', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4',
              'pid': '-1', 'rdma_port': 'N/A', 'brick': '1hp1:/STORAGES/P1/GFS', 'port': 'N/A'},
             {'status': 'OFFLINE', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f',
              'pid': '-1', 'rdma_port': 'N/A', 'brick': '1hp2:/STORAGES/P1/GFS', 'port': 'N/A'}],
  'nfs': [{'status': 'OFFLINE', 'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4',
           'hostname': '172.16.5.151/24', 'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'},
          {'status': 'OFFLINE', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f',
           'hostname': '1hp2', 'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'}],
  'shd': [{'status': 'ONLINE', 'hostname': '172.16.5.151/24', 'pid': '2148',
           'hostuuid': 'f6568a3b-3d65-4f4f-be9f-14a5935e37a4'},
          {'status': 'ONLINE', 'hostname': '1hp2', 'pid': '2146',
           'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f'}],
  'name': '1HP12-R3A1P1'}}
-- Two data bricks were offline. I think the arbiter brick is not reported
in the xml output at all - this is a bug.
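Pulling the brick states out of that payload makes the quorum picture
obvious; a small sketch over the structure logged above (brick entries
trimmed to the relevant keys):

    # Count ONLINE bricks in the GlusterVolume.status payload above.
    # Only the 2 data bricks appear (the arbiter is missing - the bug
    # noted above), and both are OFFLINE: 0 bricks visible to the caller.
    status = {'volumeStatus': {'name': '1HP12-R3A1P1', 'bricks': [
        {'brick': '1hp1:/STORAGES/P1/GFS', 'status': 'OFFLINE'},
        {'brick': '1hp2:/STORAGES/P1/GFS', 'status': 'OFFLINE'},
    ]}}

    bricks = status['volumeStatus']['bricks']
    online = [b['brick'] for b in bricks if b['status'] == 'ONLINE']
    print("%d of %d reported bricks online: %s" % (len(online), len(bricks), online))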
Similarly on 1hp2:
Thread-35::ERROR::2016-03-31 13:21:14,284::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
...
raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
Thread-35::INFO::2016-03-31 13:21:14,285::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID
Corresponding gluster mount log -
[2016-03-31 11:21:16.027640] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met
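Note that the two hosts fail differently: errno 107 (ENOTCONN - the FUSE
transport dropped) on 1hp1, but errno 2 (ENOENT - the monitored path no
longer resolves at all) on 1hp2. The mapping is easy to double-check:

    import errno, os

    # Errno 107 on 1hp1: the gluster FUSE transport is disconnected.
    print(errno.errorcode[107], os.strerror(107))  # ENOTCONN Transport endpoint is not connected
    # Errno 2 on 1hp2: the monitored path itself is gone.
    print(errno.errorcode[2], os.strerror(2))      # ENOENT No such file or directory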
On 04/05/2016 07:02 PM, paf1 at email.cz wrote:
> Hello Sahina,
> please find attached the logs which you requested
>
> regs.
> Pavel
>
> On 5.4.2016 14:07, Sahina Bose wrote:
>>
>>
>> On 03/31/2016 06:41 PM, paf1 at email.cz wrote:
>>> Hi,
>>> rest of logs:
>>> http://www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W
>>>
>>> The TEST is the last big event in the logs ....
>>> TEST TIME : about 14:00-14:30 CET
>>
>> Thank you Pavel for the interesting test report and sharing the logs.
>>
>> You are right - the master domain should not go down if 2 of 3 bricks
>> are available from volume A (1HP12-R3A1P1).
>>
>> I notice that host kvmarbiter was not responsive at 2016-03-31
>> 13:27:19, but the ConnectStorageServerVDSCommand executed on the
>> kvmarbiter node returned success at 2016-03-31 13:27:26.
>>
>> Could you also share the vdsm logs from the 1hp1, 1hp2 and kvmarbiter
>> nodes during this time?
>>
>> Ravi, Krutika - could you take a look at the gluster logs?
>>
>>>
>>> regs. Pavel
>>>
>>> On 31.3.2016 14:30, Yaniv Kaul wrote:
>>>> Hi Pavel,
>>>>
>>>> Thanks for the report. Can you begin with a more accurate
>>>> description of your environment?
>>>> Begin with host, oVirt and Gluster versions. Then continue with the
>>>> exact setup (what are 'A', 'B', 'C' - domains? Volumes? What is the
>>>> mapping between domains and volumes?).
>>>>
>>>> Are there any logs you can share with us?
>>>>
>>>> I'm sure with more information, we'd be happy to look at the issue.
>>>> Y.
>>>>
>>>>
>>>> On Thu, Mar 31, 2016 at 3:09 PM, paf1 at email.cz wrote:
>>>>
>>>> Hello,
>>>> we tried the following test - with unwanted results
>>>>
>>>> input:
>>>> 5-node gluster
>>>> A = replica 3 with arbiter 1 (node1 + node2 + arbiter on node 5)
>>>> B = replica 3 with arbiter 1 (node3 + node4 + arbiter on node 5)
>>>> C = distributed replica 3 arbiter 1 (node1 + node2,
>>>> node3 + node4, each arbiter on node 5)
>>>> node 5 holds only arbiter bricks (4x)
>>>>
>>>> TEST:
>>>> 1) directly reboot one node - OK (it does not matter which one:
>>>> data node or arbiter node)
>>>> 2) directly reboot two nodes - OK (if the nodes are not from the
>>>> same replica)
>>>> 3) directly reboot three nodes - yes, this is the main problem
>>>> and the main question ....
>>>> - rebooted all three nodes of replica "B" (not very
>>>> likely, but who knows ...)
>>>> - all VMs with data on this replica were paused (no data
>>>> access) - OK
>>>> - all VMs running on replica "B" nodes were lost (started
>>>> manually later) (their data is on other replicas) - acceptable
>>>> BUT
>>>> - !!! all oVirt domains went down !! - the master domain is on
>>>> replica "A", which lost only one member of three !!!
>>>> so we did not expect all domains to go down,
>>>> especially the master with 2 live members.
>>>>
>>>> Results:
>>>> - the whole cluster was unreachable until all domains came up -
>>>> dependent on all nodes being up !!!
>>>> - all paused VMs started back - OK
>>>> - the rest of the VMs were rebooted and are running - OK
>>>>
>>>> Questions:
>>>> 1) why did all domains go down if the master domain (on replica "A")
>>>> had two running members (2 of 3)??
>>>> 2) how can we fix that collapse without waiting for all nodes to
>>>> come up? (in the worst case, e.g. if a node has a HW error)??
>>>> 3) which oVirt cluster policy can prevent that situation?
>>>> (if any)
>>>>
>>>> regs.
>>>> Pavel