[ovirt-users] ovirt with glusterfs - big test - unwanted results

Sahina Bose sabose at redhat.com
Tue Apr 5 12:07:57 UTC 2016



On 03/31/2016 06:41 PM, paf1 at email.cz wrote:
> Hi,
> rest of logs:
> http://www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W
>
> The TEST is the last big event in the logs ...
> TEST TIME: about 14:00-14:30 CET

Thank you, Pavel, for the interesting test report and for sharing the logs.

You are right - the master domain should not go down while 2 of the 3
bricks of volume A (1HP12-R3A1P1) are still available.
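
To rule out a quorum misconfiguration, it would also help to see the
current settings on that volume. A minimal sketch of what to run on any
of the nodes (the volume name is taken from your report; 'volume get'
needs a reasonably recent gluster release):

    # show the volume layout; changed quorum options appear under
    # "Options Reconfigured"
    gluster volume info 1HP12-R3A1P1

    # effective client- and server-side quorum settings (newer gluster only)
    gluster volume get 1HP12-R3A1P1 cluster.quorum-type
    gluster volume get 1HP12-R3A1P1 cluster.server-quorum-type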

I notice that the host kvmarbiter was not responsive at 2016-03-31
13:27:19, but the ConnectStorageServerVDSCommand executed on the
kvmarbiter node returned success at 2016-03-31 13:27:26.
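
For reference, the two events above come from the engine log; something
along these lines pulls out the relevant entries (the path below is the
default on the engine host - adjust if your installation differs):

    # filter the engine log for the host / command around the test window
    grep -E 'kvmarbiter|ConnectStorageServerVDSCommand' \
        /var/log/ovirt-engine/engine.log | grep '2016-03-31 13:2'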

Could you also share the vdsm logs from the 1hp1, 1hp2 and kvmarbiter
nodes covering this time window?
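
Something like this, run on each of the three nodes, should capture it
(assuming the default vdsm log location; the glob also picks up rotated
files in case the logs have already rolled over):

    # bundle the current and rotated vdsm logs for the 13:00-14:30 window
    tar czf vdsm-logs-$(hostname -s).tar.gz /var/log/vdsm/vdsm.log*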

Ravi, Krutika - could you take a look at the gluster logs?

>
> regs.Pavel
>
> On 31.3.2016 14:30, Yaniv Kaul wrote:
>> Hi Pavel,
>>
>> Thanks for the report. Can you begin with a more accurate description 
>> of your environment?
>> Begin with host, oVirt and Gluster versions. Then continue with the 
>> exact setup (what are 'A', 'B', 'C' - domains? Volumes? What is the 
>> mapping between domains and volumes?).
>>
>> Are there any logs you can share with us?
>>
>> I'm sure with more information, we'd be happy to look at the issue.
>> Y.
>>
>>
>> On Thu, Mar 31, 2016 at 3:09 PM, paf1 at email.cz wrote:
>>
>>     Hello,
>>     we tried the following test - with unwanted results
>>
>>     input:
>>     5-node gluster cluster
>>     A = replica 3 with arbiter 1 ( node1+node2 + arbiter on node 5 )
>>     B = replica 3 with arbiter 1 ( node3+node4 + arbiter on node 5 )
>>     C = distributed replica 3 with arbiter 1 ( node1+node2, node3+node4,
>>     each arbiter on node 5 )
>>     node 5 hosts only arbiter bricks ( 4x )
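>>
>>     ( for illustration, such a layout corresponds to create commands along
>>     these lines - volume names and brick paths below are only placeholders,
>>     not the real ones :
>>
>>         # two plain replica-3 volumes, arbiter brick on node5
>>         gluster volume create A replica 3 arbiter 1 \
>>             node1:/bricks/A node2:/bricks/A node5:/bricks/A-arb
>>         gluster volume create B replica 3 arbiter 1 \
>>             node3:/bricks/B node4:/bricks/B node5:/bricks/B-arb
>>
>>         # distributed-replicate: two replica sets, every 3rd brick is an arbiter
>>         gluster volume create C replica 3 arbiter 1 \
>>             node1:/bricks/C1 node2:/bricks/C1 node5:/bricks/C1-arb \
>>             node3:/bricks/C2 node4:/bricks/C2 node5:/bricks/C2-arb
>>     )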
>>
>>     TEST:
>>     1)  directly reboot one node - OK ( it does not matter which one,
>>     data node or arbiter node )
>>     2)  directly reboot two nodes - OK ( as long as the nodes are not
>>     from the same replica set )
>>     3)  directly reboot three nodes - yes, this is the main problem
>>     and the main question ....
>>         - we rebooted all three nodes of replica "B" ( not very likely,
>>     but who knows ... )
>>         - all VMs with data on this replica were paused ( no data
>>     access ) - OK
>>         - all VMs running on the replica "B" nodes were lost ( started
>>     manually later; their data is on other replicas ) - acceptable
>>     BUT
>>         - !!! all oVirt domains went down !! - the master domain is on
>>     replica "A", which lost only one member of three !!!
>>         so we did not expect that all domains would go down,
>>     especially the master with 2 live members.
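>>
>>     ( to cross-check what gluster itself saw at that moment, commands of
>>     this kind can be run on one of the surviving replica "A" nodes - just
>>     a sketch, with "A" standing in for the real volume name :
>>
>>         gluster volume status A       # which bricks / nodes are online
>>         gluster volume heal A info    # files pending heal, per brick
>>     )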
>>
>>     Results:
>>         - the whole cluster was unreachable until all domains came up
>>     again, which depended on all nodes being up !!!
>>         - all paused VMs resumed - OK
>>         - the rest of the VMs were rebooted and are running - OK
>>
>>     Questions:
>>         1) why did all domains go down if the master domain ( on replica
>>     "A" ) still had two running members ( 2 of 3 ) ??
>>         2) how can we recover from that collapse without waiting for all
>>     nodes to come up ? ( in the worst case a node may have a HW error, e.g. ) ??
>>         3) which oVirt cluster policy can prevent that situation
>>     ?? ( if any )
>>
>>     regs.
>>     Pavel
>>
