
Hello,

we ran the following test and got unwanted results.

Input:
5-node Gluster cluster
A = replica 3 with arbiter 1 ( node1 + node2 + arbiter on node 5 )
B = replica 3 with arbiter 1 ( node3 + node4 + arbiter on node 5 )
C = distributed replica 3 arbiter 1 ( node1 + node2, node3 + node4, each arbiter on node 5 )
node 5 holds only arbiter bricks ( 4x )

TEST:
1) directly reboot one node - OK ( it does not matter which one, data node or arbiter node )
2) directly reboot two nodes - OK ( if the nodes are not from the same replica )
3) directly reboot three nodes - this is the main problem and the source of our questions:
   - rebooted all three nodes of replica "B" ( not very likely, but who knows ... )
   - all VMs with data on this replica were paused ( no data access ) - OK
   - all VMs running on the replica "B" nodes were lost ( started manually later; their data is on other replicas ) - acceptable
   BUT
   - all oVirt domains went down, even though the master domain is on replica "A", which lost only one member out of three. We did not expect all domains to go down, especially the master with 2 live members.

Results:
- the whole cluster was unreachable until all domains were up, which depends on all nodes being up
- all paused VMs resumed - OK
- the rest of the VMs were rebooted and are running - OK

Questions:
1) Why did all domains go down if the master domain ( on replica "A" ) has two running members ( 2 of 3 )?
2) How can we recover from that collapse without waiting for all nodes to come up ( in the worst case, e.g. when a node has a HW error )?
3) Which oVirt cluster policy can prevent that situation ( if any )?

Regards,
Pavel
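
To make the expectation behind question 1 concrete, here is a minimal sketch of the per-replica arithmetic for the layout described above. It assumes a replica set stays usable as long as a strict majority of its three bricks (2 of 3) is reachable. This is a simplification for illustration only; it does not model how Gluster or oVirt actually decide, and the names `REPLICA_SETS`, `has_majority` and `surviving_sets` are hypothetical.

```python
# Brick placement as described in the message: (data, data, arbiter).
# "C-1" and "C-2" are the two replica sets of the distributed volume C.
REPLICA_SETS = {
    "A": ["node1", "node2", "node5"],
    "B": ["node3", "node4", "node5"],
    "C-1": ["node1", "node2", "node5"],
    "C-2": ["node3", "node4", "node5"],
}


def has_majority(bricks, down):
    """Return True if more than half of the bricks sit on nodes that are up.

    Assumption: a plain majority rule (2 of 3), nothing Gluster-specific.
    """
    up = sum(1 for node in bricks if node not in down)
    return up * 2 > len(bricks)


def surviving_sets(down):
    """Map each replica set name to whether it still has a majority."""
    return {name: has_majority(bricks, down) for name, bricks in REPLICA_SETS.items()}


if __name__ == "__main__":
    scenarios = (
        {"node1"},                    # test 1: one node down
        {"node1", "node3"},           # test 2: two nodes from different replicas
        {"node3", "node4", "node5"},  # test 3: all nodes of replica "B"
    )
    for down in scenarios:
        print(sorted(down), surviving_sets(down))
```

Under this simplified rule, test 3 leaves replica "A" and the first half of "C" with 2 of 3 members, which is why we expected the master domain to stay up.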