Hello,
we ran the following test, with unwanted results.
Input:
5-node Gluster cluster
A = replica 3 arbiter 1 ( node1 + node2, arbiter on node5 )
B = replica 3 arbiter 1 ( node3 + node4, arbiter on node5 )
C = distributed replica 3 arbiter 1 ( node1 + node2 and node3 + node4,
each with its arbiter on node5 )
node5 holds only arbiter bricks ( 4x )
TEST:
1) directly reboot one node - OK ( it does not matter which one, data
node or arbiter node )
2) directly reboot two nodes - OK ( if the nodes are not from the same
replica )
3) directly reboot three nodes - this is the main problem and the
source of our questions
- rebooted all three nodes of replica "B" ( not very likely, but
who knows ... )
- all VMs with data on this replica were paused ( no data access ) - OK
- all VMs running on the replica "B" nodes were lost ( started manually
later; their data is on other replicas ) - acceptable
BUT
- !!! all oVirt domains went down !!! - the master domain is on replica
"A", which lost only one member out of three !!!
We did not expect all domains to go down, especially the master
domain, which still had 2 live members.
Results:
- the whole cluster was unreachable until all domains were up, which
depends on all nodes being up !!!
- all paused VMs resumed - OK
- all the remaining VMs rebooted and are running - OK
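
A minimal Python sketch of the expectation behind test 3. It is only an
illustration: it assumes the simple rule that a replica 3 arbiter 1 set stays
usable while at least 2 of its 3 bricks are up, and it ignores any other
Gluster or oVirt quorum setting. The names REPLICA_SETS, set_has_quorum and
volume_status are made up for this example.

```python
# Brick placement as described in the "input" section of this mail.
# Volume C is distributed, so it consists of two replica sets.
REPLICA_SETS = {
    "A": [("node1", "node2", "node5")],
    "B": [("node3", "node4", "node5")],
    "C": [("node1", "node2", "node5"), ("node3", "node4", "node5")],
}


def set_has_quorum(bricks, down):
    """Assumed rule: a set is usable while at least 2 of its 3 bricks are up."""
    up = [node for node in bricks if node not in down]
    return len(up) >= 2


def volume_status(down):
    """Map each volume to True (up) or False (down).

    A volume counts as up only if every one of its replica sets has quorum.
    """
    down = set(down)
    return {
        name: all(set_has_quorum(bricks, down) for bricks in sets)
        for name, sets in REPLICA_SETS.items()
    }


if __name__ == "__main__":
    # Test 3: reboot all three nodes of replica "B".
    print(volume_status({"node3", "node4", "node5"}))
```

Under that assumed rule, replica "A" keeps 2 of 3 bricks in test 3 and would
be expected to stay up, while "B" and "C" lose quorum.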
Questions:
1) why did all domains go down if the master domain ( on replica "A" )
still had two running members ( 2 of 3 ) ??
2) how can we recover from that collapse without waiting for all nodes
to come up ? ( worst case: a node has a HW error, for example ) ??
3) which oVirt cluster policy can prevent that situation ?? ( if
any )
Regards,
Pavel