From paf1 at email.cz Thu Mar 31 08:09:11 2016
From: paf1 at email.cz
To: users at ovirt.org
Subject: [ovirt-users] ovirt with glusterfs - big test - unwanted results
Date: Thu, 31 Mar 2016 14:09:05 +0200
Message-ID: <56FD1361.3010805@email.cz>

Hello,
we ran the following test and got unwanted results.

Input:
5-node Gluster cluster
A = replica 3 with arbiter 1 (node1 + node2 + arbiter on node 5)
B = replica 3 with arbiter 1 (node3 + node4 + arbiter on node 5)
C = distributed replica 3 with arbiter 1 (node1 + node2, node3 + node4, each arbiter on node 5)
node 5 holds only arbiter bricks (4x)
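
To make the layout above easier to reason about, it can be written down as a small data model. The following Python sketch is only an illustration: the node names (`node1` to `node5`), the `VOLUMES` dictionary and the helper function are assumptions made for this example and are not output of any Gluster tool. It counts the arbiter bricks per node and reproduces the "4x" stated for node 5.

```python
from collections import Counter

# Illustrative model of the layout described above (names are assumptions).
# Each replica set is written as (data node, data node, arbiter node).
VOLUMES = {
    "A": [("node1", "node2", "node5")],
    "B": [("node3", "node4", "node5")],
    # C is distributed over two replica sets, each with its arbiter on node5.
    "C": [("node1", "node2", "node5"), ("node3", "node4", "node5")],
}


def arbiter_bricks_per_node(volumes):
    """Count how many arbiter bricks each node hosts.

    The arbiter is taken to be the last member of each replica set.
    """
    counts = Counter()
    for replica_sets in volumes.values():
        for *_data_nodes, arbiter in replica_sets:
            counts[arbiter] += 1
    return counts


if __name__ == "__main__":
    print(dict(arbiter_bricks_per_node(VOLUMES)))
```

In this model node 5 is the only node that appears in every replica set, so it is the one node whose reboot touches all three volumes at once.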

TEST:
1) directly reboot one node - OK (it does not matter which one, data node or arbiter node)
2) directly reboot two nodes - OK (if the nodes are not from the same replica)
3) directly reboot three nodes - this is the main problem and the source of our questions:
    - rebooted all three nodes of replica "B" (not very likely, but who knows ...)
    - all VMs with data on this replica were paused (no data access) - OK
    - all VMs running on the replica "B" nodes were lost (started manually later; their data is on other replicas) - acceptable
BUT
    - all oVirt domains went down, although the master domain is on replica "A", which lost only one member of three!
    We did not expect all domains to go down, especially not the master domain, which still had 2 live members.
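
The expectation behind the "BUT" above can be checked by simple counting. This Python sketch assumes that a replica set stays usable while more than half of its members are up. That majority rule, the `VOLUMES` model and all function names are assumptions for illustration only, and the sketch does not model any cluster-wide behaviour of Gluster or oVirt.

```python
# Illustrative model of the layout (names are assumptions): each replica set
# is written as (data node, data node, arbiter node).
VOLUMES = {
    "A": [("node1", "node2", "node5")],
    "B": [("node3", "node4", "node5")],
    "C": [("node1", "node2", "node5"), ("node3", "node4", "node5")],
}


def live_members(replica_set, down_nodes):
    """Number of members of one replica set whose node is still up."""
    return sum(1 for node in replica_set if node not in down_nodes)


def has_majority(replica_set, down_nodes):
    """Assumption: a replica set is usable while more than half of it is up."""
    return live_members(replica_set, down_nodes) * 2 > len(replica_set)


def volume_status(volumes, down_nodes):
    """Map each volume to a list of (live, total, majority_ok) per replica set."""
    down = set(down_nodes)
    return {
        name: [
            (live_members(rs, down), len(rs), has_majority(rs, down))
            for rs in replica_sets
        ]
        for name, replica_sets in volumes.items()
    }


if __name__ == "__main__":
    # Test 3: all three nodes of replica "B" are rebooted.
    for name, sets in volume_status(VOLUMES, ["node3", "node4", "node5"]).items():
        print(name, sets)
```

Under this simplified per-replica count, replica "A" keeps 2 of 3 members in test 3, while replica "B" and the second replica set of "C" keep none. That matches the paused VMs on "B" and is why losing all domains, including the master domain on "A", was unexpected.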

Results:
    - the whole cluster was unreachable until all domains were up, which depended on all nodes being up!
    - all paused VMs resumed - OK
    - all remaining VMs rebooted and are running - OK

Questions:
    1) Why did all domains go down when the master domain (on replica "A") still had two running members (2 of 3)?
    2) How can we recover from that collapse without waiting for all nodes to come up (in the worst case, e.g. when a node has a hardware error)?
    3) Which oVirt cluster policy, if any, can prevent that situation?

Regards,
Pavel

