From paf1 at email.cz Thu Mar 31 08:09:11 2016
From: paf1 at email.cz
To: users at ovirt.org
Subject: [ovirt-users] ovirt with glusterfs - big test - unwanted results
Date: Thu, 31 Mar 2016 14:09:05 +0200
Message-ID: <56FD1361.3010805@email.cz>

Hello,
we tried the following test - with unwanted results.

input:
5-node Gluster
A = replica 3 with arbiter 1 (node1 + node2 + arbiter on node 5)
B = replica 3 with arbiter 1 (node3 + node4 + arbiter on node 5)
C = distributed replica 3 with arbiter 1 (node1 + node2, node3 + node4, each arbiter on node 5)
node 5 holds only the arbiter bricks (4x)

TEST:
1) directly reboot one node - OK (it does not matter which one, data node or arbiter node)
2) directly reboot two nodes - OK (if the nodes are not from the same replica)
3) directly reboot three nodes - yes, this is the main problem and the questions ...
    - rebooted all three nodes from replica "B" (not very likely, but who knows ...)
    - all VMs with data on this replica were paused (no data access) - OK
    - all VMs running on the replica "B" nodes were lost (started manually later; their data is on other replicas) - acceptable
BUT
    - !!! all oVirt domains went down !! - the master domain is on replica "A", which lost only one member of three !!!
    We did not expect all domains to go down, especially the master with 2 live members.

Results:
    - the whole cluster was unreachable until all domains were up - which depended on all nodes being up !!!
    - all paused VMs started back - OK
    - the rest of the VMs rebooted and are running - OK

Questions:
    1) why did all domains go down if the master domain (on replica "A") still had two running members (2 of 3)?
    2) how can we fix that collapse without waiting for all nodes to come back up (in the worst case, e.g. if a node has a HW error)?
    3) which oVirt cluster policy can prevent that situation (if any)?

regs.
Pavel
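For orientation, below is a minimal sketch of how a layout like A, B and C is typically created with the GlusterFS CLI. The volume names, hostnames (node1-node5) and brick paths are placeholders taken from Pavel's description, not the actual configuration (a later reply shows that volume A is really named 1HP12-R3A1P1).

    # Sketch only - volume names, hostnames and brick paths are placeholders.
    # A: replica 3 with the arbiter brick on node5
    gluster volume create volA replica 3 arbiter 1 \
        node1:/gluster/a/brick node2:/gluster/a/brick node5:/gluster/arb-a/brick

    # B: replica 3 with the arbiter brick on node5
    gluster volume create volB replica 3 arbiter 1 \
        node3:/gluster/b/brick node4:/gluster/b/brick node5:/gluster/arb-b/brick

    # C: distributed-replicate, 2 x (2 data + 1 arbiter), both arbiter bricks on node5
    # (gluster may ask for confirmation about co-located bricks; this mirrors the described layout)
    gluster volume create volC replica 3 arbiter 1 \
        node1:/gluster/c1/brick node2:/gluster/c1/brick node5:/gluster/arb-c1/brick \
        node3:/gluster/c2/brick node4:/gluster/c2/brick node5:/gluster/arb-c2/brick

    gluster volume start volA && gluster volume start volB && gluster volume start volC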
On 31.3.2016 14:30, Yaniv Kaul wrote:

Hi Pavel,

Thanks for the report. Can you begin with a more accurate description of your environment? Begin with the host, oVirt and Gluster versions. Then continue with the exact setup (what are 'A', 'B', 'C' - domains? Volumes? What is the mapping between domains and volumes?).

Are there any logs you can share with us?

I'm sure that with more information, we'd be happy to look at the issue.
Y.
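The details asked for above can usually be collected with a few standard commands; a hedged sketch follows, assuming EL-based hosts with the usual package names (vdsm, glusterfs-server, ovirt-engine), which may differ on this installation.

    # On each host: OS release and component versions
    cat /etc/redhat-release
    rpm -q vdsm glusterfs-server
    gluster --version | head -n1

    # On any Gluster peer: peer list and the volume-to-brick mapping for A, B and C
    gluster peer status
    gluster volume info
    gluster volume status

    # On the engine host: oVirt engine version
    rpm -q ovirt-engine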
Hi,

the rest of the logs:
www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W

The TEST is the last big event in the logs ....
TEST TIME: about 14:00-14:30 CET

regs.
Pavel
Hello Sahina,

please see the attached logs which you requested.

regs.
Pavel
On 5.4.2016 14:07, Sahina Bose wrote:
Thank you, Pavel, for the interesting test report and for sharing the logs.

You are right - the master domain should not go down if 2 of 3 bricks of volume A (1HP12-R3A1P1) are available.

I notice that host kvmarbiter was not responsive at 2016-03-31 13:27:19, but the ConnectStorageServerVDSCommand executed on the kvmarbiter node returned success at 2016-03-31 13:27:26.

Could you also share the vdsm logs from the 1hp1, 1hp2 and kvmarbiter nodes during this time?

Ravi, Krutika - could you take a look at the gluster logs?
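A hedged sketch of how the brick availability of volume A could be checked and the requested logs bundled; the volume name 1HP12-R3A1P1 and the host names come from the thread, while the log paths are the standard vdsm and GlusterFS locations and may need adjusting for this setup.

    # From any Gluster peer: which bricks of volume A are online, and pending heals
    gluster volume status 1HP12-R3A1P1
    gluster volume heal 1HP12-R3A1P1 info

    # On each of 1hp1, 1hp2 and kvmarbiter: bundle the vdsm and gluster logs covering the test window
    tar czf /tmp/$(hostname -s)-logs.tar.gz /var/log/vdsm/vdsm.log* /var/log/glusterfs/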