oVirt and Gluster storage domain HA testing

I have an oVirt cluster with a system running ovirt-engine 3.6.2.6-1 on CentOS 7.2 and three hardware nodes running glusterfs 3.7.6-1 on CentOS 7.2. I created a Gluster volume using the gluster CLI and then went to add a storage domain.

I created it with the path ovirt1-ks.test:/gv1 and everything works fine until ovirt1 goes down. Then ALL VMs pause until ovirt1 comes back up. Do I have to list all nodes in the path for this to work?

Path: ovirt1-ks.test:/gv1 ovirt2-ks.test:/gv1 ovirt3-ks.test:/gv1

Or how do I prevent ovirt1 from being a single point of failure?

Thanks.
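For reference, a rough sketch of how a volume like this is typically created from the gluster CLI before being added as a storage domain; the replica count is an assumption, and the brick path is taken from the gluster volume status output later in the thread:

    # Rough reconstruction of the volume setup (replica 3 is an assumption;
    # the brick path matches the volume status output shown later in the thread):
    gluster volume create gv1 replica 3 \
        ovirt1-ks.test.j2noc.com:/gluster-store/brick1/gv1 \
        ovirt2-ks.test.j2noc.com:/gluster-store/brick1/gv1 \
        ovirt3-ks.test.j2noc.com:/gluster-store/brick1/gv1
    gluster volume start gv1

Note that only one host then ends up in the storage domain path (ovirt1-ks.test:/gv1), which is what raises the single point of failure question.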

On 02/17/2016 05:08 AM, Bill James wrote:
> I have an oVirt cluster with a system running ovirt-engine 3.6.2.6-1 on CentOS 7.2 and three hardware nodes running glusterfs 3.7.6-1 on CentOS 7.2. I created a Gluster volume using the gluster CLI and then went to add a storage domain.
> I created it with the path ovirt1-ks.test:/gv1 and everything works fine until ovirt1 goes down. Then ALL VMs pause until ovirt1 comes back up. Do I have to list all nodes in the path for this to work?
> Path: ovirt1-ks.test:/gv1 ovirt2-ks.test:/gv1 ovirt3-ks.test:/gv1
> Or how do I prevent ovirt1 from being a single point of failure?
Which type of storage domain have you created - NFS or GlusterFS?

With NFS this could be a problem, as the NFS server running on ovirt1 can become a single point of failure. To work around this you would need to set up HA with CTDB or pacemaker/corosync, depending on the version of NFS you're using.

If you're using GlusterFS, did the hypervisor node go down as well when ovirt1 went down? The server given in the path (ovirt1) is only needed to fetch the volume information while the storage domain is being mounted (when the hypervisor host is activated or rebooted). You can provide backup-volfile-servers=ovirt2-ks:ovirt3-ks in the Mount Options field while creating the storage domain.

Please also provide the output of gluster volume info. (I'm assuming the bricks on the remaining servers were online.)
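For reference, the Mount Options value described above corresponds to an option of the GlusterFS FUSE mount that the hypervisor performs. A minimal sketch of the equivalent manual mount, assuming the short hostnames resolve and using an illustrative mount point:

    # Equivalent manual mount with fallback volfile servers (mount point is illustrative):
    mount -t glusterfs \
        -o backup-volfile-servers=ovirt2-ks.test:ovirt3-ks.test \
        ovirt1-ks.test:/gv1 /mnt/gv1

With this option the client can fall back to ovirt2 or ovirt3 for the volume layout when ovirt1 is unreachable at mount time; after the mount, regular I/O talks to the bricks directly rather than only to ovirt1.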

On Wed, Feb 17, 2016 at 7:57 AM, Sahina Bose <sabose@redhat.com> wrote:
> Which type of storage domain have you created - NFS or GlusterFS? With NFS this could be a problem, as the NFS server running on ovirt1 can become a single point of failure. To work around this you would need to set up HA with CTDB or pacemaker/corosync, depending on the version of NFS you're using.
> If you're using GlusterFS, did the hypervisor node go down as well when ovirt1 went down? The server given in the path (ovirt1) is only needed to fetch the volume information while the storage domain is being mounted (when the hypervisor host is activated or rebooted). You can provide backup-volfile-servers=ovirt2-ks:ovirt3-ks in the Mount Options field while creating the storage domain.
oVirt automatically detects the available bricks and generates the backup-volfile-servers option when connecting to a Gluster storage domain. You can see whether this worked in vdsm.log: the backup-volfile-servers option should appear in the mount command.

Please attach vdsm.log covering the time you connected to the gluster storage domain and the time of the failure. Attach also /var/log/sanlock.log - this is the best place to detect issues accessing storage, as sanlock reads and writes to all storage domains frequently.
> Please also provide the output of gluster volume info. (I'm assuming the bricks on the remaining servers were online.)
Nir
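A quick way to check both of the points above; the vdsm.log path below is the usual default and is an assumption for this setup:

    # Did the mount command include the fallback servers? (default VDSM log path assumed)
    grep backup-volfile-servers /var/log/vdsm/vdsm.log
    # Recent sanlock activity around the failure window:
    tail -n 100 /var/log/sanlock.log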

This is a GlusterFS domain. I repeated my test (powering off ovirt1). This time the VM on ovirt2 continued to run just fine. I noticed that engine.log said ovirt3 was now the Storage Pool Manager. So after ovirt1 was all healed, I powered off ovirt3.

ovirt1 powered off:

[root@ovirt2 test ~]# gluster peer status
Number of Peers: 2

Hostname: 10.100.108.31 (ovirt1)
Uuid: 758d0477-7c9d-496b-91a8-cc1113cf09c4
State: Peer in Cluster (Disconnected)

Hostname: ovirt3.test.j2noc.com
Uuid: dfe44d58-7050-4b8d-ba2f-5ddba1cab2e8
State: Peer in Cluster (Connected)
Other names: ovirt3.test.j2noc.com

[root@ovirt2 test ~]# gluster volume status
Status of volume: gv1
Gluster process                                            TCP Port  RDMA Port  Online  Pid
---------------------------------------------------------------------------------------------
Brick ovirt2-ks.test.j2noc.com:/gluster-store/brick1/gv1   49152     0          Y       45784
Brick ovirt3-ks.test.j2noc.com:/gluster-store/brick1/gv1   49152     0          Y       8536
NFS Server on localhost                                    2049      0          Y       45779
Self-heal Daemon on localhost                              N/A       N/A        Y       45789
NFS Server on ovirt3.test.j2noc.com                        2049      0          Y       8546
Self-heal Daemon on ovirt3.test.j2noc.com                  N/A       N/A        Y       8551

Task Status of Volume gv1
---------------------------------------------------------------------------------------------
There are no active volume tasks

ovirt3 powered off:

[root@ovirt2 test ~]# gluster peer status
Number of Peers: 2

Hostname: 10.100.108.31
Uuid: 758d0477-7c9d-496b-91a8-cc1113cf09c4
State: Peer in Cluster (Connected)

Hostname: ovirt3.test.j2noc.com
Uuid: dfe44d58-7050-4b8d-ba2f-5ddba1cab2e8
State: Peer in Cluster (Disconnected)
Other names: ovirt3.test.j2noc.com

The test VM hung and was paused:

2016-02-17 11:53:51,254 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-14) [1f0d3877] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM billjov1.test.j2noc.com has been paused due to unknown storage error.

I got impatient waiting for the test VM (billjov1) to unpause, so I selected "Run" to activate it again. Attached are engine.log, vdsm.log.xz and sanlock.log from ovirt2 (ovirt* are all hardware nodes). Also attached is vdsm.log.1.xz from yesterday's test. Thanks.
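Since the second failover was started only "after ovirt1 was all healed", a small sketch of how that is usually verified; gv1 is the volume name from this thread and both commands are standard gluster CLI:

    # Confirm self-heal has finished before taking the next node down:
    gluster volume heal gv1 info
    # Volume layout, replica count and options (the "gluster volume info" requested earlier):
    gluster volume info gv1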
Participants (3)
- Bill James
- Nir Soffer
- Sahina Bose