
On 08/21/2015 11:30 AM, Ravishankar N wrote:
On 08/21/2015 01:21 PM, Sander Hoentjen wrote:
On 08/21/2015 09:28 AM, Ravishankar N wrote:
On 08/20/2015 02:14 PM, Sander Hoentjen wrote:
On 08/19/2015 09:04 AM, Ravishankar N wrote:
On 08/18/2015 04:22 PM, Ramesh Nachimuthu wrote:
+ Ravi from gluster.
Regards, Ramesh
----- Original Message -----
From: "Sander Hoentjen" <sander@hoentjen.eu>
To: users@ovirt.org
Sent: Tuesday, August 18, 2015 3:30:35 PM
Subject: [ovirt-users] Ovirt/Gluster
Hi,
We are looking for some easy-to-manage, self-contained VM hosting. oVirt with GlusterFS seems to fit that bill perfectly. I installed it and then started kicking the tires. First results looked promising, but now I can get a VM to pause indefinitely fairly easily:
My setup is 3 hosts that are in a Virt and Gluster cluster. Gluster is set up as replica 3. The gluster export is used as the storage domain for the VMs.
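For reference, a replica 3 volume like this would typically be created (if done from the gluster CLI rather than the oVirt UI) with something like:

    # replica 3 volume across the three hosts; name and brick paths match the volume info further below
    gluster volume create VMS replica 3 \
        10.99.50.20:/brick/VMS 10.99.50.21:/brick/VMS 10.99.50.22:/brick/VMS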
Hi,
What version of gluster and ovirt are you using?
glusterfs-3.7.3-1.el7.x86_64
vdsm-4.16.20-0.el7.centos.x86_64
ovirt-engine-3.5.3.1-1.el7.centos.noarch
Now when I start the VM all is good; performance is good enough, so we are happy. I then start bonnie++ to generate some load. I have a VM running on host 1, host 2 is SPM, and all 3 hosts are seeing some network traffic courtesy of gluster.
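For reference, a typical bonnie++ invocation to generate that kind of load would be something like (the target directory is just an example):

    # run inside the guest; -d points at a directory on the VM disk, -u is required when running as root
    bonnie++ -d /data -u root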
Now, for fun, the network on host 3 suddenly goes bad (iptables -I OUTPUT -m statistic --mode random --probability 0.75 -j REJECT). Some time later I see the guest has a small "hiccup"; I'm guessing that is when gluster decides host 3 is not allowed to play anymore. No big deal anyway. After a while, only 25% of packets getting through just isn't good enough for oVirt anymore, so the host gets fenced.
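To undo the fault injection afterwards, deleting the same rule specification restores the network, e.g.:

    # remove the random-REJECT rule inserted above
    iptables -D OUTPUT -m statistic --mode random --probability 0.75 -j REJECT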
I'm not sure what fencing means w.r.t. oVirt and what it actually fences. As far as gluster is concerned, since only one node is blocked, the VM image should still be accessible by the VM running on host1.
Fencing means (at least in this case) that the IPMI of the server does a power reset.
After a reboot, *sometimes* the VM will be paused, and even after the gluster self-heal is complete it cannot be unpaused; it has to be restarted.
Could you provide the gluster mount (fuse?) logs and the brick logs of all 3 nodes when the VM is paused? That should give us some clue.
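They should be under /var/log/glusterfs/ on each node; assuming the default log locations, something like this should collect everything:

    # run on each node; the fuse mount log is named after the mount point, brick logs live under bricks/
    tar czf gluster-logs-$(hostname).tar.gz \
        /var/log/glusterfs/rhev-data-center-mnt-glusterSD*.log \
        /var/log/glusterfs/bricks/*.log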
Logs are attached. The problem was at around 8:15 - 8:20 UTC. This time, however, the VM stopped even without a reboot of hyp03.
The mount logs (rhev-data-center-mnt-glusterSD*) indicate frequent disconnects from the bricks, with 'clnt_ping_timer_expired', 'Client-quorum is not met' and 'Read-only file system' messages. Client-quorum is enabled by default for replica 3 volumes, so if the mount cannot connect to at least 2 bricks, quorum is lost and the gluster volume becomes read-only. That seems to be the reason why the VMs are pausing. I'm not sure whether the frequent disconnects are due to a flaky network or to the bricks not responding to the mount's ping timer because their epoll threads are busy with I/O (unlikely). Can you also share the output of `gluster volume info <volname>`?
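A quick way to get a timeline of the disconnects and the quorum loss is to grep the mount log for those messages:

    # rough timeline of ping timeouts, quorum loss and read-only errors in the mount log
    grep -E 'clnt_ping_timer_expired|Client-quorum is not met|Read-only file system' \
        /var/log/glusterfs/rhev-data-center-mnt-glusterSD*.log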
The frequent disconnects are probably because I intentionally broke the network on hyp03 (dropped 75% of outgoing packets). In my opinion this should not affect the VM on hyp02. Am I wrong to think that?
For client-quorum: if a client (mount) cannot connect to the number of bricks needed to achieve quorum, the client becomes read-only. So if the client on hyp02 can still reach the bricks on hyp02 and hyp01, it shouldn't be affected.
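To confirm which bricks each mount is actually connected to, the client list per brick can be checked from any node that is still part of the cluster:

    # lists the clients (fuse mounts) currently connected to each brick of the volume
    gluster volume status VMS clients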
But it was, and I only "broke" hyp03.
[root@hyp01 ~]# gluster volume info VMS
Volume Name: VMS
Type: Replicate
Volume ID: 9e6657e7-8520-4720-ba9d-78b14a86c8ca
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.99.50.20:/brick/VMS
Brick2: 10.99.50.21:/brick/VMS
Brick3: 10.99.50.22:/brick/VMS
Options Reconfigured:
performance.readdir-ahead: on
nfs.disable: on
user.cifs: disable
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
I see that you have enabled server-quorum too. Since you blocked hyp03, if the glusterd on that node cannot see the other 2 nodes due to the iptables rules, it will kill all the brick processes on that node. See the "7 How To Test" section in http://www.gluster.org/community/documentation/index.php/Features/Server-quo... to get a better idea of server-quorum.
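While the packet loss is in place you should be able to see server-quorum kicking in on hyp03 itself, e.g.:

    # run on hyp03 while the iptables rule is active
    gluster peer status        # the other two peers would likely show as Disconnected
    gluster volume status VMS  # the local bricks should be reported as offline once glusterd kills them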
Yes, but it should only kill the bricks on hyp03, right? So then why does the VM on hyp02 die? I don't like the fact that a problem on any one of the hosts can bring down any VM on any host.

-- Sander