[ovirt-users] Multi-node cluster with local storage
Pavel Gashev
Pax at acronis.com
Fri Mar 4 12:00:18 UTC 2016
On 04/03/16 13:50, "Sahina Bose" <sabose at redhat.com> wrote:
>On 03/04/2016 04:13 PM, Pavel Gashev wrote:
>> On 04/03/16 12:22, "Sahina Bose" <sabose at redhat.com> wrote:
>>> On 03/04/2016 02:14 AM, Pavel Gashev wrote:
>>>> Unfortunately, oVirt doesn't support multi-node local storage clusters.
>>>> And Gluster/Ceph doesn't work well over a 1G network. It looks like the
>>>> only way to use oVirt in a three-node cluster is to share local storage
>>>> over NFS. At least that makes it possible to migrate VMs and move disks
>>>> among hardware nodes.
>>>
>>> Do you know of reported problems with Gluster over a 1Gb network? I think
>>> 10Gb is recommended, but 1Gb can also be used for gluster.
>>> (We use it in our lab setup and haven't encountered any issues so far,
>>> but of course the workload may be different - hence the question)
>> Let's calculate. If I have a three-node replicated gluster volume, each
>> block written on a node is copied to the other two nodes. A 1GbE link
>> carries roughly 110MB/s, and every write leaves the node twice over that
>> link, so maximal write performance can't be above ~50MB/s.
>>
>> Even if that's acceptable for my workload, things get worse in a failure
>> recovery scenario. Gluster works with files. When a node fails and then
>> recovers (even if it's just a plain reboot), gluster copies a whole file
>> over the network if that file changed during the outage. So if I have a
>> 100GB VM disk and the guest has written a single 512-byte block to it,
>> the whole 100GB will be copied during recovery. That might take about 20
>> minutes for 100GB, and 3 hours for 1TB. The network will be 100% busy
>> during recovery, so VMs on the other nodes will wait on I/O most of the
>> time. In other words, a plain reboot of one node can take the datacenter
>> out of service for several hours.
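>>
>> A back-of-the-envelope sketch of those numbers (the ~110MB/s of usable
>> 1GbE bandwidth is an assumption, not a measurement):
>>
>>     # replica 3: every block written on a node goes out twice over
>>     # the same 1GbE NIC, halving the effective write ceiling
>>     echo "scale=1; 110 / 2" | bc                  # ~55 MB/s best case
>>
>>     # full-file heal, assuming ~100 MB/s of usable bandwidth
>>     echo "scale=1; 100 * 1024 / 100 / 60" | bc    # 100GB image: ~17 min
>>     echo "scale=1; 1024 * 1024 / 100 / 3600" | bc # 1TB image: ~2.9 hours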
>>
>> Things might be better with a distributed+replicated gluster volume. That
>> requires at least six nodes. But things are still bad when you try to
>> rebalance the volume after adding new bricks, or when a node has really
>> failed and been replaced.
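>>
>> As a sketch, a six-node 2 x 3 distributed+replicated volume (hypothetical
>> host names and brick paths) would be created with something like:
>>
>>     gluster volume create data replica 3 \
>>         node1:/bricks/data node2:/bricks/data node3:/bricks/data \
>>         node4:/bricks/data node5:/bricks/data node6:/bricks/data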
>>
>> Thus, a 1Gb network is OK for a lab, but it's not OK for production. IMHO.
>
>Most of the problems that you outline here (related to healing and
>replacing nodes) are addressed by the sharding translator. Sharding
>breaks the large image file into smaller files, so that the entire
>file does not have to be copied. More details here -
>http://blog.gluster.org/2015/12/introducing-shard-translator/
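>
>For example, sharding is enabled per volume (hypothetical volume name
>"data"; it only applies to files created after it is turned on):
>
>    gluster volume set data features.shard on
>    gluster volume set data features.shard-block-size 64MB
>
>With shards of the default 64MB size, a heal after a reboot re-copies
>only the shards that changed, not the whole image.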
Sure, I meant the same by mentioning distributed+replicated volumes. Actually, distributed+striped+replicated - https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/sect-User_Guide-Setting_Volumes-Distributed_Striped_Replicated.html