Problem detecting IO bandwidth congestion in oVirt storage domains
Adam Litke
agl at us.ibm.com
Tue Jun 25 15:23:51 UTC 2013
On Tue, Jun 18, 2013 at 12:00:07PM +0800, Mei EL Liu wrote:
> Hi,
>
> I found that it's a little difficult to detect IO bandwidth
> congestion in oVirt storage domains backed by NFS or GlusterFS.
>
> For block-based storage it is easier to detect, since you can use
> a tool like iostat. For file-system-based storage it is much
> harder.
>
> I investigated existing solutions. vSphere uses average IO latency
> to detect it. I propose a similar scheme in
> http://www.ovirt.org/Features/Design/SLA_for_storage_io_bandwidth .
> It simplifies things by making the congestion decision on a single
> host instead of using statistics from all the hosts that use the
> backend storage. It needs no communication between hosts; perhaps
> in phase two we can add communication and make a global decision.
>
> For now, it detects congestion via statistics from the VMs using
> that backend storage on the local host (this info is collected
> through iostat inside the VMs). It collects the IO latency of
> those VMs and computes an average latency for that backend
> storage. If the average is higher than a threshold, congestion is
> detected.
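>
> To make that concrete, here is a minimal sketch of the per-host
> check. vm_latencies() is a hypothetical helper standing in for
> whatever gathers the iostat latency samples from the guests; it is
> not an existing VDSM call, and the threshold value is only an
> example:
>
>     THRESHOLD_MS = 20.0  # assumed constant threshold; see point 3 below
>
>     def is_congested(storage_domain):
>         # Latency samples (ms) from each local VM on this backend.
>         samples = vm_latencies(storage_domain)
>         if not samples:
>             return False
>         return sum(samples) / len(samples) > THRESHOLD_MS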
>
> However, when I tested such a policy, I found that setting the IO
> limit to a smaller value makes the latency longer. That means that
> when the average latency exceeds the threshold, our automatic
> tuning decreases the IO limit, which makes the average IO latency
> even longer. That latency exceeds the threshold again and causes
> the IO limit to be decreased further. This finally drives the IO
> limit to its lower bound.
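>
> A toy model shows the runaway loop. The inverse relation between
> limit and latency below is an assumption for illustration, not
> measured data:
>
>     THRESHOLD_MS = 20.0
>     LIMIT_MIN = 100          # IOPS floor
>     limit = 1000             # starting IOPS limit
>
>     def observed_latency(limit):
>         # Assumed model: a lower limit means higher latency.
>         return 30000.0 / limit
>
>     for step in range(10):
>         if observed_latency(limit) > THRESHOLD_MS:
>             # Congestion detected: the controller cuts the limit,
>             # which only raises latency on the next sample.
>             limit = max(LIMIT_MIN, limit // 2)
>     print(limit)  # ends pinned at LIMIT_MIN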
>
> This scheme is affected by the following factors:
> 1. We collect statistics from the VMs instead of the host (because
> it is hard to collect such info for remote storage like NFS or
> GlusterFS).
> 2. The IO limit affects the latency.
> 3. The threshold is a constant.
> 4. I also find that iostat's await (the latency figure) is not
> good enough, since the reported latency is long for both very
> light and very heavy IO.
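>
> For reference, this is roughly how the await figure would be
> scraped inside a guest. The column layout varies between sysstat
> versions (newer ones split it into r_await/w_await), so treat this
> as a sketch rather than a robust parser:
>
>     import subprocess
>
>     def device_await(device):
>         # Two one-second samples; the first block reports
>         # since-boot averages, so keep only the last one.
>         out = subprocess.check_output(
>             ["iostat", "-dx", device, "1", "2"], text=True)
>         lines = out.splitlines()
>         header = next(l for l in lines if "await" in l).split()
>         idx = header.index("await")
>         row = [l for l in lines if l.startswith(device)][-1].split()
>         return float(row[idx])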
>
>
> Does anybody have an idea or experience with this? Suggestions are
> more than welcome. Thanks in advance.
Thank you for such a thorough introduction to this topic.
One thought I had is that maybe we need to invert your logic with
respect to IO throttling. It could work as follows (a rough sketch
follows the list):
1) At the datacenter level, we establish a throughput range. VMs are
guaranteed the minimum and won't exceed the maximum. Similarly, we
set a range for latency.
2) Hosts continuously reduce the allowable bandwidth for VMs down to
the throughput minimum. If latency rises above the allowable limit
for a single VM, slowly increase its bandwidth up to the allowable
maximum.
3) If, over time, the IO remains congested, you can:
a) Decrease the cluster-wide throughput minimum and maximum values
b) Increase the maximum allowable latency
c) Migrate VM disks to alternate storage
   d) Upgrade storage
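
A rough sketch of the per-VM adjustment loop in step 2. The names
(vm, cfg, the step size) are illustrative, not an existing VDSM
interface:

    def adjust(vm, cfg):
        # Default pressure: step bandwidth down toward the
        # guaranteed minimum from step 1.
        if vm.latency_ms <= cfg.max_latency_ms:
            vm.bw_limit = max(cfg.bw_min, vm.bw_limit - cfg.step)
        else:
            # The VM is suffering: slowly grow its bandwidth back
            # up, but never past the datacenter maximum.
            vm.bw_limit = min(cfg.bw_max, vm.bw_limit + cfg.step)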
--
Adam Litke <agl at us.ibm.com>
IBM Linux Technology Center