Hi Strahil,
I've tried to measure, at least roughly, the cost of erasure coding and, more importantly,
of VDO with de-duplication and compression.
Erasure coding should be negligible in terms of CPU power, while the vastly more complex LZ4
compression (used inside VDO) really is rather impressive at 1GByte/s single-threaded for
compression (6GByte/s decompression, on a 25GByte/s memory bus) on the 15Watt NUCs I am
using for one cluster.
The storage I/O overhead of erasure coding shouldn't really matter with NVMe becoming
cheaper than SATA SSD. Perhaps the write amplification needs to be watched with SSDs, but
a lot of that is writeback tuning and with a Gluster in the back, you can commit to RAM as
long as you have a quorum (and a UPS).
With Gluster I guess most of the erasure coding would actually be done by the client, and
the network amplification would also be there, but not really different between erasure
coding and replicas: if you write to nine nodes, you write to nine nodes from the client,
independent of the encoding.
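
The fan-out is the same either way; what can differ is the number of bytes each brick receives. A back-of-the-envelope sketch, assuming an ideal k+m code where every brick gets 1/k of the payload and ignoring protocol overhead (the function names are made up for illustration, they are not Gluster APIs):

```python
def disperse_write_amplification(data_bricks: int, redundancy_bricks: int) -> float:
    """Bytes leaving the client per payload byte for an ideal k+m dispersed layout."""
    return (data_bricks + redundancy_bricks) / data_bricks


def replica_write_amplification(replicas: int) -> float:
    """Bytes leaving the client per payload byte when every replica gets a full copy."""
    return float(replicas)


# Layouts picked for illustration only.
layouts = {
    "replica 3": replica_write_amplification(3),
    "disperse 4+2": disperse_write_amplification(4, 2),
    "disperse 6+3": disperse_write_amplification(6, 3),
}

for name, factor in layouts.items():
    print(f"{name}: {factor:.2f}x payload on the wire")
```

Under these assumptions the client talks to all bricks of the set in both cases, but a dispersed layout puts fewer bytes on the wire than full replicas.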
There the ability to say "please continue to use the 4:2 dispersion as I expand from
6 to 9 nodes and roll that across on a shard-by-shard basis without me having to set up
bricks like that" would certainly help.
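
To make that wish concrete, here is a purely hypothetical placement sketch: it is not an existing Gluster feature or API, just one simple way a 4+2 shard could be spread over 9 nodes by rotating the starting node. The name place_shard and the round-robin rule are my own assumptions:

```python
from collections import Counter


def place_shard(shard_index: int, nodes: list[str], width: int = 6) -> list[str]:
    """Pick `width` distinct nodes for one shard by rotating the start position.

    width=6 corresponds to 4 data + 2 redundancy fragments (hypothetical scheme).
    """
    if width > len(nodes):
        raise ValueError("need at least as many nodes as fragments")
    start = shard_index % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(width)]


nodes = [f"node{i}" for i in range(1, 10)]  # nine nodes after the expansion

# Count how many fragments each node holds after placing nine shards.
load = Counter()
for shard in range(9):
    load.update(place_shard(shard, nodes))

for node in nodes:
    print(node, load[node])
```

With this rotation every shard still tolerates two lost nodes, and the fragments end up evenly spread over all nine nodes without defining bricks per layout by hand.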
With all of VDO enabled I get 200MByte/s for a random data workload on FIO via Gluster,
which becomes 600MByte/s for reads with 3 replicas on the 10Gbit network I use, 60% of the
theoretical maximum with random I/O.
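
The arithmetic behind those numbers, as a rough sketch. It assumes that the 600MByte/s is simply three times the 200MByte/s and that "theoretical maximum" means roughly 1GByte/s of usable payload on a 10Gbit link; both are assumptions on top of the figures above, not additional measurements:

```python
LINK_GBIT = 10          # network speed from the text
WRITE_MBYTE_S = 200     # fio result via Gluster with VDO enabled
REPLICAS = 3            # replica count from the text

# Assumption: about 1000 MByte/s of payload is achievable on 10Gbit.
ASSUMED_USABLE_MBYTE_S = 1000

raw_link_mbyte_s = LINK_GBIT * 1000 / 8      # line rate without any overhead
wire_mbyte_s = WRITE_MBYTE_S * REPLICAS      # traffic across three replicas

print(f"raw line rate:         {raw_link_mbyte_s:.0f} MByte/s")
print(f"traffic at 3 replicas: {wire_mbyte_s} MByte/s")
print(f"share of assumed max:  {wire_mbyte_s / ASSUMED_USABLE_MBYTE_S:.0%}")
print(f"share of raw line rate: {wire_mbyte_s / raw_link_mbyte_s:.0%}")
```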
That's completely adequate, because we're not running HPC or SAP batches here and
I'd be rather sure that using erasure coding with 6 and 9 nodes won't introduce a
performance bottleneck, unless I go to 40 or 100Gbit on the network.
I'd just really want to be able to choose between, say, 1, 2 or 3 out of 9 bricks being
used for redundancy, depending on whether it's an HCI block next door, going onto a ship
with months at sea or into a space station.
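
What that choice would trade off, as a minimal sketch. The helper disperse_profile is a made-up name, and the check that redundancy must be less than half the bricks reflects my understanding of Gluster's dispersed volumes rather than anything measured here:

```python
def disperse_profile(total_bricks: int, redundancy: int) -> dict:
    """Usable capacity and failure tolerance for a dispersed set of bricks."""
    if redundancy < 1 or total_bricks <= 2 * redundancy:
        raise ValueError("redundancy must be at least 1 and less than half the bricks")
    data_bricks = total_bricks - redundancy
    return {
        "layout": f"{data_bricks}+{redundancy}",
        "tolerated_failures": redundancy,
        "usable_fraction": data_bricks / total_bricks,
    }


for redundancy in (1, 2, 3):
    profile = disperse_profile(9, redundancy)
    print(
        f"{profile['layout']}: survives {profile['tolerated_failures']} brick failure(s), "
        f"{profile['usable_fraction']:.0%} of raw capacity usable"
    )
```

For comparison, the 4:2 layout on 6 nodes gives the same usable fraction as 6+3 on 9 nodes, but tolerates one failure less.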
I'd also probably add an extra node or two to act as warm (even cold) standby in
critical or hard-to-reach locations, that act as compute-only nodes initially (to avoid
split quotas), but can be promoted to replace a storage node that failed without hands-on
intervention.
oVirt HCI is as close as it gets to LEGO computers, but right now it's doing LEGO with
your hands tied behind your back.
Kind regards, Thomas