Hello,
tl;dr: We see disk corruption when doing live storage migration on oVirt
4.2 with Gluster 3.12.15. Any idea why?
We have a 3-node oVirt cluster in which every node provides both compute
and Gluster storage. The manager runs on separate hardware. We were
running out of space on the existing Gluster volume, so we added a
second, bigger Gluster volume, put a storage domain on it, and then
migrated VMs to it with live storage migration (LSM). After some time,
we noticed that some of the migrated VMs had corrupted filesystems.
After moving everything back to the old domain with export-import where
possible, and recovering from backups where needed, we set out to
investigate this issue.
We are now at the point where we can reproduce this issue within a day.
What we have found so far:
1) The corruption occurs at the very end of the replication step, most
probably between START and FINISH of diskReplicateFinish, before the
START merge step.
2) In a corrupted VM, data that should be present is replaced by zeros
in some places. This can be file contents, a directory structure, or
anything else.
3) The source Gluster volume has different settings than the destination
(mostly because the defaults were different at creation time):
Setting                       old (src)  new (dst)
cluster.op-version            30800      30800 (the same)
cluster.max-op-version        31202      31202 (the same)
cluster.metadata-self-heal    off        on
cluster.data-self-heal        off        on
cluster.entry-self-heal       off        on
performance.low-prio-threads  16         32
performance.strict-o-direct   off        on
network.ping-timeout          42         30
network.remote-dio            enable     off
transport.address-family      -          inet
performance.stat-prefetch     off        on
features.shard-block-size     512MB      64MB
cluster.shd-max-threads       1          8
cluster.shd-wait-qlength      1024       10000
cluster.locking-scheme        full       granular
cluster.granular-entry-heal   no         enable
4) To test, we migrate some VMs back and forth. The corruption does not
occur every time. So far it has only occurred when migrating from old to
new, but we don't have enough data points to be sure about that.
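To make point 2 concrete, here is a minimal sketch of how zeroed regions
could be located after a test migration. It assumes two raw images of the
same disk are available as ordinary files (for example a known-good copy
and the copy taken after migration). The function name
find_zeroed_regions, the file paths and the 4096-byte block size are all
hypothetical; this is not something oVirt or Gluster provides.

```python
from typing import Iterator, Tuple


def find_zeroed_regions(reference_path: str, candidate_path: str,
                        block_size: int = 4096) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) runs where the candidate image contains only
    zeros but the reference image contains data.

    Assumption: both files are raw images of the same disk. Comparison
    stops at the end of the shorter file.
    """
    zero_block = bytes(block_size)
    run_start = None  # start offset of the current zeroed run, if any
    offset = 0
    with open(reference_path, "rb") as ref, open(candidate_path, "rb") as cand:
        while True:
            ref_block = ref.read(block_size)
            cand_block = cand.read(block_size)
            if not ref_block or not cand_block:
                break
            n = min(len(ref_block), len(cand_block))
            # A block counts as "zeroed" only if the reference had data there.
            zeroed = (cand_block[:n] == zero_block[:n]
                      and ref_block[:n] != zero_block[:n])
            if zeroed and run_start is None:
                run_start = offset
            elif not zeroed and run_start is not None:
                yield run_start, offset - run_start
                run_start = None
            offset += n
    # Close a run that extends to the end of the compared range.
    if run_start is not None:
        yield run_start, offset - run_start
```

Calling list(find_zeroed_regions("good.img", "migrated.img")) with two
such (hypothetical) files would give the byte ranges to inspect. Note
that a VM that was running between the two copies will have legitimate
differences too, so this only narrows down where to look.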
Does anybody have an idea what is causing the corruption? Is this the
best list to ask on, or should I ask on a Gluster list? I am not sure
whether this is oVirt-specific or Gluster-specific.
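For anyone who wants to produce a comparison like the table in point 3
for their own volumes, the following is a rough sketch. It assumes the
option listings were saved as text from "gluster volume get <VOLNAME>
all", and that this output is a two-column "Option Value" listing with a
header and a dashed separator line; that format is an assumption here.
The helper names parse_options and diff_options are made up for
illustration.

```python
from typing import Dict, Tuple


def parse_options(text: str) -> Dict[str, str]:
    """Parse 'name value' lines into a dict, skipping the header row,
    dashed separator rows, and lines that have no value column."""
    options: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, value = parts[0], parts[1].strip()
        if name == "Option" or set(name) == {"-"}:
            continue
        options[name] = value
    return options


def diff_options(old: Dict[str, str],
                 new: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {option: (old_value, new_value)} for options that differ.
    An option missing on one side is shown as '-'."""
    diff: Dict[str, Tuple[str, str]] = {}
    for name in sorted(set(old) | set(new)):
        a = old.get(name, "-")
        b = new.get(name, "-")
        if a != b:
            diff[name] = (a, b)
    return diff


if __name__ == "__main__":
    # Sample values taken from the table above; the layout is assumed.
    old_text = "Option Value\n------ -----\nnetwork.ping-timeout 42\n"
    new_text = "Option Value\n------ -----\nnetwork.ping-timeout 30\n"
    for name, (a, b) in diff_options(parse_options(old_text),
                                     parse_options(new_text)).items():
        print(f"{name:<30} {a:<10} {b}")
```

Only options that differ are reported, so identical entries such as
cluster.op-version would be left out of the result.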
Kind regards,
Sander Hoentjen