This is needed to prevent inconsistencies stemming from buffered writes and
cached file data during live VM migration.
Additionally, for Gluster to truly honor direct-io behavior in qemu's
'cache=none' mode (which is what oVirt uses),
one needs to turn on performance.strict-o-direct and disable network.remote-dio.
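For reference, a minimal sketch of verifying the effective values after
making the change (the volume name is a placeholder):

# gluster volume get <VOLNAME> performance.strict-o-direct
# gluster volume get <VOLNAME> network.remote-dio

Note that running VMs need to be restarted for the new behavior to take
effect on their already-open files.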
-Krutika
On Wed, Mar 27, 2019 at 12:24 PM Leo David <leoalex(a)gmail.com> wrote:
Hi,
I can confirm that after setting these two options, I haven't encountered
disk corruption anymore.
The downside is that, at least for me, it had a pretty big impact on
performance: IOPS really went down when running fio tests inside the VMs,
as sketched below.
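(For anyone who wants to run a similar measurement, a hypothetical fio
invocation for an in-guest direct-I/O random-write test might look like
the following; every parameter is an illustrative assumption, not my
exact test:

# fio --name=randwrite --filename=/root/fio.test --size=1G \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=32 --runtime=60 --time_based --group_reporting

--direct=1 is the important bit, since performance.strict-o-direct
specifically affects how O_DIRECT I/O is handled.)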
On Wed, Mar 27, 2019, 07:03 Krutika Dhananjay <kdhananj(a)redhat.com> wrote:
> Could you enable strict-o-direct and disable remote-dio on the src volume
> as well, restart the VMs on "old", and retry the migration?
>
> # gluster volume set <VOLNAME> performance.strict-o-direct on
> # gluster volume set <VOLNAME> network.remote-dio off
>
> -Krutika
>
> On Tue, Mar 26, 2019 at 10:32 PM Sander Hoentjen <sander(a)hoentjen.eu>
> wrote:
>
>> On 26-03-19 14:23, Sahina Bose wrote:
>> > +Krutika Dhananjay and gluster ml
>> >
>> > On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen <sander(a)hoentjen.eu>
>> > wrote:
>> >> Hello,
>> >>
>> >> tl;dr We have disk corruption when doing live storage migration on
>> >> oVirt 4.2 with Gluster 3.12.15. Any idea why?
>> >>
>> >> We have a 3-node oVirt cluster that is both compute and
>> >> gluster-storage. The manager runs on separate hardware. We are running
>> >> out of space on the existing volume, so we added another, bigger
>> >> Gluster volume, put a storage domain on it, and then migrated VMs to
>> >> it with LSM. After some time, we noticed that (some of) the migrated
>> >> VMs had corrupted filesystems. After moving everything back to the old
>> >> domain with export/import where possible, and recovering from backups
>> >> where needed, we set off to investigate this issue.
>> >>
>> >> We are now at the point where we can reproduce this issue within a
>> >> day.
>> >> What we have found so far:
>> >> 1) The corruption occurs at the very end of the replication step,
>> >> most probably between the START and FINISH of diskReplicateFinish,
>> >> before the START of the merge step.
>> >> 2) In the corrupted VM, at some place where data should be, the data
>> >> is replaced by zeros. This can be file contents, a directory
>> >> structure, or whatever (see the sketch after this list for one
>> >> hypothetical way to locate such regions).
>> >> 3) The source Gluster volume has different settings than the
>> >> destination (mostly because the defaults were different at creation
>> >> time):
>> >>
>> >> Setting                        old (src)   new (dst)
>> >> cluster.op-version             30800       30800 (same)
>> >> cluster.max-op-version         31202       31202 (same)
>> >> cluster.metadata-self-heal     off         on
>> >> cluster.data-self-heal         off         on
>> >> cluster.entry-self-heal        off         on
>> >> performance.low-prio-threads   16          32
>> >> performance.strict-o-direct    off         on
>> >> network.ping-timeout           42          30
>> >> network.remote-dio             enable      off
>> >> transport.address-family       -           inet
>> >> performance.stat-prefetch      off         on
>> >> features.shard-block-size      512MB       64MB
>> >> cluster.shd-max-threads        1           8
>> >> cluster.shd-wait-qlength       1024        10000
>> >> cluster.locking-scheme         full        granular
>> >> cluster.granular-entry-heal    no          enable
>> >>
>> >> 4) To test, we migrate some VMs back and forth. The corruption does
>> >> not occur every time. So far it has only occurred from old to new, but
>> >> we don't have enough data points to be sure about that.
>> >>
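>> >> For illustration, one hypothetical way to locate the zeroed regions
>> >> mentioned in point 2, assuming a pre-migration copy of the disk image
>> >> is still available (both paths are placeholders):
>> >>
>> >> # qemu-img compare /path/to/old/disk.img /path/to/new/disk.img
>> >> # qemu-img map --output=json /path/to/new/disk.img
>> >>
>> >> qemu-img compare reports the first offset at which the two images
>> >> differ, and qemu-img map shows which ranges of an image read back as
>> >> zeroes.
>> >>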
>> >> Does anybody have an idea what is causing the corruption? Is this the
>> >> best list to ask, or should I ask on a Gluster list? I am not sure
>> >> whether this is oVirt-specific or Gluster-specific, though.
>> > Do you have logs from old and new gluster volumes? Any errors in the
>> > new volume's fuse mount logs?
>>
>> Around the time of the corruption I see this message:
>> The message "I [MSGID: 133017] [shard.c:4941:shard_seek]
>> 0-ZoneA_Gluster1-shard: seek called on
>> 7fabc273-3d8a-4a49-8906-b8ccbea4a49f. [Operation not supported]" repeated
>> 231 times between [2019-03-26 13:14:22.297333] and [2019-03-26
>> 13:15:42.912170]
>>
>> However, I also see this message at other times, when I don't observe
>> any corruption.
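>>
>> (For reference, a hypothetical way to pull these messages from the new
>> volume's fuse mount log; the exact log path is an assumption based on
>> how oVirt typically names its glusterSD mount logs:
>>
>> # grep shard_seek /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<host>:_<volname>.log
>> )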
>>
>> --
>> Sander