Hi Krutika, Leo,
Sounds promising. I will test this too, and report back tomorrow (or
maybe sooner, if corruption occurs again).
-- Sander
On 27-03-19 10:00, Krutika Dhananjay wrote:
This is needed to prevent any inconsistencies stemming from buffered
writes or cached file data during live VM migration.
Besides, for Gluster to truly honor direct-io behavior in qemu's
'cache=none' mode (which is what oVirt uses),
one needs to turn on performance.strict-o-direct and disable remote-dio.
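To double-check the cache mode your VMs are actually running with,
something along these lines on the hypervisor should show it (the
qemu process name may differ per distribution):
# pgrep -af qemu | grep -o 'cache=[a-z]*' | sort | uniq -c
Disks reported as cache=none are opened by qemu with O_DIRECT, which
is the case described above.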
-Krutika
On Wed, Mar 27, 2019 at 12:24 PM Leo David <leoalex@gmail.com> wrote:
Hi,
I can confirm that after setting these two options, I haven't
encountered disk corruption anymore.
The downside is that, at least for me, it had a pretty big impact on
performance: IOPS went down noticeably in fio tests run inside the VM.
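For reference, the kind of fio run I mean, executed inside the VM
before and after enabling the options (file path, size, and job
parameters are just examples; results will be workload-dependent):
# fio --name=randwrite --ioengine=libaio --direct=1 --rw=randwrite \
      --bs=4k --iodepth=32 --size=1G --runtime=60 --time_based \
      --filename=/var/tmp/fio.test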
On Wed, Mar 27, 2019, 07:03 Krutika Dhananjay
<kdhananj@redhat.com> wrote:
Could you enable strict-o-direct and disable remote-dio on the
source volume as well, restart the VMs on the "old" volume, and
retry the migration?
# gluster volume set <VOLNAME> performance.strict-o-direct on
# gluster volume set <VOLNAME> network.remote-dio off
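To verify the options took effect (assuming your gluster build
supports "volume get"):
# gluster volume get <VOLNAME> performance.strict-o-direct
# gluster volume get <VOLNAME> network.remote-dio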
-Krutika
On Tue, Mar 26, 2019 at 10:32 PM Sander Hoentjen
<sander@hoentjen.eu> wrote:
On 26-03-19 14:23, Sahina Bose wrote:
> +Krutika Dhananjay and gluster ml
>
> On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen
> <sander@hoentjen.eu> wrote:
>> Hello,
>>
>> tl;dr We have disk corruption when doing live storage migration on
>> oVirt 4.2 with gluster 3.12.15. Any idea why?
>>
>> We have a 3-node oVirt cluster that is both compute and
>> gluster-storage. The manager runs on separate hardware. We are
>> running out of space on this volume, so we added another Gluster
>> volume that is bigger, put a storage domain on it, and then migrated
>> VMs to it with LSM. After some time, we noticed that (some of) the
>> migrated VMs had corrupted filesystems. After moving everything back
>> with export-import to the old domain where possible, and recovering
>> from backups where needed, we set off to investigate this issue.
>>
>> We are now at the point where we can reproduce this issue within a
>> day. What we have found so far:
>> 1) The corruption occurs at the very end of the replication step,
>> most probably between the START and FINISH of diskReplicateFinish,
>> before the START of the merge step.
>> 2) In the corrupted VM, at some places where data should be, the
>> data is replaced by zeros. This can be file contents, a directory
>> structure, or whatever.
>> 3) The source gluster volume has different settings than the
>> destination (mostly because the defaults were different at creation
>> time); a way to diff the full option sets is sketched after this
>> list:
>>
>> Setting                        old (src)   new (dst)
>> cluster.op-version             30800       30800 (the same)
>> cluster.max-op-version         31202       31202 (the same)
>> cluster.metadata-self-heal     off         on
>> cluster.data-self-heal         off         on
>> cluster.entry-self-heal        off         on
>> performance.low-prio-threads   16          32
>> performance.strict-o-direct    off         on
>> network.ping-timeout           42          30
>> network.remote-dio             enable      off
>> transport.address-family       -           inet
>> performance.stat-prefetch      off         on
>> features.shard-block-size      512MB       64MB
>> cluster.shd-max-threads        1           8
>> cluster.shd-wait-qlength       1024        10000
>> cluster.locking-scheme         full        granular
>> cluster.granular-entry-heal    no          enable
>>
>> 4) To test, we migrate some VMs back and forth. The corruption does
>> not occur every time. So far it has only occurred from old to new,
>> but we don't have enough data points to be sure about that.
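A quick way to capture and diff the complete option sets of the two
volumes, as referenced in point 3 above (assuming your gluster build
supports "volume get ... all"; the volume names are placeholders):
# gluster volume get <OLD-VOLNAME> all > /tmp/old-vol.opts
# gluster volume get <NEW-VOLNAME> all > /tmp/new-vol.opts
# diff /tmp/old-vol.opts /tmp/new-vol.opts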
>>
>> Does anybody have an idea what is causing the corruption? Is this
>> the best list to ask, or should I ask on a Gluster list? I am not
>> sure whether this is oVirt-specific or Gluster-specific, though.
> Do you have logs from old and new gluster volumes? Any errors in
> the new volume's fuse mount logs?
Around the time of corruption I see the message:

The message "I [MSGID: 133017] [shard.c:4941:shard_seek]
0-ZoneA_Gluster1-shard: seek called on
7fabc273-3d8a-4a49-8906-b8ccbea4a49f. [Operation not supported]"
repeated 231 times between [2019-03-26 13:14:22.297333] and
[2019-03-26 13:15:42.912170]

I also see this message at other times, though, when I don't see any
corruption occur.
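To correlate those messages with the migration phases, greps along
these lines on the hypervisor may help; the log paths are the usual
defaults on an oVirt host and may differ on other setups:
# grep 'MSGID: 133017' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log
# grep 'diskReplicateFinish' /var/log/vdsm/vdsm.log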
--
Sander