[ovirt-users] Replicated Glusterfs on top of ZFS
Darrell Budic
budic at onholyground.com
Fri Mar 3 18:00:45 UTC 2017
Why are you using an arbiter if all your HW configs are identical? I’d use a true replica 3 in this case.
Also, in my experience with gluster and VM hosting, the ZIL/slog degrades write performance unless it’s a truly dedicated disk. I have 8 spinners backing my ZFS volumes, though, so a shared SATA disk didn’t make a good ZIL for me. If yours is dedicated SAS, keep it; if it’s SATA, try testing without it.
You don’t have compression enabled on your zfs volume, and I’d recommend enabling relatime on it. Depending on the amount of RAM in these boxes, you probably want to limit your zfs arc size to 8G or so (1/4 of total RAM or less).

Gluster just works volumes hard during a rebuild. What’s the problem you’re seeing? If it’s affecting your VMs, using sharding and tuning client & server threads can help avoid interruptions to your VMs while repairs are running. If you really need to limit it, you can use cgroups to keep it from hogging all the CPU, but it takes longer to heal, of course. There are a couple of older posts and blogs about it, if you go back a while.
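As a rough sketch of the ARC limit suggested above: on ZFS on Linux the cap is the zfs_arc_max module parameter, given in bytes. The Python below picks the smaller of 8 GiB and 1/4 of RAM and prints an options line of the kind that usually goes in /etc/modprobe.d/zfs.conf. The helper names are made up for illustration, and the 64 GiB figure is only an assumed example, not something reported in this thread.

```python
import os

GIB = 1024 ** 3


def total_ram_bytes():
    """Physical RAM of the local machine, via Linux/Unix sysconf values."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def arc_max_bytes(ram_bytes, cap_bytes=8 * GIB, fraction=0.25):
    """Return the smaller of a fixed cap and a fraction of RAM, in bytes."""
    return min(cap_bytes, int(ram_bytes * fraction))


def modprobe_line(limit_bytes):
    """Format the limit as a zfs kernel module option line."""
    return f"options zfs zfs_arc_max={limit_bytes}"


if __name__ == "__main__":
    # Assumed example: a 64 GiB host. Use total_ram_bytes() on a real box.
    print(modprobe_line(arc_max_bytes(64 * GIB)))
```

For the assumed 64 GiB host this prints "options zfs zfs_arc_max=8589934592". The option is read when the module loads; on a running system the same number can be written to /sys/module/zfs/parameters/zfs_arc_max.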
> On Mar 3, 2017, at 9:02 AM, Arman Khalatyan <arm2arm at gmail.com> wrote:
>
> The problem itself is not the streaming data performance, and a dd of zeros does not help much on a production zfs running with compression.
> The main problem comes when gluster starts doing something with the data: it uses xattrs, and accessing extended attributes in zfs is probably slower than in XFS.
> Even a simple find or ls -l in the .glusterfs folders takes ages:
>
> Now I can see that the arbiter host has almost 100% cache misses during the rebuild, which is actually natural since it is always reading new datasets:
> [root at clei26 ~]# arcstat.py 1
>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> 15:57:31    29    29    100    29  100     0    0    29  100   685M   31G
> 15:57:32   530   476     89   476   89     0    0   457   89   685M   31G
> 15:57:33   480   467     97   467   97     0    0   463   97   685M   31G
> 15:57:34   452   443     98   443   98     0    0   435   97   685M   31G
> 15:57:35   582   547     93   547   93     0    0   536   94   685M   31G
> 15:57:36   439   417     94   417   94     0    0   393   94   685M   31G
> 15:57:38   435   392     90   392   90     0    0   374   89   685M   31G
> 15:57:39   364   352     96   352   96     0    0   352   96   685M   31G
> 15:57:40   408   375     91   375   91     0    0   360   91   685M   31G
> 15:57:41   552   539     97   539   97     0    0   539   97   685M   31G
>
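To put a single number on the arbiter's arcstat sample quoted above, here is a small Python sketch that sums the read and miss columns. It is not part of arcstat.py; the function name is made up for illustration, and it assumes plain integer counts, so rows where arcstat abbreviates a value (for example 1.2K) are skipped.

```python
# arcstat rows from the arbiter host, as quoted in this thread.
SAMPLE = """\
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
15:57:31 29 29 100 29 100 0 0 29 100 685M 31G
15:57:32 530 476 89 476 89 0 0 457 89 685M 31G
15:57:33 480 467 97 467 97 0 0 463 97 685M 31G
15:57:34 452 443 98 443 98 0 0 435 97 685M 31G
15:57:35 582 547 93 547 93 0 0 536 94 685M 31G
15:57:36 439 417 94 417 94 0 0 393 94 685M 31G
15:57:38 435 392 90 392 90 0 0 374 89 685M 31G
15:57:39 364 352 96 352 96 0 0 352 96 685M 31G
15:57:40 408 375 91 375 91 0 0 360 91 685M 31G
15:57:41 552 539 97 539 97 0 0 539 97 685M 31G
"""


def overall_miss_rate(text):
    """Percentage of ARC reads that missed, over all parsable rows."""
    reads = misses = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and abbreviated counts like 1.2K.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        reads += int(fields[1])
        misses += int(fields[2])
    return 100.0 * misses / reads if reads else 0.0


if __name__ == "__main__":
    print(f"{overall_miss_rate(SAMPLE):.1f}% of ARC reads missed")
```

Over these ten samples that is 4037 misses out of 4271 reads, about 94.5%.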
> It looks like we cannot have performance and reliability in the same system :(
> My final conclusion is simply that with a single disk + SSD, even zfs does not help to speed up glusterfs healing.
> I will stop here :)
>
>
>
>
> On Fri, Mar 3, 2017 at 3:35 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> cd into the pool path,
> then dd if=/dev/zero of=test.tt bs=1M
> Leave it running for 5-10 minutes,
> then hit ctrl+c and paste the result here.
> etc.
>
> 2017-03-03 11:30 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> No, I have one pool made of one disk, with the SSD as cache and log device.
> I have 3 Glusterfs bricks on 3 separate hosts: Volume type Replicate (Arbiter) = replica 2+1!
> That is as much as you can push into the compute nodes (they have only 3 disk slots).
>
>
> On Fri, Mar 3, 2017 at 3:19 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> ok, you have 3 pools, zclei22, logs and cache, that's wrong. You should have 1 pool, with zlog+cache, if you are looking for performance.
> Also, don't mix drives.
> What's the performance issue you are facing?
>
>
> regards,
>
> 2017-03-03 11:00 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> This is CentOS 7.3 ZoL version 0.6.5.9-1
>
> [root at clei22 ~]# lsscsi
> [2:0:0:0]    disk    ATA      INTEL SSDSC2CW24 400i  /dev/sda
> [3:0:0:0]    disk    ATA      HGST HUS724040AL AA70  /dev/sdb
> [4:0:0:0]    disk    ATA      WDC WD2002FYPS-0 1G01  /dev/sdc
>
> [root at clei22 ~]# pvs ;vgs;lvs
>   PV                                                 VG            Fmt  Attr PSize   PFree
>   /dev/mapper/INTEL_SSDSC2CW240A3_CVCV306302RP240CGN vg_cache      lvm2 a--  223.57g     0
>   /dev/sdc2                                          centos_clei22 lvm2 a--    1.82t 64.00m
>   VG            #PV #LV #SN Attr   VSize   VFree
>   centos_clei22   1   3   0 wz--n-   1.82t 64.00m
>   vg_cache        1   2   0 wz--n- 223.57g     0
>   LV       VG            Attr       LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert
>   home     centos_clei22 -wi-ao----   1.74t
>   root     centos_clei22 -wi-ao----  50.00g
>   swap     centos_clei22 -wi-ao----  31.44g
>   lv_cache vg_cache      -wi-ao---- 213.57g
>   lv_slog  vg_cache      -wi-ao----  10.00g
>
> [root at clei22 ~]# zpool status -v
>   pool: zclei22
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Tue Feb 28 14:16:07 2017
> config:
>
>         NAME                                    STATE     READ WRITE CKSUM
>         zclei22                                 ONLINE       0     0     0
>           HGST_HUS724040ALA640_PN2334PBJ4SV6T1  ONLINE       0     0     0
>         logs
>           lv_slog                               ONLINE       0     0     0
>         cache
>           lv_cache                              ONLINE       0     0     0
>
> errors: No known data errors
>
> ZFS config:
>
> [root at clei22 ~]# zfs get all zclei22/01
> NAME        PROPERTY              VALUE                  SOURCE
> zclei22/01  type                  filesystem             -
> zclei22/01  creation              Tue Feb 28 14:06 2017  -
> zclei22/01  used                  389G                   -
> zclei22/01  available             3.13T                  -
> zclei22/01  referenced            389G                   -
> zclei22/01  compressratio         1.01x                  -
> zclei22/01  mounted               yes                    -
> zclei22/01  quota                 none                   default
> zclei22/01  reservation           none                   default
> zclei22/01  recordsize            128K                   local
> zclei22/01  mountpoint            /zclei22/01            default
> zclei22/01  sharenfs              off                    default
> zclei22/01  checksum              on                     default
> zclei22/01  compression           off                    local
> zclei22/01  atime                 on                     default
> zclei22/01  devices               on                     default
> zclei22/01  exec                  on                     default
> zclei22/01  setuid                on                     default
> zclei22/01  readonly              off                    default
> zclei22/01  zoned                 off                    default
> zclei22/01  snapdir               hidden                 default
> zclei22/01  aclinherit            restricted             default
> zclei22/01  canmount              on                     default
> zclei22/01  xattr                 sa                     local
> zclei22/01  copies                1                      default
> zclei22/01  version               5                      -
> zclei22/01  utf8only              off                    -
> zclei22/01  normalization         none                   -
> zclei22/01  casesensitivity       sensitive              -
> zclei22/01  vscan                 off                    default
> zclei22/01  nbmand                off                    default
> zclei22/01  sharesmb              off                    default
> zclei22/01  refquota              none                   default
> zclei22/01  refreservation        none                   default
> zclei22/01  primarycache          metadata               local
> zclei22/01  secondarycache        metadata               local
> zclei22/01  usedbysnapshots       0                      -
> zclei22/01  usedbydataset         389G                   -
> zclei22/01  usedbychildren        0                      -
> zclei22/01  usedbyrefreservation  0                      -
> zclei22/01  logbias               latency                default
> zclei22/01  dedup                 off                    default
> zclei22/01  mlslabel              none                   default
> zclei22/01  sync                  disabled               local
> zclei22/01  refcompressratio      1.01x                  -
> zclei22/01  written               389G                   -
> zclei22/01  logicalused           396G                   -
> zclei22/01  logicalreferenced     396G                   -
> zclei22/01  filesystem_limit      none                   default
> zclei22/01  snapshot_limit        none                   default
> zclei22/01  filesystem_count      none                   default
> zclei22/01  snapshot_count        none                   default
> zclei22/01  snapdev               hidden                 default
> zclei22/01  acltype               off                    default
> zclei22/01  context               none                   default
> zclei22/01  fscontext             none                   default
> zclei22/01  defcontext            none                   default
> zclei22/01  rootcontext           none                   default
> zclei22/01  relatime              off                    default
> zclei22/01  redundant_metadata    all                    default
> zclei22/01  overlay               off                    default
>
> On Fri, Mar 3, 2017 at 2:52 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> Which operating system version are you using for your zfs storage?
> do:
> zfs get all your-pool-name
> use arc_summary.py from freenas git repo if you wish.
>
>
> 2017-03-03 10:33 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> Pool load:
> [root at clei21 ~]# zpool iostat -v 1
>                                            capacity     operations    bandwidth
> pool                                    alloc   free   read  write   read  write
> --------------------------------------  -----  -----  -----  -----  -----  -----
> zclei21                                 10.1G  3.62T      0    112    823  8.82M
>   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0     46    626  4.40M
> logs                                        -      -      -      -      -      -
>   lv_slog                                225M  9.72G      0     66    198  4.45M
> cache                                       -      -      -      -      -      -
>   lv_cache                              9.81G   204G      0     46     56  4.13M
> --------------------------------------  -----  -----  -----  -----  -----  -----
>
>                                            capacity     operations    bandwidth
> pool                                    alloc   free   read  write   read  write
> --------------------------------------  -----  -----  -----  -----  -----  -----
> zclei21                                 10.1G  3.62T      0    191      0  12.8M
>   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0      0      0      0
> logs                                        -      -      -      -      -      -
>   lv_slog                                225M  9.72G      0    191      0  12.8M
> cache                                       -      -      -      -      -      -
>   lv_cache                              9.83G   204G      0    218      0  20.0M
> --------------------------------------  -----  -----  -----  -----  -----  -----
>
>                                            capacity     operations    bandwidth
> pool                                    alloc   free   read  write   read  write
> --------------------------------------  -----  -----  -----  -----  -----  -----
> zclei21                                 10.1G  3.62T      0    191      0  12.7M
>   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0      0      0      0
> logs                                        -      -      -      -      -      -
>   lv_slog                                225M  9.72G      0    191      0  12.7M
> cache                                       -      -      -      -      -      -
>   lv_cache                              9.83G   204G      0     72      0  7.68M
> --------------------------------------  -----  -----  -----  -----  -----  -----
>
>
> On Fri, Mar 3, 2017 at 2:32 PM, Arman Khalatyan <arm2arm at gmail.com> wrote:
> Glusterfs now in healing mode:
> Receiver:
> [root at clei21 ~]# arcstat.py 1
>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> 13:24:49     0     0      0     0    0     0    0     0    0   4.6G   31G
> 13:24:50   154    80     51    80   51     0    0    80   51   4.6G   31G
> 13:24:51   179    62     34    62   34     0    0    62   42   4.6G   31G
> 13:24:52   148    68     45    68   45     0    0    68   45   4.6G   31G
> 13:24:53   140    64     45    64   45     0    0    64   45   4.6G   31G
> 13:24:54   124    48     38    48   38     0    0    48   38   4.6G   31G
> 13:24:55   157    80     50    80   50     0    0    80   50   4.7G   31G
> 13:24:56   202    68     33    68   33     0    0    68   41   4.7G   31G
> 13:24:57   127    54     42    54   42     0    0    54   42   4.7G   31G
> 13:24:58   126    50     39    50   39     0    0    50   39   4.7G   31G
> 13:24:59   116    40     34    40   34     0    0    40   34   4.7G   31G
>
>
> Sender
> [root at clei22 ~]# arcstat.py 1
>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> 13:28:37     8     2     25     2   25     0    0     2   25   468M   31G
> 13:28:38  1.2K   727     62   727   62     0    0   525   54   469M   31G
> 13:28:39   815   508     62   508   62     0    0   376   55   469M   31G
> 13:28:40   994   624     62   624   62     0    0   450   54   469M   31G
> 13:28:41   783   456     58   456   58     0    0   338   50   470M   31G
> 13:28:42   916   541     59   541   59     0    0   390   50   470M   31G
> 13:28:43   768   437     56   437   57     0    0   313   48   471M   31G
> 13:28:44   877   534     60   534   60     0    0   393   53   470M   31G
> 13:28:45   957   630     65   630   65     0    0   450   57   470M   31G
> 13:28:46   819   479     58   479   58     0    0   357   51   471M   31G
>
>
> On Thu, Mar 2, 2017 at 7:18 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> hey,
> what are you using for zfs? Please get the ARC status and post it.
>
>
> 2017-03-02 9:57 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> no,
> ZFS itself is not on top of lvm. Only the SSD was split by lvm into slog (10G) and cache (the rest).
> But in any case the SSD does not help much with the glusterfs/ovirt load: it has almost 100% cache misses... :( (terrible performance compared with nfs)
>
>
>
>
>
> On Thu, Mar 2, 2017 at 1:47 PM, FERNANDO FREDIANI <fernando.frediani at upx.com> wrote:
> Am I understanding correctly that you have Gluster on top of ZFS, which is on top of LVM? If so, why was LVM necessary? I have ZFS without any need for LVM.
>
> Fernando
>
> On 02/03/2017 06:19, Arman Khalatyan wrote:
>> Hi,
>> I use 3 nodes with zfs and glusterfs.
>> Are there any suggestions to optimize it?
>>
>> host zfs config 4TB-HDD+250GB-SSD:
>> [root at clei22 ~]# zpool status
>>   pool: zclei22
>>  state: ONLINE
>>   scan: scrub repaired 0 in 0h0m with 0 errors on Tue Feb 28 14:16:07 2017
>> config:
>>
>>         NAME                                    STATE     READ WRITE CKSUM
>>         zclei22                                 ONLINE       0     0     0
>>           HGST_HUS724040ALA640_PN2334PBJ4SV6T1  ONLINE       0     0     0
>>         logs
>>           lv_slog                               ONLINE       0     0     0
>>         cache
>>           lv_cache                              ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> Name: GluReplica
>> Volume ID: ee686dfe-203a-4caa-a691-26353460cc48
>> Volume Type: Replicate (Arbiter)
>> Replica Count: 2 + 1
>> Number of Bricks: 3
>> Transport Types: TCP, RDMA
>> Maximum no of snapshots: 256
>> Capacity: 3.51 TiB total, 190.56 GiB used, 3.33 TiB free
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users at ovirt.org <mailto:Users at ovirt.org>
>> http://lists.ovirt.org/mailman/listinfo/users <http://lists.ovirt.org/mailman/listinfo/users>
>
>