[ovirt-users] Replicated Glusterfs on top of ZFS

Darrell Budic budic at onholyground.com
Fri Mar 3 18:00:45 UTC 2017


Why are you using an arbiter if all your HW configs are identical? I’d use a true replica 3 in this case.
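For reference, a plain replica 3 across your three boxes would be created along these lines (the volume name and brick paths here are only illustrative, adjust to your layout):

gluster volume create gv-r3 replica 3 clei21:/zclei21/01/brick clei22:/zclei22/01/brick clei26:/zclei26/01/brick
gluster volume start gv-r3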

Also, in my experience with gluster and VM hosting, a ZIL/slog degrades write performance unless it’s a truly dedicated disk. I have 8 spinners backing my ZFS volumes, and trying to share a SATA disk didn’t make for a good ZIL. If yours is a dedicated SAS device, keep it; if it’s SATA, try testing without it.
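If you want to test that, the slog can be detached and re-attached without rebuilding the pool; assuming the pool and device names shown in your zpool status and lvs output:

zpool remove zclei22 lv_slog                    # detach the slog
# ...run your workload / healing test...
zpool add zclei22 log /dev/vg_cache/lv_slog     # put it back afterwards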

You don’t have compression enabled on your ZFS volume, and I’d recommend enabling relatime on it. Depending on the amount of RAM in these boxes, you probably also want to limit your ZFS ARC size to 8G or so (1/4 of total RAM or less). Gluster just works volumes hard during a rebuild; what is the problem you’re seeing? If it’s affecting your VMs, enabling sharding and tuning the client & server event threads can help avoid interruptions to your VMs while repairs are running. If you really need to throttle the healing, you can use cgroups to keep it from hogging all the CPU, but then the heal takes longer, of course. There are a couple of older posts and blogs about this if you go back a while.
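Roughly what that looks like, as a sketch (the 8G ARC value is just the example above, and the gluster option names assume a reasonably recent 3.7/3.8 release; double-check them against your version):

zfs set compression=lz4 zclei22/01
zfs set relatime=on zclei22/01
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf   # 8G cap, applied on module reload/reboot
gluster volume set GluReplica features.shard on
gluster volume set GluReplica cluster.shd-max-threads 2
gluster volume set GluReplica client.event-threads 4
gluster volume set GluReplica server.event-threads 4

Keep in mind sharding only applies to files created after it is enabled, so existing VM images stay unsharded.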


> On Mar 3, 2017, at 9:02 AM, Arman Khalatyan <arm2arm at gmail.com> wrote:
> 
> The problem itself is not the streaming data performance, and a dd of zeros doesn't tell you much anyway on a production ZFS dataset running with compression.
> The main problem starts when gluster begins working with the data: it relies heavily on xattrs, and accessing extended attributes inside ZFS is probably slower than on XFS.
> Even a simple find or an ls -l inside the .glusterfs folders takes ages.
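> As an illustration, the per-file gluster xattrs can be dumped with something like this (the brick path is just an example):
> getfattr -d -m . -e hex /zclei22/01/brick/somefile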
> 
> Now I can see that the arbiter host has almost a 100% cache miss rate during the rebuild, which is natural since it is always reading new data:
> [root@clei26 ~]# arcstat.py 1
>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
> 15:57:31    29    29    100    29  100     0    0    29  100   685M   31G  
> 15:57:32   530   476     89   476   89     0    0   457   89   685M   31G  
> 15:57:33   480   467     97   467   97     0    0   463   97   685M   31G  
> 15:57:34   452   443     98   443   98     0    0   435   97   685M   31G  
> 15:57:35   582   547     93   547   93     0    0   536   94   685M   31G  
> 15:57:36   439   417     94   417   94     0    0   393   94   685M   31G  
> 15:57:38   435   392     90   392   90     0    0   374   89   685M   31G  
> 15:57:39   364   352     96   352   96     0    0   352   96   685M   31G  
> 15:57:40   408   375     91   375   91     0    0   360   91   685M   31G  
> 15:57:41   552   539     97   539   97     0    0   539   97   685M   31G  
> 
> It looks like we cannot have both performance and reliability in the same system :(
> My final conclusion is that with a single disk + SSD, even ZFS does not help to speed up the glusterfs healing.
> I will stop here :)
> 
> 
> 
> 
> On Fri, Mar 3, 2017 at 3:35 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> cd into the pool's mount path,
> then run: dd if=/dev/zero of=test.tt bs=1M
> leave it running for 5-10 minutes,
> then hit ctrl+c and paste the result here.
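> A variation that forces the data to disk before reporting, so the page cache/ARC doesn't inflate the number (the count is arbitrary):
> dd if=/dev/zero of=test.tt bs=1M count=10000 conv=fdatasync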
> 
> 2017-03-03 11:30 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> No, I have one pool made of the single disk, with the SSD as cache and log device.
> I have 3 Glusterfs bricks on 3 separate hosts; volume type Replicate (Arbiter) = replica 2 + 1.
> That is as much as you can push into the compute nodes (they have only 3 disk slots).
> 
> 
> On Fri, Mar 3, 2017 at 3:19 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> OK, you have 3 pools (zclei22, logs and cache); that's wrong. You should have one pool, with the slog and cache attached to it, if you are looking for performance.
> Also, don't mix drives.
> What's the performance issue you are facing?
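> For what it's worth, a single pool with the slog and cache attached would be built along these lines (device names are placeholders):
> zpool create tank /dev/sdX log /dev/sdY1 cache /dev/sdY2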
> 
> 
> regards,
> 
> 2017-03-03 11:00 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> This is CentOS 7.3 with ZoL version 0.6.5.9-1.
> 
> [root@clei22 ~]# lsscsi
> [2:0:0:0]    disk    ATA      INTEL SSDSC2CW24 400i  /dev/sda
> [3:0:0:0]    disk    ATA      HGST HUS724040AL AA70  /dev/sdb
> [4:0:0:0]    disk    ATA      WDC WD2002FYPS-0 1G01  /dev/sdc
> 
> [root@clei22 ~]# pvs; vgs; lvs
>   PV                                                 VG            Fmt  Attr PSize   PFree
>   /dev/mapper/INTEL_SSDSC2CW240A3_CVCV306302RP240CGN vg_cache      lvm2 a--  223.57g     0
>   /dev/sdc2                                          centos_clei22 lvm2 a--    1.82t 64.00m
>   VG            #PV #LV #SN Attr   VSize   VFree
>   centos_clei22   1   3   0 wz--n-   1.82t 64.00m
>   vg_cache        1   2   0 wz--n- 223.57g     0
>   LV       VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
>   home     centos_clei22 -wi-ao----   1.74t
>   root     centos_clei22 -wi-ao----  50.00g
>   swap     centos_clei22 -wi-ao----  31.44g
>   lv_cache vg_cache      -wi-ao---- 213.57g
>   lv_slog  vg_cache      -wi-ao----  10.00g
> 
> [root@clei22 ~]# zpool status -v
>   pool: zclei22
>  state: ONLINE
>   scan: scrub repaired 0 in 0h0m with 0 errors on Tue Feb 28 14:16:07 2017
> config:
> 
>     NAME                                    STATE     READ WRITE CKSUM
>     zclei22                                 ONLINE       0     0     0
>       HGST_HUS724040ALA640_PN2334PBJ4SV6T1  ONLINE       0     0     0
>     logs
>       lv_slog                               ONLINE       0     0     0
>     cache
>       lv_cache                              ONLINE       0     0     0
> 
> errors: No known data errors
> 
> ZFS config:
> 
> [root@clei22 ~]# zfs get all zclei22/01
> NAME        PROPERTY              VALUE                  SOURCE
> zclei22/01  type                  filesystem             -
> zclei22/01  creation              Tue Feb 28 14:06 2017  -
> zclei22/01  used                  389G                   -
> zclei22/01  available             3.13T                  -
> zclei22/01  referenced            389G                   -
> zclei22/01  compressratio         1.01x                  -
> zclei22/01  mounted               yes                    -
> zclei22/01  quota                 none                   default
> zclei22/01  reservation           none                   default
> zclei22/01  recordsize            128K                   local
> zclei22/01  mountpoint            /zclei22/01            default
> zclei22/01  sharenfs              off                    default
> zclei22/01  checksum              on                     default
> zclei22/01  compression           off                    local
> zclei22/01  atime                 on                     default
> zclei22/01  devices               on                     default
> zclei22/01  exec                  on                     default
> zclei22/01  setuid                on                     default
> zclei22/01  readonly              off                    default
> zclei22/01  zoned                 off                    default
> zclei22/01  snapdir               hidden                 default
> zclei22/01  aclinherit            restricted             default
> zclei22/01  canmount              on                     default
> zclei22/01  xattr                 sa                     local
> zclei22/01  copies                1                      default
> zclei22/01  version               5                      -
> zclei22/01  utf8only              off                    -
> zclei22/01  normalization         none                   -
> zclei22/01  casesensitivity       sensitive              -
> zclei22/01  vscan                 off                    default
> zclei22/01  nbmand                off                    default
> zclei22/01  sharesmb              off                    default
> zclei22/01  refquota              none                   default
> zclei22/01  refreservation        none                   default
> zclei22/01  primarycache          metadata               local
> zclei22/01  secondarycache        metadata               local
> zclei22/01  usedbysnapshots       0                      -
> zclei22/01  usedbydataset         389G                   -
> zclei22/01  usedbychildren        0                      -
> zclei22/01  usedbyrefreservation  0                      -
> zclei22/01  logbias               latency                default
> zclei22/01  dedup                 off                    default
> zclei22/01  mlslabel              none                   default
> zclei22/01  sync                  disabled               local
> zclei22/01  refcompressratio      1.01x                  -
> zclei22/01  written               389G                   -
> zclei22/01  logicalused           396G                   -
> zclei22/01  logicalreferenced     396G                   -
> zclei22/01  filesystem_limit      none                   default
> zclei22/01  snapshot_limit        none                   default
> zclei22/01  filesystem_count      none                   default
> zclei22/01  snapshot_count        none                   default
> zclei22/01  snapdev               hidden                 default
> zclei22/01  acltype               off                    default
> zclei22/01  context               none                   default
> zclei22/01  fscontext             none                   default
> zclei22/01  defcontext            none                   default
> zclei22/01  rootcontext           none                   default
> zclei22/01  relatime              off                    default
> zclei22/01  redundant_metadata    all                    default
> zclei22/01  overlay               off                    default
> 
> On Fri, Mar 3, 2017 at 2:52 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> Which operating system version are you using for your ZFS storage?
> Do:
> zfs get all your-pool-name
> and use arc_summary.py from the FreeNAS git repo if you wish.
> 
> 
> 2017-03-03 10:33 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> Pool load:
> [root@clei21 ~]# zpool iostat -v 1
>                                            capacity     operations    bandwidth
> pool                                    alloc   free   read  write   read  write
> --------------------------------------  -----  -----  -----  -----  -----  -----
> zclei21                                 10.1G  3.62T      0    112    823  8.82M
>   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0     46    626  4.40M
> logs                                        -      -      -      -      -      -
>   lv_slog                                225M  9.72G      0     66    198  4.45M
> cache                                       -      -      -      -      -      -
>   lv_cache                              9.81G   204G      0     46     56  4.13M
> --------------------------------------  -----  -----  -----  -----  -----  -----
> 
>                                            capacity     operations    bandwidth
> pool                                    alloc   free   read  write   read  write
> --------------------------------------  -----  -----  -----  -----  -----  -----
> zclei21                                 10.1G  3.62T      0    191      0  12.8M
>   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0      0      0      0
> logs                                        -      -      -      -      -      -
>   lv_slog                                225M  9.72G      0    191      0  12.8M
> cache                                       -      -      -      -      -      -
>   lv_cache                              9.83G   204G      0    218      0  20.0M
> --------------------------------------  -----  -----  -----  -----  -----  -----
> 
>                                            capacity     operations    bandwidth
> pool                                    alloc   free   read  write   read  write
> --------------------------------------  -----  -----  -----  -----  -----  -----
> zclei21                                 10.1G  3.62T      0    191      0  12.7M
>   HGST_HUS724040ALA640_PN2334PBJ52XWT1  10.1G  3.62T      0      0      0      0
> logs                                        -      -      -      -      -      -
>   lv_slog                                225M  9.72G      0    191      0  12.7M
> cache                                       -      -      -      -      -      -
>   lv_cache                              9.83G   204G      0     72      0  7.68M
> --------------------------------------  -----  -----  -----  -----  -----  -----
> 
> 
> On Fri, Mar 3, 2017 at 2:32 PM, Arman Khalatyan <arm2arm at gmail.com> wrote:
> Glusterfs is now in healing mode.
> Receiver:
> [root@clei21 ~]# arcstat.py 1
>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
> 13:24:49     0     0      0     0    0     0    0     0    0   4.6G   31G  
> 13:24:50   154    80     51    80   51     0    0    80   51   4.6G   31G  
> 13:24:51   179    62     34    62   34     0    0    62   42   4.6G   31G  
> 13:24:52   148    68     45    68   45     0    0    68   45   4.6G   31G  
> 13:24:53   140    64     45    64   45     0    0    64   45   4.6G   31G  
> 13:24:54   124    48     38    48   38     0    0    48   38   4.6G   31G  
> 13:24:55   157    80     50    80   50     0    0    80   50   4.7G   31G  
> 13:24:56   202    68     33    68   33     0    0    68   41   4.7G   31G  
> 13:24:57   127    54     42    54   42     0    0    54   42   4.7G   31G  
> 13:24:58   126    50     39    50   39     0    0    50   39   4.7G   31G  
> 13:24:59   116    40     34    40   34     0    0    40   34   4.7G   31G  
> 
> 
> Sender:
> [root@clei22 ~]# arcstat.py 1
>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
> 13:28:37     8     2     25     2   25     0    0     2   25   468M   31G  
> 13:28:38  1.2K   727     62   727   62     0    0   525   54   469M   31G  
> 13:28:39   815   508     62   508   62     0    0   376   55   469M   31G  
> 13:28:40   994   624     62   624   62     0    0   450   54   469M   31G  
> 13:28:41   783   456     58   456   58     0    0   338   50   470M   31G  
> 13:28:42   916   541     59   541   59     0    0   390   50   470M   31G  
> 13:28:43   768   437     56   437   57     0    0   313   48   471M   31G  
> 13:28:44   877   534     60   534   60     0    0   393   53   470M   31G  
> 13:28:45   957   630     65   630   65     0    0   450   57   470M   31G  
> 13:28:46   819   479     58   479   58     0    0   357   51   471M   31G  
> 
> 
> On Thu, Mar 2, 2017 at 7:18 PM, Juan Pablo <pablo.localhost at gmail.com> wrote:
> Hey,
> what are you using for ZFS? Get an ARC status and post it please.
> 
> 
> 2017-03-02 9:57 GMT-03:00 Arman Khalatyan <arm2arm at gmail.com>:
> No,
> ZFS itself is not on top of LVM; only the SSD was split by LVM into a slog (10G) and a cache (the rest).
> But in any case the SSD does not help much under the glusterfs/oVirt load; it has almost 100% cache misses... :( (terrible performance compared with NFS)
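> A split like that would be created roughly as follows (VG/LV names as in the pvs/lvs output above):
> lvcreate -L 10G -n lv_slog vg_cache
> lvcreate -l 100%FREE -n lv_cache vg_cache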
> 
> 
> 
> 
> 
> On Thu, Mar 2, 2017 at 1:47 PM, FERNANDO FREDIANI <fernando.frediani at upx.com> wrote:
> Am I understanding correctly that you have Gluster on top of ZFS, which is on top of LVM? If so, why was LVM necessary? I run ZFS without any need for LVM.
> 
> Fernando
> 
> On 02/03/2017 06:19, Arman Khalatyan wrote:
>> Hi, 
>> I use 3 nodes with zfs and glusterfs.
>> Are there any suggestions to optimize it?
>> 
>> host zfs config 4TB-HDD+250GB-SSD:
>> [root@clei22 ~]# zpool status
>>   pool: zclei22
>>  state: ONLINE
>>   scan: scrub repaired 0 in 0h0m with 0 errors on Tue Feb 28 14:16:07 2017
>> config:
>> 
>>     NAME                                    STATE     READ WRITE CKSUM
>>     zclei22                                 ONLINE       0     0     0
>>       HGST_HUS724040ALA640_PN2334PBJ4SV6T1  ONLINE       0     0     0
>>     logs
>>       lv_slog                               ONLINE       0     0     0
>>     cache
>>       lv_cache                              ONLINE       0     0     0
>> 
>> errors: No known data errors
>> 
>> Name:                    GluReplica
>> Volume ID:               ee686dfe-203a-4caa-a691-26353460cc48
>> Volume Type:             Replicate (Arbiter)
>> Replica Count:           2 + 1
>> Number of Bricks:        3
>> Transport Types:         TCP, RDMA
>> Maximum no of snapshots: 256
>> Capacity:                3.51 TiB total, 190.56 GiB used, 3.33 TiB free
> 
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
