On April 7, 2020 2:21:53 AM GMT+03:00, Gianluca Cecchi <gianluca.cecchi(a)gmail.com> wrote:
On Mon, Apr 6, 2020 at 7:15 PM Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
[snip]
>
> Hi Gianluca,
>
> Actually the situation is just like Ceph & OpenStack...
> You have OpenStack (in our case oVirt) that can manage basic tasks
> with the storage, but many administrators do not rely on the UI for
> complex tasks.
>
Hi Strahil, thanks for your answers.
Actually, here it is the basic Gluster brick setup steps that are missing, while only the more complex tasks are apparently enabled at GUI level...
>
> In order to properly run HCI, some Gluster knowledge is "mandatory"
> (personal opinion - you will never find that word anywhere :) ).
> In your case, you need:
>
> 1. Blacklist the disks in multipath.conf. As it is managed by VDSM,
> you need to put a special comment '# VDSM PRIVATE' (without the quotes!)
> in order to prevent VDSM from modifying it. I don't know if this is the
> best approach, yet it works for me.
>
Actually, when you complete the initial supported GUI-based HCI setup, it doesn't blacklist anything in multipath.conf and it doesn't mark the file as private.
So I would like to avoid that; I don't think it should be necessary.
The only blacklist section inside the setup-generated file is:
blacklist {
    protocol "(scsi:adt|scsi:sbp)"
}
In the HCI single-host setup you give the GUI the whole disks' names: in my case they were /dev/nvme0n1 (for the engine and data bricks/volumes) and /dev/nvme1n1 (for vmstore), all as JBOD.
And the final configuration takes an approach similar to yours, resembling the Red Hat Gluster Storage link I sent:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5...
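For reference, if I were to follow your blacklisting approach, I suppose the resulting multipath.conf would look roughly like this (the devnode regex is only an example for my NVMe disks, and I understand the '# VDSM PRIVATE' comment is usually placed right after the existing '# VDSM REVISION' header so that VDSM leaves the file alone):

# VDSM REVISION 1.8
# VDSM PRIVATE

blacklist {
    protocol "(scsi:adt|scsi:sbp)"
    devnode "^nvme[0-9]+n[0-9]+"
}

But again, since the setup itself didn't do this, I'd first like to understand whether it is really needed.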
> 2. Create a VDO (skip if not needed)
>
I didn't check it during the initial setup, so it was skipped.
> 3. Create PV from the VDO/disk/array
>
Yes, the setup created PVs, but not directly on /dev/nvme0n1 and /dev/nvme1n1: it created them on their multipath counterparts...
On my system, after setup, I have this for the two disks dedicated to Gluster volumes:
[root@ovirt ~]# multipath -l
nvme.8086-50484b53373530353031325233373541474e-494e54454c205353 dm-5 NVME,INTEL SSDPED1K375GA
size=349G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 0:0:1:0 nvme0n1 259:0 active undef running
eui.01000000010000005cd2e4e359284f51 dm-7 NVME,INTEL SSDPE2KX010T7
size=932G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  `- 2:0:1:0 nvme1n1 259:2 active undef running
[root@ovirt ~]#
So there are two possibilities:
- the setup workflow did something wrong and it should have blacklisted the disks
- it is correct that the multipath devices are in place, with the PVs created on top of them
I don't know which one is correct. Can anyone confirm the expected configuration after the initial setup?
The Gluster Storage guide says that in my case I should run:
pvcreate --dataalignment 256K multipath_device
NOTE: 256K is the value specified in the Gluster Storage guide for JBOD.
It seems confirmed by the existing PVs:
[root@ovirt ~]# pvs -o +pe_start /dev/mapper/eui.01000000010000005cd2e4e359284f51
  PV                                               VG                 Fmt  Attr PSize   PFree 1st PE
  /dev/mapper/eui.01000000010000005cd2e4e359284f51 gluster_vg_nvme1n1 lvm2 a--  931.51g    0  256.00k
[root@ovirt ~]#
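So for the new disk I'm going to add, I suppose the equivalent would simply be the following (the placeholder stands for whatever map name "multipath -l" reports for it):

pvcreate --dataalignment 256K /dev/mapper/NEW_MULTIPATH_DEVICE
pvs -o +pe_start /dev/mapper/NEW_MULTIPATH_DEVICE    # verify that 1st PE is 256.00k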
> 4. Either add to an existing VG or create a new one
>
Yes, the setup created two VGs:
gluster_vg_nvme0n1 on the first multipath device
gluster_vg_nvme1n1 on the second multipath device
Just to confirm, I re-created a very similar setup (the only difference being that I used a single disk for all 3 Gluster volumes and one disk for the operating system) inside this oVirt installation, as a nested environment.
Here the disk to configure for Gluster in the HCI single-host setup is /dev/sdb, and the final result after reboot is below.
Note the "n" (for nested) in front of the host name, which is not the same as before.
[root@novirt ~]# multipath -l
0QEMU_QEMU_HARDDISK_4daa576b-2020-4747-b dm-5 QEMU ,QEMU HARDDISK
size=150G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
`- 2:0:0:1 sdb 8:16 active undef running
[root@novirt ~]#
[root@novirt ~]# vgs
  VG             #PV #LV #SN Attr   VSize    VFree
  gluster_vg_sdb   1   4   0 wz--n- <150.00g      0
  onn_novirt       1  11   0 wz--n-  <99.00g <17.88g
[root@novirt ~]#
[root@novirt ~]# lvs gluster_vg_sdb
  LV                              VG             Attr       LSize   Pool                            Origin Data%  Meta%  Move Log Cpy%Sync Convert
  gluster_lv_data                 gluster_vg_sdb Vwi-aot--- 500.00g gluster_thinpool_gluster_vg_sdb        0.05
  gluster_lv_engine               gluster_vg_sdb -wi-ao---- 100.00g
  gluster_lv_vmstore              gluster_vg_sdb Vwi-aot--- 500.00g gluster_thinpool_gluster_vg_sdb        0.56
  gluster_thinpool_gluster_vg_sdb gluster_vg_sdb twi-aot--- <44.00g                                        6.98   0.54
[root@novirt ~]#
So this confirms that the setup creates PVs on top of a multipath device (even if it is composed of only one path).
I don't know whether the approach would have been different with a 3-host HCI setup... anyone chiming in?
So, for JBOD, I should simply execute (more considerations on PE size for the RAID scenarios are in the Red Hat Gluster Storage admin guide):
vgcreate VG_NAME multipath_device
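With concrete, purely illustrative names, and given that your step 4 also allows extending an existing VG, I suppose it would be one of:

vgcreate gluster_vg_newdisk /dev/mapper/NEW_MULTIPATH_DEVICE       # dedicated VG for the new brick
vgextend gluster_vg_nvme1n1 /dev/mapper/NEW_MULTIPATH_DEVICE       # or grow an existing VG instead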
> 5. Create a thin LVM pool and thin LV (if you want gluster-level
> snapshots). I use this approach to snapshot my HostedEngine VM. For
> details, I can tell you in a separate thread.
>
It seems the setup also creates thin LVs (apart from the engine domain, as the manual says).
Coming back to the physical environment and concentrating on the vmstore volume, I indeed have:
[root@ovirt ~]# lvs gluster_vg_nvme1n1
  LV                                  VG                 Attr       LSize   Pool                                Origin Data%  Meta%  Move Log Cpy%Sync Convert
  gluster_lv_vmstore                  gluster_vg_nvme1n1 Vwi-aot--- 930.00g gluster_thinpool_gluster_vg_nvme1n1        48.67
  gluster_thinpool_gluster_vg_nvme1n1 gluster_vg_nvme1n1 twi-aot--- 921.51g                                            49.12  1.46
[root@ovirt ~]#
In my case it seems I can execute:
lvcreate --thin VG_NAME/POOL_NAME --extents 100%FREE --chunksize CHUNKSIZE --poolmetadatasize METASIZE --zero n
The docs recommend creating the pool metadata device at the maximum possible size, which is 16 GiB. As my disk is 4 TB, I think that is ok for maximum safety.
Also, for JBOD the chunk size has to be 256K.
So my commands are:
lvcreate --thin VG_NAME/POOL_NAME --extents 100%FREE --chunksize 256k --poolmetadatasize 16G --zero n
and, supposing an overprovisioning of 25%:
lvcreate --thin --name LV_NAME --virtualsize 5T VG_NAME/POOL_NAME
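With illustrative names (mirroring the gluster_vg_* / gluster_thinpool_* / gluster_lv_* convention the setup uses), I suppose the sequence becomes:

lvcreate --thin gluster_vg_newdisk/gluster_thinpool_gluster_vg_newdisk --extents 100%FREE --chunksize 256k --poolmetadatasize 16G --zero n
lvcreate --thin --name gluster_lv_vmstore2 --virtualsize 5T gluster_vg_newdisk/gluster_thinpool_gluster_vg_newdisk
lvs gluster_vg_newdisk    # check the pool size, Data% and Meta%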
> 6. Create an XFS filesystem and define it either in fstab or in a systemd
> unit (the second option is better as you can define dependencies). I would
> recommend you to use these options:
>
> noatime,nodiratime,context="system_u:object_r:glusterd_brick_t:s0"
>
> Keep the quotes and mount the brick on all nodes.
>
>
Going to my system, I see this in fstab for the 3 existing bricks:
UUID=fa5dd3cb-aeef-470e-b982-432ac896d87a /gluster_bricks/engine  xfs inode64,noatime,nodiratime 0 0
UUID=43bed7de-66b1-491d-8055-5b4ef9b0482f /gluster_bricks/data    xfs inode64,noatime,nodiratime 0 0
UUID=b81a491c-0a4c-4c11-89d8-9db7fe82888e /gluster_bricks/vmstore xfs inode64,noatime,nodiratime 0 0
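About the systemd unit alternative you suggest, I suppose a minimal mount unit for a new brick would look something like this (all names are illustrative), e.g. /etc/systemd/system/gluster_bricks-vmstore2.mount:

[Unit]
Description=Gluster brick for a new vmstore2 volume
Before=glusterd.service

[Mount]
What=/dev/mapper/gluster_vg_newdisk-gluster_lv_vmstore2
Where=/gluster_bricks/vmstore2
Type=xfs
# no commas inside the SELinux label, so the quotes used in the fstab form should not be needed here
Options=inode64,noatime,nodiratime,context=system_u:object_r:glusterd_brick_t:s0

[Install]
WantedBy=multi-user.target

followed by "systemctl daemon-reload" and "systemctl enable --now gluster_bricks-vmstore2.mount". The unit file name has to match the Where= path, with "/" turned into "-".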
and, for the biggest one:
[root@ovirt ~]# xfs_admin -lu /dev/mapper/gluster_vg_nvme1n1-gluster_lv_vmstore
label = ""
UUID = b81a491c-0a4c-4c11-89d8-9db7fe82888e
[root@ovirt ~]#
[root@ovirt ~]# xfs_info /gluster_bricks/vmstore
meta-data=/dev/mapper/gluster_vg_nvme1n1-gluster_lv_vmstore isize=512    agcount=32, agsize=7618528 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=243792896, imaxpct=25
         =                       sunit=32     swidth=64 blks
naming   =version 2              bsize=8192   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=119040, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@ovirt ~]#
This confirms the recommendations in the Gluster Storage admin guide:
- inode size of 512 bytes
- for RAID 10 and JBOD, the -d su=<>,sw=<> option can be omitted; by default, XFS will use the thin-p chunk size and other parameters to make layout decisions
- logical block size for directories of 8192
So the final command is:
mkfs.xfs -i size=512 -n size=8192 /dev/VG_NAME/LV_NAME
and then get its UUID with the xfs_admin command above, to put in fstab.
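Putting it together for the hypothetical new brick (names are again only illustrative):

mkfs.xfs -i size=512 -n size=8192 /dev/gluster_vg_newdisk/gluster_lv_vmstore2
mkdir -p /gluster_bricks/vmstore2
xfs_admin -u /dev/gluster_vg_newdisk/gluster_lv_vmstore2    # prints the UUID for fstab

and then, if going the fstab route with your suggested options, a line of the form:

UUID=<uuid printed above> /gluster_bricks/vmstore2 xfs inode64,noatime,nodiratime,context="system_u:object_r:glusterd_brick_t:s0" 0 0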
> I assumed that you are adding bricks on the same HCI nodes, but that
> could be a bad assumption. If not, you will need to extend the storage
> pool and then to create your volume.
>
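(For completeness, I suppose that in my single-node case the volume creation itself would be something along these lines, with the volume name, brick path and hostname being purely illustrative:)

gluster volume create vmstore2 ovirt.mydomain:/gluster_bricks/vmstore2/vmstore2
gluster volume set vmstore2 group virt
gluster volume set vmstore2 storage.owner-uid 36
gluster volume set vmstore2 storage.owner-gid 36
gluster volume start vmstore2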
> 7. Last, create a storage domain via the API or the UI.
>
OK, this should be the hopefully easy part, in the webadmin GUI. Let's see.
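If I ever want to script this step instead of using webadmin, I suppose the REST API call would be roughly as follows (engine URL, credentials and names are placeholders, to be double-checked against the API documentation), and afterwards the new domain still has to be attached to the data center:

curl -s -k -u admin@internal:PASSWORD \
  -H "Content-Type: application/xml" \
  -X POST https://engine.mydomain/ovirt-engine/api/storagedomains \
  -d '<storage_domain>
        <name>vmstore2</name>
        <type>data</type>
        <host><name>ovirt</name></host>
        <storage>
          <type>glusterfs</type>
          <address>ovirt.mydomain</address>
          <path>/vmstore2</path>
        </storage>
      </storage_domain>'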
>
> In the end you can use storage migration (if you are not using qemu's
> libgfapi integration) to utilize the new storage without any downtime.
>
>
> P.S.: Documentation contributions are welcomed and if I have some time
> - I will be able to add some of my experience :)
>
>
> Best Regards,
> Strahil Nikolov
>
>
Thank you very much Strahil for your input.
I'm going to test on my nested oVirt first, adding a disk to it, and then on the physical one.
Comments welcome.
On the physical host I then have a device naming problem: the newly inserted disk has taken the name of a previous one and, strangely, there is a conflict when creating the VG, even though LVM2 entities have their own UUIDs. But for this particular problem I'm going to open a separate thread.
Gianluca
Hey Gianluca,
Let me clarify the multipath story.
In your case there is a single path (because we don't use a SAN), and it is not normal to
keep local disks in multipath.conf...
Many monitoring scripts would raise an alert in such a case, so best practice is to
blacklist local devices. I'm not sure if this can be done from the engine's UI
(blacklisting local disks), but it's worth checking.
About the PV part: you have to use the multipath device, if it exists, as it is one layer
above the SCSI devices. In your case, you won't be able to use the block device directly,
even if you wanted to.
But as I said, it's crazy to keep local devices in multipath.
Best Regards,
Strahil Nikolov