On Wed, Apr 8, 2020 at 6:00 PM Strahil Nikolov <hunter86_bg(a)yahoo.com>
wrote:
On April 8, 2020 2:43:01 PM GMT+03:00, Gianluca Cecchi <
gianluca.cecchi(a)gmail.com> wrote:
>On Tue, Apr 7, 2020 at 8:16 PM Strahil Nikolov <hunter86_bg(a)yahoo.com>
>wrote:
>
>Hi Gianluca,
>>
>>
>> The positive thing is that you can reproduce the issue.
>>
>> I would ask you to check your gluster version and if there are any
>> updates - update the cluster.
>>
>
>I'd prefer to stick on oVirt release version of Gluster if possible
This is what I meant, but if you are using Gluster v6.0 you should
update to the latest version.
I deployed oVirt 4.3.9 using ovirt-node-ng iso image and using cockpit
based GUI wizard for HCI one node.
And on the ovirt-node-ng system the version was the latest available in the
stable 4.3 oVirt repos: glusterfs-6.8-1.el7.x86_64
The same holds on a plain CentOS oVirt host of another environment, fully
updated: the version of the Gluster-related packages is 6.8-1.el7 and no
update is proposed.
So I think I'm ok with that.
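(For reference, this is roughly how I check it on a node; a minimal sketch,
nothing oVirt-specific assumed:)

# list the installed gluster packages and ask the enabled repos for updates
rpm -qa 'glusterfs*' | sort
yum check-update 'glusterfs*'   # exit code 100 would mean updates exist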
Also check Gluster's op-version, as this limits some of the features.
>>
>
>What do you mean by this?
>
Check this one:
https://docs.gluster.org/en/latest/Upgrade-Guide/op_version/
Ah, ok.
In my case I have a single-host environment, so it is not critical, and no
upgrade has taken place yet since the original 4.3.9 deploy. I have:
[root@ovirt images]# gluster volume get all cluster.op-version
Option                                  Value
------                                  -----
cluster.op-version                      60000
[root@ovirt images]#
[root@ovirt images]# gluster volume get all cluster.max-op-version
Option                                  Value
------                                  -----
cluster.max-op-version                  60000
[root@ovirt images]#
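(For completeness, if max-op-version were ever higher than op-version after a
package upgrade, bumping it should just be, as far as I understand from the
page above:)

# check the highest op-version all nodes support, then raise to it
gluster volume get all cluster.max-op-version
gluster volume set all cluster.op-version 60000   # value from the line above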
>Ok. I think that the INFO level set on the different layers outlined a
>problem somehow related to sharding.
>
>Related to this, I found no official docs on the Gluster web site after the
>3.7 version... where are they?
>Only information I found was in Red Hat Gluster Storage 3.5
>Administration
>Guide, but I would expect something more upstream...
Red Hat versioning is different. So far RH documentation for Gluster has
never failed me.
I also hold Red Hat documentation in high regard.
But in the end the product is upstream, and I can't find anything in the
upstream documentation related to sharding (only from the old 3.7 days...).
Why? Or, if the fault is mine, where is the doc?
Otherwise this makes me worry about the stability of the feature....
>In particular in my case, where I have only one host and the gluster
>volumes are single-brick based, do you think I can try to disable sharding
>and verify whether using new disks with it disabled, and oVirt thin
>provisioned disks, makes the problem go away?
>
>Also, I found some information about sharding block size.
>Apparently the only supported size on Red Hat Gluster Storage is 512MB,
>but oVirt sets it to 64MB...?
>I also found a bugzilla about passing from 128MB to 64MB in oVirt
>4.1.5:
>https://bugzilla.redhat.com/show_bug.cgi?id=1469436
>
>Now I see that by default and so also in my environment I have:
>
>features.shard on
>
>features.shard-block-size 64MB
>
>features.shard-lru-limit 16384
>
>features.shard-deletion-rate 100
>
NEVER DISABLE SHARDING ON A VOLUME!!!
There is a very good reply from Amar Tumballi on the gluster-users mailing
list about that.
The good thing is that you can do storage migration.
Remember: I have only one node and only one brick per volume, so I don't
think it gives me any real benefit.
If I go and add another node I'll take care to change this, but it is not an
immediate thing.
It is a test lab.
The only benefits of sharding are:
1. When using a distributed-replicated volume, shards will spread across
different bricks and thus partially increase read performance
Not needed in my environment
2. When a heal is going on, only the affected shard is locked, so there will
be no disruption for the VM
No healing in my environment, because it is distributed (even if I actually
have only one brick per volume).
In both cases you cannot benefit from that.
Shard size is 64 MB and this is the default from Gluster, not from oVirt.
Maybe there is a reason behind that large shard size, but I have no clue.
Where do you see that the default is 64MB? Which Gluster? Upstream 6.8, or
set at build time by the team creating the ovirt-4.3-centos-gluster6 repos,
or what?
Because if you read the Red Hat Gluster Storage documentation (which you
referred to as the "master"), it says that 512MB is the default and
apparently the only value allowed in their product....
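(At least the effective value on my node can be confirmed directly; "data" is
just a placeholder volume name here:)

gluster volume get data features.shard-block-size
# on my volume this reports 64MB, matching the listing above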
As I already linked:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5...
"
features.shard-block-size
Specifies the maximum size of the file pieces when sharding is enabled.
The supported value for this parameter is 512MB.
"
And also in the table here:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5...
you can see that both Allowed and Default values are 512MB
I don't know whether the Red Hat Gluster Storage 3.5 product is based on
version 6.8 or not; how can I check?
And, lacking the upstream documentation about sharding values, I cannot
compare against that....
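(One way to check, I suppose, would be on an actual RHGS node, assuming the
downstream packages embed the upstream base version in their NVR, e.g.
something like glusterfs-6.0-x.el7rhgs:)

rpm -q glusterfs   # the first number should be the upstream base version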
But as the RHV 4.3 product guide refers directly to the Red Hat Gluster
Storage docs... I imagine what oVirt provides is in the same version line....
In your case, you should consider using VDO, as most of the VMs have almost
the same system files, which will lead to data reduction at a small cost in
write speed.
OK, thanks
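(If I try that, I imagine it would look roughly like this, assuming the EL7
'vdo' manager CLI and a hypothetical spare device /dev/sdb:)

# dedup/compression layer under a future gluster brick
vdo create --name=vdo_gluster --device=/dev/sdb --vdoLogicalSize=4T
mkfs.xfs -K /dev/mapper/vdo_gluster   # -K skips discards at mkfs time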
Another option is to create a new volume and disable sharding entirely.
In fact I created another 4 TB volume and disabled sharding on it.
Then today I deployed the OpenShift OCP installation again, using 3 master
nodes + 7 worker nodes, each with a 120 GB disk (thin provisioned), and the
installation completed without any unknown storage error.
Before, I had tried the same installation 3 times, and every time at least
4-5 nodes went into a paused state during the installation phase, while the
ignition process ran from the bootstrap node....
This, plus the gfid errors I provided, seems to confirm my idea.
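(Roughly what I did, with illustrative names; turning sharding off is safe
here only because the volume is brand new and still empty:)

gluster volume create data_noshard ovirt:/gluster_bricks/data_noshard/brick
# 'group virt' applies the oVirt-recommended options, which include
# features.shard=on, hence the explicit off right after
gluster volume set data_noshard group virt
gluster volume set data_noshard features.shard off
gluster volume start data_noshard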
I have one question: you have a single brick in your volume. Why do you
use oVirt instead of plain KVM?
In the future it can grow to more than one node. Also, I wanted to use /
test a supported single-node configuration for oVirt.
And many of the Ansible scripts used take advantage of the oVirt modules:
https://docs.ansible.com/ansible/latest/modules/ovirt_module.html
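(For example, a minimal playbook sketch with the ovirt_auth / ovirt_vm
modules; the engine URL, credentials and VM name are all placeholders:)

---
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Obtain an oVirt SSO token
      ovirt_auth:
        url: https://engine.example.com/ovirt-engine/api
        username: admin@internal
        password: "{{ engine_password }}"
        insecure: true

    - name: Make sure the VM is running
      ovirt_vm:
        auth: "{{ ovirt_auth }}"
        name: myvm
        state: running

    - name: Revoke the SSO token
      ovirt_auth:
        state: absent
        ovirt_auth: "{{ ovirt_auth }}"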
Best Regards,
Strahil Nikolov
Cheers,
Gianluca