[ovirt-users] Re: Sometimes paused due to unknown storage error on gluster

9 Apr 2020

On Thu, Apr 9, 2020 at 7:46 AM Krutika Dhananjay <kdhananj@redhat.com>
wrote:
...
On Tue, Apr 7, 2020 at 7:36 PM Gianluca Cecchi <gianluca.cecchi@gmail.com>
wrote:
...
OK. So I set the log level to at least INFO on all subsystems and tried a
redeploy of OpenShift with 3 master nodes and 7 worker nodes.
One worker hit the error and its VM went into paused mode:
Apr 7, 2020, 3:27:28 PM VM worker-6 has been paused due to unknown
storage error.
The VM has only one 100Gb virtual disk, on the gluster volume named vmstore.
Below are all the logs around that time at the different layers.
Let me know if you need another log file not yet considered.
From what I see, the matching error is found in
- rhev-data-center-mnt-glusterSD-ovirtst.mydomain.storage:_vmstore.log
[2020-04-07 13:27:28.721262] E [MSGID: 133010]
[shard.c:2327:shard_common_lookup_shards_cbk] 0-vmstore-shard: Lookup on
shard 523 failed. Base file gfid = d22530cf-2e50-4059-8924-0aafe38497b1 [No
such file or directory]
[2020-04-07 13:27:28.721432] W [fuse-bridge.c:2918:fuse_writev_cbk]
0-glusterfs-fuse: 4435189: WRITE => -1
gfid=d22530cf-2e50-4059-8924-0aafe38497b1 fd=0x7f3c4c07ab38 (No such file
or directory)
This ^^, right here is the reason the VM paused. Are you using a plain
distribute volume here?
Can you share some of the log messages that occur right above these errors?
Also, can you check if the file
$VMSTORE_BRICKPATH/.glusterfs/d2/25/d22530cf-2e50-4059-8924-0aafe38497b1
exists on the brick?
-Krutika
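
For reference, the brick path asked about above can be derived from the gfid
alone: the first two and the next two hex characters of the gfid become the
two directory levels under .glusterfs. The following is a minimal sketch, not
a gluster tool. The brick root is the one that appears later in this thread,
the function names are hypothetical, and the .shard/<gfid>.<N> location for
an individual shard is an assumption based on the usual sharding layout that
should be verified on the brick.

```python
import os
from pathlib import PurePosixPath


def gfid_backend_path(brick_root: str, gfid: str) -> PurePosixPath:
    """Build <brick>/.glusterfs/<aa>/<bb>/<gfid> from a gfid string."""
    gfid = gfid.strip().lower()
    return PurePosixPath(brick_root) / ".glusterfs" / gfid[:2] / gfid[2:4] / gfid


def shard_backend_path(brick_root: str, gfid: str, shard: int) -> PurePosixPath:
    """ASSUMPTION: shards of a base file live in <brick>/.shard/<gfid>.<N>."""
    return PurePosixPath(brick_root) / ".shard" / f"{gfid.strip().lower()}.{shard}"


if __name__ == "__main__":
    brick = "/gluster_bricks/vmstore/vmstore"  # brick root seen in this thread
    gfid = "d22530cf-2e50-4059-8924-0aafe38497b1"
    for path in (gfid_backend_path(brick, gfid),
                 shard_backend_path(brick, gfid, 523)):
        # lexists() also reports dangling symlinks as present
        state = "exists" if os.path.lexists(str(path)) else "missing"
        print(f"{path}: {state}")
```

Run on the host that carries the brick, this only reports whether the two
paths are present; it does not modify anything.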
Thanks for answering, Krutika.
To verify whether sharding was in some way involved in the problem, I ran a
new redeploy of the 9 OpenShift (OCP) servers on a volume without sharding,
and indeed got no errors, whereas with sharding enabled I got at least 3-4
errors on every deployment run.
In particular, I deleted the disks of the previous VMs so that the new ones
would be placed on a volume without sharding.
As a result, the directory is now empty:

[root@ovirt ~]# ll -a /gluster_bricks/vmstore/vmstore/.glusterfs/d2/25/
total 8
drwx------.   2 root root    6 Apr  8 16:59 .
drwx------. 105 root root 8192 Apr  9 00:50 ..
[root@ovirt ~]#

Here you can find the entire log (in gzip format) from [2020-04-05
01:20:02.978429] to [2020-04-09 10:45:36.734079] of the vmstore volume
https://drive.google.com/file/d/1Dqr7KJMqKdMFg-jvhsDAzvr1xgWtvtnQ/view?usp=s...
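
To pull every shard lookup failure out of a mount log like this one, a small
script along the following lines may help. It is only a sketch: it assumes
each log record sits on a single line (the records quoted earlier in this
thread are wrapped by the mail client), it matches only the MSGID 133010
message shown above, and the file name in the usage part is a placeholder.

```python
import gzip
import re

# Matches the "Lookup on shard N failed" error record quoted in this thread.
SHARD_ERR = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\] E \[MSGID: 133010\].*?"
    r"Lookup on shard (?P<shard>\d+) failed\. "
    r"Base file gfid = (?P<gfid>[0-9a-f-]{36})"
)


def find_shard_lookup_failures(lines):
    """Return (utc_timestamp, shard_number, base_gfid) for each matching line."""
    hits = []
    for line in lines:
        m = SHARD_ERR.search(line)
        if m:
            hits.append((m.group("ts"), int(m.group("shard")), m.group("gfid")))
    return hits


if __name__ == "__main__":
    import sys
    # Placeholder path: pass the downloaded gzip log as the first argument.
    log_path = sys.argv[1] if len(sys.argv) > 1 else "vmstore.log.gz"
    with gzip.open(log_path, "rt", errors="replace") as fh:
        for ts, shard, gfid in find_shard_lookup_failures(fh):
            print(ts, shard, gfid)
```

The timestamps it prints are the UTC ones from the log file, not the engine
event times.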

You will find the same error at least at the timestamps below, which
correspond to the engine webadmin events "unknown storage error". Note that
times inside the log file are in UTC, so you have to shift back 2 hours
(03:27:28 PM in the engine webadmin event corresponds to 13:27:28 in the log
file):

Apr 7, 2020, 3:27:28 PM
Apr 7, 2020, 4:38:55 PM
Apr 7, 2020, 5:31:02 PM
Apr 8, 2020, 8:52:49 AM
Apr 8, 2020, 12:05:17 PM
Apr 8, 2020, 3:11:10 PM
Apr 8, 2020, 3:20:30 PM
Apr 8, 2020, 3:26:54 PM
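
The 2-hour shift can be applied mechanically to the event times listed above.
This is a minimal sketch that hard-codes the fixed 2-hour offset stated in
this message rather than doing a real timezone lookup; the function name is
made up for illustration, and parsing the month abbreviation assumes an
English locale.

```python
from datetime import datetime, timedelta

# Fixed offset taken from this thread: engine event time = UTC + 2 hours.
ENGINE_UTC_OFFSET = timedelta(hours=2)
ENGINE_FORMAT = "%b %d, %Y, %I:%M:%S %p"  # e.g. "Apr 7, 2020, 3:27:28 PM"


def engine_event_to_log_time(event_time: str) -> str:
    """Convert an engine webadmin event time to the UTC form used in the log."""
    local = datetime.strptime(event_time, ENGINE_FORMAT)
    return (local - ENGINE_UTC_OFFSET).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    events = [
        "Apr 7, 2020, 3:27:28 PM",
        "Apr 7, 2020, 4:38:55 PM",
        "Apr 7, 2020, 5:31:02 PM",
        "Apr 8, 2020, 8:52:49 AM",
        "Apr 8, 2020, 12:05:17 PM",
        "Apr 8, 2020, 3:11:10 PM",
        "Apr 8, 2020, 3:20:30 PM",
        "Apr 8, 2020, 3:26:54 PM",
    ]
    for event in events:
        print(f"{event} -> {engine_event_to_log_time(event)}")
```

The converted values can then be searched for directly in the log file.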

Thanks again. I'm available to retry on a sharding-enabled volume after
changing whatever settings are needed.
Gianluca