On Thu, Apr 9, 2020 at 7:46 AM Krutika Dhananjay <kdhananj@redhat.com> wrote:


On Tue, Apr 7, 2020 at 7:36 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:

OK. So I set the log level to at least INFO on all subsystems and tried a redeploy of OpenShift with 3 master nodes and 7 worker nodes.
One worker got the error and its VM went into paused mode:

Apr 7, 2020, 3:27:28 PM VM worker-6 has been paused due to unknown storage error.

The VM has only one 100Gb virtual disk, on the gluster volume named vmstore.


Below are all the logs from around that time at the different layers.
Let me know if you need another log file not yet considered.

From what I see, the matching error is found in

- rhev-data-center-mnt-glusterSD-ovirtst.mydomain.storage:_vmstore.log

[2020-04-07 13:27:28.721262] E [MSGID: 133010] [shard.c:2327:shard_common_lookup_shards_cbk] 0-vmstore-shard: Lookup on shard 523 failed. Base file gfid = d22530cf-2e50-4059-8924-0aafe38497b1 [No such file or directory]
[2020-04-07 13:27:28.721432] W [fuse-bridge.c:2918:fuse_writev_cbk] 0-glusterfs-fuse: 4435189: WRITE => -1 gfid=d22530cf-2e50-4059-8924-0aafe38497b1 fd=0x7f3c4c07ab38 (No such file or directory)


This ^^, right here is the reason the VM paused. Are you using a plain distribute volume here?
Can you share some of the log messages that occur right above these errors?
Also, can you check if the file $VMSTORE_BRICKPATH/.glusterfs/d2/25/d22530cf-2e50-4059-8924-0aafe38497b1 exists on the brick?

-Krutika



Thanks for answering, Krutika.
To verify that sharding was in some way "involved" in the problem, I ran a new redeploy of the 9 OpenShift OCP servers, and indeed received no errors.
With sharding enabled, by contrast, I received at least 3-4 errors on every deployment run.
In particular, I deleted the disks of the previous VMs in order to put them on a volume without sharding.
As a result, the directory is now empty:

[root@ovirt ~]# ll -a /gluster_bricks/vmstore/vmstore/.glusterfs/d2/25/
total 8
drwx------.   2 root root    6 Apr  8 16:59 .
drwx------. 105 root root 8192 Apr  9 00:50 ..
[root@ovirt ~]#
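
For reference, here is a minimal sketch of how the paths involved can be derived from the gfid in the error message. The .glusterfs layout (first two hex characters, next two, then the full gfid) matches the path Krutika asked about. The .shard/<gfid>.<index> naming is my understanding of how the shard translator stores shards on a brick, so treat it as an assumption. The helper names are hypothetical, and the brick root is the one from the listing above.

```python
from pathlib import PurePosixPath


def gfid_backend_path(brick_root: str, gfid: str) -> PurePosixPath:
    """Build <brick>/.glusterfs/<aa>/<bb>/<gfid> for a given gfid."""
    gfid = gfid.lower()
    return PurePosixPath(brick_root) / ".glusterfs" / gfid[:2] / gfid[2:4] / gfid


def shard_backend_path(brick_root: str, base_gfid: str, index: int) -> PurePosixPath:
    """Build <brick>/.shard/<base gfid>.<index> (assumed shard naming)."""
    return PurePosixPath(brick_root) / ".shard" / f"{base_gfid.lower()}.{index}"


if __name__ == "__main__":
    brick = "/gluster_bricks/vmstore/vmstore"       # brick root from the listing
    gfid = "d22530cf-2e50-4059-8924-0aafe38497b1"   # base file gfid from the log
    print(gfid_backend_path(brick, gfid))
    print(shard_backend_path(brick, gfid, 523))     # shard number from the log
```

The printed paths can then be checked with ls on each brick. On a volume with more than one brick, a given shard may live on a different brick than the base file.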

Here you can find the entire log (in gzip format) from [2020-04-05 01:20:02.978429] to [2020-04-09 10:45:36.734079] of the vmstore volume
https://drive.google.com/file/d/1Dqr7KJMqKdMFg-jvhsDAzvr1xgWtvtnQ/view?usp=sharing

You will find the same error at least at the timestamps below, which correspond to the "unknown storage error" events in the engine webadmin. Note that the times inside the log file are in UTC, so you have to shift the engine times back by 2 hours (3:27:28 PM in the engine webadmin event corresponds to 13:27:28 in the log file):

Apr 7, 2020, 3:27:28 PM

Apr 7, 2020, 4:38:55 PM

Apr 7, 2020, 5:31:02 PM

Apr 8, 2020, 8:52:49 AM

Apr 8, 2020, 12:05:17 PM

Apr 8, 2020, 3:11:10 PM

Apr 8, 2020, 3:20:30 PM

Apr 8, 2020, 3:26:54 PM
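
To make the search in the log file easier, here is a small sketch that converts the engine event times above into the UTC timestamps used in the gluster log. It assumes a fixed 2-hour offset, as described above, and English month/AM-PM names in the default locale. The function name and constants are only illustrative.

```python
from datetime import datetime, timedelta

# Assumption: engine webadmin times are 2 hours ahead of the UTC log times.
ENGINE_UTC_OFFSET = timedelta(hours=2)
ENGINE_FORMAT = "%b %d, %Y, %I:%M:%S %p"   # e.g. "Apr 7, 2020, 3:27:28 PM"
LOG_FORMAT = "%Y-%m-%d %H:%M:%S"           # prefix of the gluster log timestamps


def engine_to_log_time(event_time: str) -> str:
    """Convert an engine event time to the matching log timestamp prefix."""
    local = datetime.strptime(event_time, ENGINE_FORMAT)
    return (local - ENGINE_UTC_OFFSET).strftime(LOG_FORMAT)


if __name__ == "__main__":
    events = [
        "Apr 7, 2020, 3:27:28 PM",
        "Apr 7, 2020, 4:38:55 PM",
        "Apr 7, 2020, 5:31:02 PM",
        "Apr 8, 2020, 8:52:49 AM",
        "Apr 8, 2020, 12:05:17 PM",
        "Apr 8, 2020, 3:11:10 PM",
        "Apr 8, 2020, 3:20:30 PM",
        "Apr 8, 2020, 3:26:54 PM",
    ]
    for event in events:
        print(f"{event} -> [{engine_to_log_time(event)}")
```

Each printed prefix (for example "[2020-04-07 13:27:28") can be passed to grep against the uncompressed log to find the corresponding shard lookup errors.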

Thanks again. I'm available to retry on a sharding-enabled volume after modifying anything, if needed.
Gianluca