
Hey folks,

oh Jesus. 3-way HCI. Gluster without any issues:

[root@node01:/var/log/glusterfs] # gluster vol info ssd_storage

Volume Name: ssd_storage
Type: Replicate
Volume ID: d84ec99a-5db9-49c6-aab4-c7481a1dc57b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: node01.company.com:/gluster_bricks/ssd_storage/ssd_storage
Brick2: node02.company.com:/gluster_bricks/ssd_storage/ssd_storage
Brick3: node03.company.com:/gluster_bricks/ssd_storage/ssd_storage
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.strict-o-direct: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
network.ping-timeout: 30
storage.owner-uid: 36
storage.owner-gid: 36
cluster.granular-entry-heal: enab

[root@node01:/var/log/glusterfs] # gluster vol status ssd_storage
Status of volume: ssd_storage
Gluster process                                                    TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node01.company.com:/gluster_bricks/ssd_storage/ssd_storage   49152     0          Y       63488
Brick node02.company.com:/gluster_bricks/ssd_storage/ssd_storage   49152     0          Y       18860
Brick node03.company.com:/gluster_bricks/ssd_storage/ssd_storage   49152     0          Y       15262
Self-heal Daemon on localhost                                      N/A       N/A        Y       63511
Self-heal Daemon on node03.dc-dus.dalason.net                      N/A       N/A        Y       15285
Self-heal Daemon on 10.100.200.12                                  N/A       N/A        Y       18883

Task Status of Volume ssd_storage
------------------------------------------------------------------------------
There are no active volume tasks

[root@node01:/var/log/glusterfs] # gluster vol heal ssd_storage info
Brick node01.company.com:/gluster_bricks/ssd_storage/ssd_storage
Status: Connected
Number of entries: 0

Brick node02.company.com:/gluster_bricks/ssd_storage/ssd_storage
Status: Connected
Number of entries: 0

Brick node03.company.com:/gluster_bricks/ssd_storage/ssd_storage
Status: Connected
Number of entries: 0

And everything is mounted where it's supposed to be, but no VMs start due to an I/O error. I checked the md5 of a Gluster-based file (a CentOS ISO) against a local copy; it matches. One VM managed to start at one point but failed on subsequent starts. The data/disks seem okay. /var/log/glusterfs/"rhev-data-center-mnt-glusterSD-node01.company.com:_ssd__storage.log-20200202" has entries like:

[2020-02-01 23:15:15.449902] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ssd_storage-client-1: remote operation failed. Path: /.shard/86da0289-f74f-4200-9284-678e7bd76195.1405 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-02-01 23:15:15.484363] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ssd_storage-client-1: remote operation failed. Path: /.shard/86da0289-f74f-4200-9284-678e7bd76195.1400 (00000000-0000-0000-0000-000000000000) [Permission denied]

Before this happened we put one host into maintenance mode; it all started during the migration.

Any help? We're sweating blood here.

-- with kind regards, mit freundlichen Gruessen, Christian Reiss
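For completeness, the heal and connection state can also be checked explicitly; a minimal sketch, assuming a reasonably recent gluster CLI (output omitted):

gluster volume heal ssd_storage info split-brain   # should report 0 entries per brick
gluster volume heal ssd_storage info summary       # per-brick pending/split-brain counts
gluster volume status ssd_storage clients          # confirm every node's FUSE client is connected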

I forgot the additional logs. Please, guys, any help... (insert scream here).

The log appears to indicate a permissions issue. What are the ownership and permissions on your Gluster brick directories and mounts?
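oVirt expects the storage domain to be owned by vdsm:kvm (uid/gid 36), matching the storage.owner-uid/gid options already set on the volume. A minimal sketch for comparing the configured values with what the bricks and the FUSE mount under /rhev/data-center/mnt/glusterSD/ actually show (paths taken from the volume info above):

gluster volume get ssd_storage storage.owner-uid
gluster volume get ssd_storage storage.owner-gid
stat -c '%U:%G %a %n' /gluster_bricks/ssd_storage/ssd_storage
stat -c '%U:%G %a %n' /gluster_bricks/ssd_storage/ssd_storage/.shard   # run on each node, and repeat on the mount point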

Thanks for replying.

/gluster_bricks/ssd_storage/ssd_storage/.shard is root:root 0660,

[root@node03:/gluster_bricks/ssd_storage/ssd_storage] # l
total 5.8M
drwxr-xr-x.   5 vdsm kvm    98 Feb  3 02:31 .
drwxr-xr-x.   3 root root   25 Jan  9 15:49 ..
drwxr-xr-x.   5 vdsm kvm    64 Feb  3 00:31 fec2eb5e-21b5-496b-9ea5-f718b2cb5556
drw-------. 262 root root 8.0K Jan  9 16:50 .glusterfs
drwxr-xr-x.   3 root root 4.7M Feb  3 00:31 .shard

[root@node03:/gluster_bricks/ssd_storage] # l
total 8.0K
drwxr-xr-x. 3 root root   25 Jan  9 15:49 .
drwxr-xr-x. 3 root root 4.0K Jan  9 15:49 ..
drwxr-xr-x. 5 vdsm kvm    98 Feb  3 02:31 ssd_storage

[root@node03:/gluster_bricks] # l
total 8.0K
drwxr-xr-x.  3 root root 4.0K Jan  9 15:49 .
dr-xr-xr-x. 21 root root 4.0K Feb  3 00:03 ..
drwxr-xr-x.  3 root root   25 Jan  9 15:49 ssd_storage

[root@node03:/] # l
total 348K
drwxr-xr-x. 3 root root 4.0K Jan  9 15:49 gluster_bricks

And

[root@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images] # l
total 345K
drwxr-xr-x. 46 vdsm kvm 8.0K Feb  2 23:18 .
drwxr-xr-x.  5 vdsm kvm   64 Feb  3 00:31 ..
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 17 15:54 0b21c949-7133-4b34-b909-a6660ae12800
drwxr-xr-x.  2 vdsm kvm  165 Feb  3 01:48 0dde79ab-d773-4d23-b397-7c39371ccc60
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 17 09:49 1347d489-012b-40fc-acb5-d00a9ea133a4
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 22 15:04 1ccc4db6-f47d-4474-b0fa-a0c1eddb0fa7
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 21 16:28 22cab044-a26d-4266-9af7-a6408eaf140c
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 30 06:03 288d061a-6c6c-4536-a594-3bede63c0654
drwxr-xr-x.  2 vdsm kvm 8.0K Jan  9 16:46 40c51753-1533-45ab-b9de-2c51d8a18370

Containing files as well.
--
Christian Reiss - email@christian-reiss.de / support@alpha-labs.net
WEB: alpha-labs.net
ASCII Ribbon Campaign against HTML in eMails
GPG Retrieval: https://gpg.christian-reiss.de
GPG ID: ABCD43C5, 0x44E29126ABCD43C5
GPG fingerprint = 9549 F537 2596 86BA 733C A4ED 44E2 9126 ABCD 43C5

"It's better to reign in hell than to serve in heaven.", John Milton, Paradise Lost.

I checked my HCI cluster and those permissions match what I'm seeing. Since no VMs are running at the moment, have you tried restarting the Gluster volumes as well as the glusterd service? I'm not sure what would have caused this with one host placed in maintenance.
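For reference, a sketch of that restart sequence (only with the storage domain down, since stopping the volume interrupts all clients):

systemctl restart glusterd            # on each node, one at a time
gluster volume stop ssd_storage
gluster volume start ssd_storage
gluster volume heal ssd_storage info  # re-check heal state afterwards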

Hey,

it was _while_ placing the host _into_ maintenance, to be precise. I restarted the volumes and even each machine and the entire cluster, to no avail.

I am currently migrating the disk images out of oVirt into OpenVZ/KVM to get them running. The copied disk images are flawless and working.
-- with kind regards, mit freundlichen Gruessen, Christian Reiss

Check the contents of these directories:

[root@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images] # l
total 345K
drwxr-xr-x. 46 vdsm kvm 8.0K Feb  2 23:18 .
drwxr-xr-x.  5 vdsm kvm   64 Feb  3 00:31 ..
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 17 15:54 0b21c949-7133-4b34-b909-a6660ae12800
drwxr-xr-x.  2 vdsm kvm  165 Feb  3 01:48 0dde79ab-d773-4d23-b397-7c39371ccc60
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 17 09:49 1347d489-012b-40fc-acb5-d00a9ea133a4
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 22 15:04 1ccc4db6-f47d-4474-b0fa-a0c1eddb0fa7
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 21 16:28 22cab044-a26d-4266-9af7-a6408eaf140c
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 30 06:03 288d061a-6c6c-4536-a594-3bede63c0654
drwxr-xr-x.  2 vdsm kvm 8.0K Jan  9 16:46 40c51753-1533-45ab-b9de-2c51d8a18370

and what version of oVirt are you running? This looks a bit like a libvirt change/bug that changed ownership on the actual disk image to root.root on shutdown/migrations, preventing later start attempts.

This may help if that's the case:

chown -R vdsm.kvm /rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images
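If it is that, it may be worth first listing anything in the images tree that is not vdsm:kvm before running the recursive chown; a sketch using the same mount path:

find /rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images \
  \( ! -user vdsm -o ! -group kvm \) -ls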

Hey,

they're all in this form:

[root@node03:[..]/images/6113f079-fd28-4165-a807-61bb7625cd48] # l
total 49G
drwxr-xr-x.  2 vdsm kvm 8.0K Jan 29 23:02 .
drwxr-xr-x. 46 vdsm kvm 8.0K Feb  2 23:18 ..
-rw-rw----.  1 vdsm kvm  50G Jan 29 02:02 83f7942f-c74e-4bc4-a816-09988e7ab308
-rw-rw----.  1 vdsm kvm 1.0M Jan 23 12:16 83f7942f-c74e-4bc4-a816-09988e7ab308.lease
-rw-r--r--.  1 vdsm kvm  323 Jan 29 23:02 83f7942f-c74e-4bc4-a816-09988e7ab308.meta
-rw-rw----.  1 vdsm kvm  20G Feb  2 21:42 f72a4a62-b280-4bdf-9570-96d4b6577d89
-rw-rw----.  1 vdsm kvm 1.0M Jan 29 23:02 f72a4a62-b280-4bdf-9570-96d4b6577d89.lease
-rw-r--r--.  1 vdsm kvm  251 Jan 29 23:02 f72a4a62-b280-4bdf-9570-96d4b6577d89.meta

Looks good (enough) to me.
-- with kind regards, mit freundlichen Gruessen, Christian Reiss

Déjà vu for me. Enable the brick trace log (for a short time, or you will run out of space) and check whether ACL is the reason.

What is your gluster version? Did you test VM power-off & power-on after the last gluster upgrade?

If it is ACL, you have 3 options (not valid for 7.1 & 7.2):

1. Mount with ACL enabled:
mount -t glusterfs -o acl brick1:/volume1 /mnt
and run a dummy setfacl:
find /mnt -exec setfacl -m u:root:rw {} \;

2. Kill the gluster processes and start the volume with the 'force' option:
gluster volume start <volume> force
(or something like that).

3. Maybe a downgrade, though I'm not in a production environment and that could be different for you.

Best Regards,
Strahil Nikolov
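A sketch of the trace-log part, assuming the standard diagnostics options (TRACE grows very quickly, so turn it back down right after reproducing one failed read; the brick log file name follows the brick path):

gluster volume set ssd_storage diagnostics.brick-log-level TRACE
# reproduce one failing read from a VM disk, then on the brick node:
grep -iE 'EACCES|Permission denied|acl' /var/log/glusterfs/bricks/gluster_bricks-ssd_storage-ssd_storage.log | tail
gluster volume set ssd_storage diagnostics.brick-log-level INFO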

Hey,

here is one more thing: the issue we had some time ago might (just might) be the culprit. We copied the one Gluster file over to the other nodes. The one correct node, which we took down yesterday, is node01; it has more metadata on said file:

[root@node01:~] # getfattr -m . -d -e hex /gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.ssd_storage-client-1=0x000000000000000000000000
trusted.gfid=0xa121e4fb09844e4194d78f0c4f87f4b6
trusted.gfid2path.d4cf876a215b173f=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f38366461303238392d663734662d343230302d393238342d3637386537626437363139352e31323030
trusted.glusterfs.mdata=0x010000000000000000000000005e35ed17000000003069a5de000000005e35ed17000000003069a5de000000005e34994900000000304a5eb2

The other nodes have significantly less info:

[root@node02:~] # getfattr -m . -d -e hex /gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xa121e4fb09844e4194d78f0c4f87f4b6
trusted.gfid2path.d4cf876a215b173f=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f38366461303238392d663734662d343230302d393238342d3637386537626437363139352e31323030
trusted.glusterfs.mdata=0x010000000000000000000000005e35ed17000000003069a5de000000005e35ed17000000003069a5de000000005e3595f8000000003572d5ba

[root@node03:~] # getfattr -m . -d -e hex /gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.glusterfs.mdata=0x010000000000000000000000005e35ed17000000003069a5de000000005e35ed17000000003069a5de000000005e34994900000000304a5eb2

Maybe, just maybe, this file contains a lot of required data? The chunk size is 64 MB, and the md5 matches across the board. I also monitored the access and modify times for this file across all three nodes; the times, size and md5 match.

How could I reset the header info to match all three?

-Chris.

-- with kind regards, mit freundlichen Gruessen, Christian Reiss

You can use setfattr. Have you converted the gfid to a file and checked the file contents (if ASCII)?

Usually I first convert the gfid to a file (in the brick) and then check the timestamps and content of the file on all bricks before deciding what to do.

Best Regards,
Strahil Nikolov
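For a regular file, the gfid entry under .glusterfs is a hard link to the data file on the brick, so it can be resolved and compared per brick roughly like this (a sketch, run as root on each node, using the gfid from the previous mail):

B=/gluster_bricks/ssd_storage/ssd_storage
find $B -samefile $B/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6 ! -path '*/.glusterfs/*'
# then stat / md5sum the resolved path on every brick before touching any xattrs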

Hey,

I think I am barking up the right tree with something (else) here; note the timestamps & IDs:

dd'ing a disk image as the vdsm user, try 1:

[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
$ date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null
Mon 3 Feb 11:39:13 CET 2020
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
dd: error reading ‘5fca6d0e-e320-425b-a89a-f80563461add’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.169465 s, 396 MB/s
  64MiB 0:00:00 [ 376MiB/s] [ <=> ]
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.171726 s, 391 MB/s

try 2, directly afterward:

[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
$ date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null
Mon 3 Feb 11:39:16 CET 2020
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
dd: error reading ‘5fca6d0e-e320-425b-a89a-f80563461add’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.148846 s, 451 MB/s
  64MiB 0:00:00 [ 427MiB/s] [ <=> ]
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.149589 s, 449 MB/s

the same try as root:

[root@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
# date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null
Mon 3 Feb 11:39:33 CET 2020
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
  50GiB 0:03:06 [ 274MiB/s] [ <=> ]
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 186.501 s, 288 MB/s
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 186.502 s, 288 MB/s

Followed by another vdsm dd test:

[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
$ date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null
Mon 3 Feb 11:42:46 CET 2020
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
  50GiB 0:02:56 [ 290MiB/s] [ <=> ]
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 176.189 s, 305 MB/s
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 176.19 s, 305 MB/s

So it's a permission problem (access denied) unless root loads it first? Strange: things like file & stat work; I can even cat the meta file (a small text file). It seems only the disk images (or large files?) are affected.

huh!?

-- with kind regards, mit freundlichen Gruessen, Christian Reiss
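Since ownership and mode look right but only root can read the whole image, it may be worth checking whether a stray POSIX ACL is attached to the image or to its shards on the bricks; a rough sketch, where <image-gfid>.<n> stands in for one of the shard names reported in the mount log (hypothetical placeholder):

getfacl /rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac/5fca6d0e-e320-425b-a89a-f80563461add
getfacl /gluster_bricks/ssd_storage/ssd_storage/.shard/<image-gfid>.<n>   # on a brick, as root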

Further findings:

- Modified data gets written to the local node only, not across the Gluster.
- The vdsm user can create _new_ files on the cluster; these get synced immediately.
- vdsm can modify newly created files across all nodes; changes apply immediately.

I think the vdsm user cannot modify already existing files over the Gluster. Something SELinux?

-Chris.
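If SELinux is a suspect, a quick first check on each node might look like this (a sketch; ausearch needs auditd to be running):

getenforce
ausearch -m avc -ts recent    # any AVC denials around the failed reads?
ls -ldZ /gluster_bricks/ssd_storage/ssd_storage /rhev/data-center/mnt/glusterSD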

Ugh, disregarding all previous statements:

New findings: the vdsm user can NOT read files larger than 64 MB. Root can.

[vdsm@node02:/rhev/data-cente[...]c51d8a18370]
$ for i in 60 62 64 66 68 ; do dd if=/dev/urandom of=file-$i bs=1M count=$i ; done

[vdsm@node03:/rhev/data-cente[...]c51d8a18370]
$ for i in 60 62 64 66 68 ; do echo $i ; dd if=file-$i of=/dev/null ; done
60
122880+0 records in
122880+0 records out
62914560 bytes (63 MB) copied, 0.15656 s, 402 MB/s
62
126976+0 records in
126976+0 records out
65011712 bytes (65 MB) copied, 0.172463 s, 377 MB/s
64
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.180701 s, 371 MB/s
66
dd: error reading ‘file-66’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.105236 s, 638 MB/s
68
dd: error reading ‘file-68’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.17046 s, 394 MB/s

The files appeared instantly on all nodes. Writing large files seems to work, however.

I think this is the core issue.
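That 64 MB boundary matches the shard size, so the failures seem to start exactly where the client has to follow a file into /.shard. A sketch for confirming the configured shard size and looking at one of the shards named in the mount log directly on a brick:

gluster volume get ssd_storage features.shard-block-size
getfacl /gluster_bricks/ssd_storage/ssd_storage/.shard/86da0289-f74f-4200-9284-678e7bd76195.1400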

What version of gluster are you running? Have you tried the workarounds: running a fake setfacl, or killing the brick processes and starting the volume with the 'force' option?

I saw in your brick output that your SELinux context is 'unconfined_u' ... so check the labeling. Still, it looks like my ACL issue.

Best Regards,
Strahil Nikolov
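A sketch of those two checks (the force start briefly restarts the brick processes, so only run it with the VMs down; matchpathcon prints the label SELinux policy expects for a path):

gluster volume start ssd_storage force
ls -ldZ /gluster_bricks/ssd_storage/ssd_storage/.shard
matchpathcon /gluster_bricks/ssd_storage/ssd_storage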

Hey,

For posterity: sadly, the only way to fix this was to re-init (wipe) Gluster and start from scratch.

-Chris.
-- with kind regards, mit freundlichen Gruessen, Christian Reiss

I appreciate the updates you've been posting. It's concerning to me as a Gluster user as well. It would be nice to figure out what happened here.
Hey,
For prosperity: Sadly the only way to fix this was to re-init (wipe) gluster and start from scratch.
-Chris.
On February 3, 2020 2:29:55 PM GMT+02:00, Christian Reiss < email@christian-reiss.de> wrote:
Ugh,
disregarding off all previous stamenets:
new findinds: vdsm user can NOT read files larger than 64mb. Root can.
[vdsm@node02:/rhev/data-cente[...]c51d8a18370] $ for i in 60 62 64 66 68 ; do dd if=/dev/urandom of=file-$i bs=1M count=$i ; done
[vdsm@node03:/rhev/data-cente[...]c51d8a18370] $ for i in 60 62 64 66 68 ; do echo $i ; dd if=file-$i of=/dev/null ; done 60 122880+0 records in 122880+0 records out 62914560 bytes (63 MB) copied, 0.15656 s, 402 MB/s 62 126976+0 records in 126976+0 records out 65011712 bytes (65 MB) copied, 0.172463 s, 377 MB/s 64 131072+0 records in 131072+0 records out 67108864 bytes (67 MB) copied, 0.180701 s, 371 MB/s 66 dd: error reading ‘file-66’: Permission denied 131072+0 records in 131072+0 records out 67108864 bytes (67 MB) copied, 0.105236 s, 638 MB/s 68 dd: error reading ‘file-68’: Permission denied 131072+0 records in 131072+0 records out 67108864 bytes (67 MB) copied, 0.17046 s, 394 MB/s
The files appeared instantly on all nodes, Writing large files work, however. Writing large files seem to work.
I think this is the core issue.
On 03/02/2020 12:22, Christian Reiss wrote:
Further findings:
- modified data gets written to local node, not across gluster. - vdsm user can create _new_ files on the cluster, this gets synced immediatly. - vdsm can modify, across all nodes newly created files, changes apply immediately.
I think vdsm user can not modify already existing files over the gluster. Something selinux?
-Chris.
On 03/02/2020 11:46, Christian Reiss wrote:
Hey,
I think I am barking up the right tree with something (else) here; Note the timestamps & id's:
dd'ing a disk image as vdsm user, try 1:
[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net: _ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
$ date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null Mon 3 Feb 11:39:13 CET 2020 uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 dd: error reading ‘5fca6d0e-e320-425b-a89a-f80563461add’: Permission
denied 131072+0 records in 131072+0 records out 67108864 bytes (67 MB) copied, 0.169465 s, 396 MB/s 64MiB 0:00:00 [ 376MiB/s] [ <=> ] 131072+0 records in 131072+0 records out 67108864 bytes (67 MB) copied, 0.171726 s, 391 MB/s
try 2, directly afterward:
[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
$ date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null
Mon 3 Feb 11:39:16 CET 2020
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
dd: error reading ‘5fca6d0e-e320-425b-a89a-f80563461add’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.148846 s, 451 MB/s
64MiB 0:00:00 [ 427MiB/s] [ <=> ]
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.149589 s, 449 MB/s
try same as root:
[root@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
# date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null
Mon 3 Feb 11:39:33 CET 2020
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
50GiB 0:03:06 [ 274MiB/s] [ <=> ]
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 186.501 s, 288 MB/s
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 186.502 s, 288 MB/s
Followed by another vdsm dd test:
[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/images/4a55b9c0-d550-4ecb-8dd1-cc1f24f2c7ac]
$ date ; id ; dd if=5fca6d0e-e320-425b-a89a-f80563461add | pv | dd of=/dev/null
Mon 3 Feb 11:42:46 CET 2020
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
50GiB 0:02:56 [ 290MiB/s] [ <=> ]
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 176.189 s, 305 MB/s
104857600+0 records in
104857600+0 records out
53687091200 bytes (54 GB) copied, 176.19 s, 305 MB/s
So it's a permission problem (access denied) unless root loads it first? Strange: doing things like file & stat work; I can even cat the meta file (small text file). Seems only the disk images (or large files?) are affected.
huh!?
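One way to pin down which call actually fails for the vdsm user (a sketch; bs/count just need to cross the 64 MB mark):

sudo -u vdsm strace -f -e trace=open,openat,read dd if=5fca6d0e-e320-425b-a89a-f80563461add of=/dev/null bs=1M count=80 2>&1 | grep -E 'EACCES|Permission'

If the open succeeds and only a read past 64 MB returns EACCES, the denial comes from the shard lookup rather than from the base file itself, which would line up with the /.shard 'Permission denied' entries in the mount log.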
On 03/02/2020 19:23, Strahil Nikolov wrote:
[...]

On February 6, 2020 6:04:58 PM GMT+02:00, Jayme <jaymef@gmail.com> wrote:
[...]
Hi Jayme,
It's the ACL bug that hit me some time ago. It was supposed to be fixed in 6.6, yet it isn't (even 7.1 & 7.2 are affected). For some reason, gluster loses all ACL data and thus only root can access the gluster shards.
Best Regards, Strahil Nikolov
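For anyone who hits this later, a way to check for the symptom Strahil describes, assuming the brick layout from this thread (the domain/image/GFID path components are placeholders): compare the base image file with its shards directly on a brick.

gluster --version
ls -l /gluster_bricks/ssd_storage/ssd_storage/<domain-uuid>/images/<image-uuid>/<volume-uuid>
getfacl /gluster_bricks/ssd_storage/ssd_storage/<domain-uuid>/images/<image-uuid>/<volume-uuid>
ls -l /gluster_bricks/ssd_storage/ssd_storage/.shard/<gfid>.1
getfacl /gluster_bricks/ssd_storage/ssd_storage/.shard/<gfid>.1

If the base file still carries the vdsm/kvm permissions but the shards come back root-only or with their ACL entries missing, that matches the 'only root can read past the first 64 MB' behaviour seen above.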