IO Storage Error / All findings / Need help.

Hello folks,

I am running a 3-way, no-arbiter Gluster setup using oVirt and the contained Gluster 6.7. After a crash we are unable to start any VMs due to a Storage IO error. After much, much backtracking and debugging we are closing in on the symptoms, albeit not the issue.

Conditions:
- gluster volume is healthy,
- no outstanding heal or split-brain files,
- 3-way without arbiter nodes (3 copies),
- I already ran several "heal full" commands.

Gluster Volume Info

Volume Name: ssd_storage
Type: Replicate
Volume ID: d84ec99a-5db9-49c6-aab4-c7481a1dc57b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: node01.company.com:/gluster_bricks/ssd_storage/ssd_storage
Brick2: node02.company.com:/gluster_bricks/ssd_storage/ssd_storage
Brick3: node03.company.com:/gluster_bricks/ssd_storage/ssd_storage
Options Reconfigured:
cluster.self-heal-daemon: enable
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
network.ping-timeout: 30
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.strict-o-direct: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

Gluster Volume Status

Status of volume: ssd_storage
Gluster process                                                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node01.company.com:/gluster_bricks/ssd_storage/ssd_storage    49152     0          Y       8218
Brick node02.company.com:/gluster_bricks/ssd_storage/ssd_storage    49152     0          Y       23595
Brick node03.company.com:/gluster_bricks/ssd_storage/ssd_storage    49152     0          Y       8080
Self-heal Daemon on localhost                                       N/A       N/A        Y       66028
Self-heal Daemon on 10.100.200.12                                   N/A       N/A        Y       52087
Self-heal Daemon on node03.company.com                              N/A       N/A        Y       8372

Task Status of Volume ssd_storage
------------------------------------------------------------------------------
There are no active volume tasks

The mounted path where the oVirt VM files reside is 100% okay; we copied all the images out of there onto standalone hosts and the images run just fine. There is no obvious data corruption. However, launching any VM out of oVirt fails with "IO Storage Error".

This is where everything gets funny. oVirt uses the vdsm user to access all the files.

Findings:
- root can read, edit and write all files inside the oVirt-mounted gluster path.
- the vdsm user can write to new files regardless of size without any issues; changes get replicated instantly to the other nodes.
- the vdsm user can append to existing files regardless of size without any issues; changes get replicated instantly to the other nodes.
- the vdsm user can read files if those files are smaller than 64 MB.
- the vdsm user gets permission denied errors if the file to be read is 65 MB or bigger.
- the vdsm user gets permission denied errors if the request crosses a gluster shard-file boundary.
- if root does a "dd if=file_larger_than_64mb of=/dev/null" on any large file, that file can then be read by the vdsm user on that single node. This effect does not carry over to the other nodes.
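[Editorial note: the 64 MB threshold in the findings above lines up with GlusterFS's default shard size. A quick sanity check, assuming only the volume name from the info above:]

  # show the configured shard size for this volume; the default is 64MB, which is
  # exactly where the vdsm reads start failing
  gluster volume get ssd_storage features.shard-block-size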
Example:

id of the vdsm user & sudo to them:

[vdsm@node01:/rhev/data-center/mnt/glusterSD/node01.company.com:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/test]
$ id
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

[vdsm@node02:/rhev/data-center/mnt/glusterSD/node01.company.com:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/test]
$ id
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.company.com:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/test]
$ id
uid=36(vdsm) gid=36(kvm) groups=36(kvm),107(qemu),179(sanlock) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Create a file >64 MB on one node:

[vdsm@node03:/rhev/data-center/mnt/glusterSD/node01.company.com:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/test]
$ base64 /dev/urandom | head -c 200000000 > file.txt
$ ls -lha
total 191M
drwxr-xr-x. 2 vdsm kvm   30 Feb  4 13:10 .
drwxr-xr-x. 6 vdsm kvm   80 Jan  1  1970 ..
-rw-r--r--. 1 vdsm kvm 191M Feb  4 13:10 file.txt

File is instantly available on another node:

[vdsm@node01:/rhev/data-center/mnt/glusterSD/node01.company.com:_ssd__storage/fec2eb5e-21b5-496b-9ea5-f718b2cb5556/test]
$ ls -lha
total 191M
drwxr-xr-x. 2 vdsm kvm   30 Feb  4 13:10 .
drwxr-xr-x. 6 vdsm kvm   80 Jan  1  1970 ..
-rw-r--r--. 1 vdsm kvm 191M Feb  4 13:10 file.txt

Accessing the whole file fails:

[vdsm@node01:] $ dd if=file.txt of=/dev/null
dd: error reading ‘file.txt’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.0651919 s, 1.0 GB/s

Reading the first 64 MB works, 65 MB (crossing the boundary) does not:

[vdsm@node01:] $ dd if=file.txt bs=1M count=64 of=/dev/null
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.00801663 s, 8.4 GB/s

[vdsm@node01:] $ dd if=file.txt bs=1M count=65 of=/dev/null
dd: error reading ‘file.txt’: Permission denied
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.00908712 s, 7.4 GB/s

Appending to the file works (not crossing a boundary):

[vdsm@node01:] $ date >> file.txt
[vdsm@node01:] $
[vdsm@node02:] $ tail -n2 file.txt
E16ACZaLqLhx2oUUUov5JHvQcVFohn6HH+eog6XZCiTaG0Tue 4 Feb 13:18:37 CET 2020

Reading the file's beginning & end works; if the read crosses the boundary, not so much:

[vdsm@node02:] $ head file.txt
jrZOxGaGvwfpGSwn1BKWWmFC4556KNzXsD2BCwY78tnV1mRY54IxnE+hbnszRyWgVuXhBpVRoJTp
xvVwktZwSytMyvJjsSt7pQbXbHSY66tRe/rvrw5dHr3RNJn9HjqtlKQ9mHVX4ch1HkU5posSmDbg
vwzxBTXWfxLDMmIghyTgBTSFiI9Xg8W6htxDpxrbO+10EzlnaN1Am5tAlTkfrorNLyihpiQhUPGG
ag6tJUcFj3IySGRTAxnStFRQoBXN5dlyx1Sqc4s/Tpl7gkgR8+I7UcdRKISjgcGcpW+zrXKqFF/H
Dwv6ql+2ysPRrtlbt2V8Zf697VsNX5DTgZS9BKmWlAeqejNYaqG5Rsuhn7szbCfkkmsjedk+Rdcv
A3SHMBeHXdtfBHS0AlbEwKgeml08NmCUcwnifhrQywCnu8NN9+RQ3cUxGvIuLLSzi3915wC6hbxr
8xArckQfSUfKA/hrHvoiiCGZU9D23xj3XXtsjdbIIDXATDnCPrKANdvGN5LTKal8bT0jXORfAz1z
MniqVUgvWVNcviPgQ9BfT5qpGo8g7LaoBMGamAGVX6Ezrs04rk8jQ1yz1bB/8URfTRLZdyYkMh0u
MB4xMylnyavgusoi7Duf5RuYJvNaL0g8Lx/cfGpGsGwdD2Lj/qRC45ammn6wCxDVfiJV6Z/TzJcY
PBvzWK5xT++PQgMV8EwtXwA1kFqaGrcuiDHejMQ8O82Edjr+eBCBe0B7bRddoMD6oOlhNm1YsSNt

[vdsm@node02:] $ tail file.txt
9JX8OWCJwbyvEPDyyI30H1/jPZfDo1sS11dZ2JjiO7qhB45VaU8+irG45D0GGJhFf8wE8TD9EGWG
8346QHLX9ZSFsbjpuh71hr5Ju1UduVdvIDwwP8WDBtRUbMAVvsyGR33rkpijepmUjmYl/jeZ7rsC
VyUVlmG5PxrI7KKxz5dSkzApqVHKKgsf93JMDAdPwvXTq4hhZdUJ581w9FC/f9k2wWldEGkAcyB0
cCKp+VJl2vx989KUoqAJzsrvYdK0X7itruqYdpC29JXode+7NixUflhKvPdKmitBYyCEgCcyxUyn
eyMOdaan2x8d8MztLLoWLpp+gLzl2Hev7y3OXq6I9SVN2t+hcVIz8Llmumy0cD+VC4u2/UZszYqS
nDaSSMs35agGUUgIpHjPxCRf/yqnfrJJMTGAcxSEqHtpEdsjEmkf4QkyEgEZ13f4oi7P/DFCIIvV
JBsHzOLDoetnFzAA2/RqbDflPrVWcAR7tXVqGLACCj2s19uUFSNb8nBWmEk8fFz31iJhuL43v0WE
78/THl49T0hhzHQp6kdIiw5p1zPUIFGBZ0BS4mBCHxu+tMlPZe1zWJMJZdPnvDNtHZ4gQ6LFgU4w
E16ACZaLqLhx2oUUUov5JHvQcVFohn6HH+eog6XZCiTaG0Tue 4 Feb 13:18:37 CET 2020

[vdsm@node02:] $ dd if=file.txt of=/dev/null
dd: error reading ‘file.txt’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.106097 s, 633 MB/s

If root does the dd first, all is peachy:

[root@node02] # dd if=file.txt of=/dev/null
390625+1 records in
390625+1 records out
200000058 bytes (200 MB) copied, 0.345906 s, 578 MB/s

[vdsm@node02] $ dd if=file.txt of=/dev/null
390625+1 records in
390625+1 records out
200000058 bytes (200 MB) copied, 0.188451 s, 1.1 GB/s

Error in the gluster.log:

[2020-02-04 12:27:57.915356] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ssd_storage-client-1: remote operation failed. Path: /.shard/57200f4f-537d-4e56-9258-38fe6ac64c4e.2 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-02-04 12:27:57.915404] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ssd_storage-client-0: remote operation failed. Path: /.shard/57200f4f-537d-4e56-9258-38fe6ac64c4e.2 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-02-04 12:27:57.915472] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ssd_storage-client-2: remote operation failed. Path: /.shard/57200f4f-537d-4e56-9258-38fe6ac64c4e.2 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-02-04 12:27:57.915490] E [MSGID: 133010] [shard.c:2327:shard_common_lookup_shards_cbk] 0-ssd_storage-shard: Lookup on shard 2 failed. Base file gfid = 57200f4f-537d-4e56-9258-38fe6ac64c4e [Permission denied]

What we tried:
- restarting single hosts,
- restarting the entire cluster,
- doing stuff like find /rhev ... -exec stat {} \;
- dd'ing (read) all of the mount dir...

We are out of ideas and are also no experts on either gluster or oVirt, it seems. And this is supposed to be a production HA environment. Any help would be appreciated. I hope I thought of all the relevant data and logs.

--
with kind regards,
mit freundlichen Gruessen,

Christian Reiss
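[Editorial note: the gluster.log above complains about a lookup on /.shard/57200f4f-537d-4e56-9258-38fe6ac64c4e.2. With sharding enabled, that shard is stored as a regular file under the hidden .shard directory on each brick, so its on-disk owner, mode and ACL xattr can be inspected there directly. A small diagnostic sketch, assuming only the brick path from the volume info above:]

  # on any brick node, look at the shard the client failed to look up
  ls -l /gluster_bricks/ssd_storage/ssd_storage/.shard/57200f4f-537d-4e56-9258-38fe6ac64c4e.2

  # POSIX ACLs and gluster metadata live in xattrs on the brick; compare owner, mode
  # and the system.posix_acl_access xattr with a shard the vdsm user can still read
  getfacl   /gluster_bricks/ssd_storage/ssd_storage/.shard/57200f4f-537d-4e56-9258-38fe6ac64c4e.2
  getfattr -d -m . -e hex /gluster_bricks/ssd_storage/ssd_storage/.shard/57200f4f-537d-4e56-9258-38fe6ac64c4e.2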

Hi Chris,

Did you try to downgrade to gluster v6.6, for example? Most probably it's the ACL issue I experienced some time ago. Gluster devs recommended either a "fake" ACL rewrite (for example, mounting the volume with acl and running a find with 'setfacl -m u:root:rw {} \;' to force gluster to re-read the ACL data from the bricks) or killing the glusterd processes and starting the volume forcefully.

In my case with 7.0 -> 7.2 the only fix was a downgrade, so you should consider that. Now every gluster upgrade of mine involves a test power-off and power-on of several VMs from all Gluster volumes.

Best Regards,
Strahil Nikolov
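[Editorial note: a rough sketch of the two workarounds described above, using the volume and node names from this thread; the mountpoint is arbitrary, and killing brick processes interrupts I/O on that node, so treat this as an outline rather than a tested procedure:]

  # Workaround 1: mount with ACL support and touch every file's ACL so the
  # bricks' posix-acl layer re-reads it from disk
  mount -t glusterfs -o acl node01.company.com:/ssd_storage /mnt/ssd_storage
  find /mnt/ssd_storage -exec setfacl -m u:root:rw {} \;

  # Workaround 2: restart the gluster processes on a node and force-start the volume
  systemctl stop glusterd
  pkill glusterfsd          # brick processes
  systemctl start glusterd
  gluster volume start ssd_storage force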

Hey,

The ACL is correctly set:

# file: 5aab365f-b1b9-49d0-b011-566bf936a100
# owner: vdsm
# group: kvm
user::rw-
group::rw-
other::---

Doing a setfacl failed due to "Operation not supported", and so did remounting with acl:

[root@node01 ~]# mount -o remount,acl /rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net\:_ssd__storage/
/bin/sh: glusterfs: command not found

As I am running the oVirt node I am not sure how feasible down-/upgrading is. I think I am stuck with what I have.

Also, if this were a permission issue, I would not be able to access the file at all. It seems I can access some of it, and all of it once root has read the whole file first.
-- with kind regards, mit freundlichen Gruessen, Christian Reiss

Thanks for replying,

What I just wrote to Strahil was:

The ACL is correctly set:

# file: 5aab365f-b1b9-49d0-b011-566bf936a100
# owner: vdsm
# group: kvm
user::rw-
group::rw-
other::---

Doing a setfacl failed due to "Operation not supported", and so did remounting with acl:

[root@node01 ~]# mount -o remount,acl /rhev/data-center/mnt/glusterSD/node01.dc-dus.dalason.net\:_ssd__storage/
/bin/sh: glusterfs: command not found

As I am running the oVirt node I am not sure how feasible down-/upgrading is. I think I am stuck with what I have.

Also, if this were a permission issue, I would not be able to access the file at all. It seems I can access some of it, and all of it once root has read the whole file first.

I also did the chown from the mountpoint again, even though it was already correctly set, to no avail.
-- with kind regards, mit freundlichen Gruessen, Christian Reiss

Hey Christian,

The symptoms were the same:

1. sudo -u vdsm dd if=disk of=/dev/null bs=4M fails when the first shard is met (64 MB by default).
2. When the brick log is set to trace, it is confirmed that Gluster's ACL (not the OS one) is causing the issue.
3. If the dd is run by root and immediately again as vdsm, there are no issues at all.

I'm just sharing my experience. If you use the node, I guess you can reboot and select the previous entry in the grub menu...

Best Regards,
Strahil Nikolov
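[Editorial note: for point 2 above, a minimal sketch of capturing those brick-side ACL denials, assuming the volume name and brick path from this thread; the brick log file name is derived from the brick path, and TRACE is very noisy, so lower the level again afterwards:]

  gluster volume set ssd_storage diagnostics.brick-log-level TRACE
  # reproduce the failing read as vdsm, then on a brick node:
  grep -i posix_acl /var/log/glusterfs/bricks/gluster_bricks-ssd_storage-ssd_storage.log
  gluster volume set ssd_storage diagnostics.brick-log-level INFO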

Hey,

With debug-level logging on the bricks I got this bit:

[2020-02-05 09:34:11.368305] I [MSGID: 139001] [posix-acl.c:263:posix_acl_log_permit_denied] 0-ssd_storage-access-control: client: CTX_ID:096e8723-f941-4e65-9ce6-5a4a03634d02-GRAPH_ID:0-PID:50568-HOST:node03.example.com-PC_NAME:ssd_storage-client-0-RECON_NO:-0, gfid: be318638-e8a0-4c6d-977d-7a937aa84806, req(uid:36,gid:36,perm:1,ngrps:3), ctx(uid:0,gid:0,in-groups:0,perm:000,updated-fop:INVALID, acl:-) [Permission denied]

I read it as follows:

req(uid:36,gid:36,perm:1,ngrps:3) -> the requesting UID is 36, which is vdsm.
ctx(uid:0,gid:0,in-groups:0,perm:000,updated-fop:INVALID, acl:-) -> the owning UID is root, zero matching groups, the resulting permissions for UID 36 are 000, access resolution INVALID / access denied, no ACL used.

Does this sound right? I tried manually mounting with

mount -t glusterfs node01.example.com:/ssd_storage /media -o acl

then setting the ACL inside one test dir:

setfacl -m u:root:rwx 2bd08834-349b-474c-94a9-0d815dd069cc

and testing:

sudo -u vdsm dd if=2bd08834-349b-474c-94a9-0d815dd069cc of=/dev/null
dd: error reading ‘2bd08834-349b-474c-94a9-0d815dd069cc’: Permission denied
131072+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 0.0662261 s, 1.0 GB/s

which resulted in the first mentioned error on node01.

(insert scream here)

-Chris
-- with kind regards, mit freundlichen Gruessen, Christian Reiss

I first noticed that issue when going from v6.5 to v6.6, so you still have the options to:

A) Try mounting via acl and run a find to set an ACL for the root user.
B) Downgrade to v6.5.
C) Upgrade to 7.0 (7.1 & 7.2 were broken for me).

It looks like Gluster is failing oVirt again :D

Best Regards,
Strahil Nikolov

Were you ever able to find a fix for this? I am facing the same problem and the case is similar to yours. We have a 6-node distributed-replicated Gluster; due to a network issue all servers got disconnected, and upon recovery one of the volumes started giving the same IO error. The files can be read as root but give an error when read as vdsm. Everything else is as in your case, including the oVirt versions.

While doing a full dd if=IMAGE of=/dev/null allows the disk to be mounted on one server temporarily, upon reboot/restart it returns to failing with an IO error. I had to create a completely new gluster volume and copy the disks from the failing volume as root to resolve this.

Did you create a bug report in Bugzilla for this?

Regards,
Hesham Ahmed

If you mean the ACL issue -> check https://bugzilla.redhat.com/show_bug.cgi?id=1797099

Ravi will be happy to have a setup that is already affected, so he can debug the issue. In my case, I have reverted to v7.0.

Best Regards,
Strahil Nikolov

My issue is with Gluster 6.7 (the default with oVirt 4.3.7), as is the case with Christian. I still have the failing volume and disks and can share any information required.

If this is a production setup, consider downgrading to v6.5 (although it is not recommended) as an option. Another one is to mount with the acl option and force a setfacl:

find /mnt -exec setfacl -m u:root:rw {} \;

Best Regards,
Strahil Nikolov

Hey,

I do not have the faulty cluster anymore; it's a production environment with HA requirements, so I really can't take it down for days or, even worse, weeks.

I am now running on CentOS 7 (manual install) with a manual Gluster 7.0 installation and current oVirt. So far so good. Time will tell :)
-- with kind regards, mit freundlichen Gruessen, Christian Reiss

In my case I am continuing with the oVirt Node 4.3.8-based Gluster 6.7 for the time being. I have resolved the issue by manually copying all disk images to a new gluster volume, which took days, especially since disks on gluster still don't support sparse file copy. But the threat of a temporary network failure bringing down the complete oVirt setup is a bit too much risk.
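[Editorial note: not something from this thread, just a possible way to speed up such a copy. Assuming raw images and that the destination mount handles sparse writes (which depends on the gluster/FUSE versions involved), a sparse-aware copy avoids rewriting the zeroed ranges; paths below are placeholders:]

  # copy a single raw image, re-creating holes for zero ranges
  cp --sparse=always /path/to/old_domain/images/IMG/DISK /path/to/new_domain/images/IMG/DISK

  # or let qemu-img skip zero clusters while copying (also works for qcow2)
  qemu-img convert -p -O raw /path/to/old/DISK /path/to/new/DISK

  # copied as root, so restore the ownership oVirt expects afterwards
  chown 36:36 /path/to/new_domain/images/IMG/DISK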

Hey Hesham,

Do you keep the old volumes? Maybe you can assist Ravi in debugging this issue?

Best Regards,
Strahil Nikolov
participants (3)
- Christian Reiss
- Hesham Ahmed
- Strahil Nikolov