How to make oVirt + GlusterFS bulletproof

Hi Guys, I have had a situation twice now where, due to an unexpected power outage, something went wrong and the VMs on GlusterFS were not recoverable. Gluster heal did not help and I could not start the VMs any more. Is there a way to make such a setup bulletproof? Does it matter which volume type I choose - raw or qcow2? Or thin provisioned versus preallocated? Any other advice?

IMO this is best handled at the hardware level with a UPS and battery/flash-backed controllers. Can you share more details about your oVirt setup? How many servers are you working with, and are you using replica 3 or replica 3 arbiter?
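
For reference, plain replica 3 keeps three full copies of every file, while "replica 3 arbiter 1" keeps two copies plus a small metadata-only arbiter brick that only arbitrates quorum. A rough sketch of how each is created - the host names, brick paths and the volume name "data" are placeholders:

# plain replica 3: three full data copies
gluster volume create data replica 3 \
    hostA:/gluster_bricks/data/data \
    hostB:/gluster_bricks/data/data \
    hostC:/gluster_bricks/data/data

# replica 3 with arbiter: two data copies plus an arbiter brick that holds only metadata
gluster volume create data replica 3 arbiter 1 \
    hostA:/gluster_bricks/data/data \
    hostB:/gluster_bricks/data/data \
    hostC:/gluster_bricks/arbiter/data

The arbiter variant saves space on the third node at the cost of only two readable data copies; both protect against split-brain through quorum.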

Hi Jayme, there is a UPS but the outages happened anyway. We also have a Raritan KVM but it is not supported by oVirt. The setup is 6 hosts - two groups of 3 hosts each, each group using one replica 3 volume. BTW, what would be the best Gluster volume layout for 6+ hosts?
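
One option for six hosts (an assumption on my part, not something settled in this thread) is a single distributed-replicate 2 x 3 volume instead of two separate replica 3 volumes, so every file still has three copies while capacity is spread across both triplets. A rough sketch with hypothetical host names and brick paths:

gluster volume create data replica 3 \
    host1:/gluster_bricks/data/data host2:/gluster_bricks/data/data host3:/gluster_bricks/data/data \
    host4:/gluster_bricks/data/data host5:/gluster_bricks/data/data host6:/gluster_bricks/data/data
# bricks are grouped in order: 1-3 form the first replica set, 4-6 the second,
# resulting in a "Distributed-Replicate 2 x 3" volume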

Are you using JBOD bricks or do you have some sort of RAID for each of the bricks? Are you using sharding? -wk

Hmm, I'm not sure. I just created the GlusterFS volumes on LVM volumes, changed ownership to vdsm.kvm and applied the virt group. Then I added them to oVirt as storage for VMs.
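
For reference, the steps described above usually come down to something like the sketch below; the volume name "data" is a placeholder, and whether ownership is set by chown on the mounted storage domain or via the storage.owner-* volume options varies between setups:

# apply the oVirt virt option group to the volume
gluster volume set data group virt

# make the images owned by vdsm:kvm (uid/gid 36 on oVirt hosts)
gluster volume set data storage.owner-uid 36
gluster volume set data storage.owner-gid 36
# or, equivalently, chown the mounted volume:
# chown -R 36:36 /path/where/the/volume/is/mounted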

Based on the logs you shared, it looks like a network issue - but it could always be something else. If you ever experience a situation like that again, please share the logs immediately and add the Gluster mailing list, in order to get assistance with the root cause. Best Regards, Strahil Nikolov

Thanks Strahil. The data center is remote, so I will definitely ask the lab guys to ensure the switch is connected to a battery-backed power socket. So is Gluster's weak point actually the network switch? Can it have difficulty working out which version of the data is correct after the switch has been off for some time?

Hi Jaroslaw, That point was from someone else. I don't think that Gluster has such a weak point. The only weak points I have seen are the infrastructure it relies on top of and, of course, its built-in limitations. You need to verify the following (a few of these are spelled out as commands in the sketch after this message):
- Mount options are important. Using 'nobarrier' without RAID-controller protection is devastating. I also use the following option when running Gluster + SELinux in enforcing mode: context=system_u:object_r:glusterd_brick_t:s0 - it tells the kernel the SELinux context of all files/dirs in the Gluster brick, which reduces I/O requests to the bricks. My mount options are: noatime,nodiratime,inode64,nouuid,context="system_u:object_r:glusterd_brick_t:s0"
- Next is your FS - if you use a HW RAID controller, you need to specify sunit= and swidth= for 'mkfs.xfs' (and don't forget '-i size=512'). This tells XFS about the hardware beneath.
- If you use thin LVM, make sure the '_tmeta' LV of the thinpool LV is not on top of a VDO device, as it doesn't dedupe very well. I'm using VDO in 'emulate512' mode as my 'PHY-SEC' is 4096 and oVirt doesn't like it :) . You can check yours via 'lsblk -t'.
- Configure and tune your VDO. I think 1 VDO = 1 fast disk (NVMe/SSD), as I'm not very good at tuning VDO. If you need dedupe - check Red Hat's documentation about the indexing, as the defaults are not optimal.
- Next is the disk scheduler. For NVMe the Linux kernel takes care of it, but for SSDs and large HW arrays you can enable multiqueue and switch to 'none' via udev rules. Of course, testing is needed for every prod environment :) Also consider using the noop/none I/O scheduler inside the VMs, as you don't want to reorder I/O requests at VM level, just at host level.
- You can set your CPU to avoid switching to lower C-states -> switching adds extra latency for the host/VM processes.
- Transparent Huge Pages can be a real problem, especially with large VMs. oVirt 4.4.x should now support native Huge Pages, which will reduce the stress on the OS.
- vm.swappiness, vm.dirty_background**** and vm.dirty_*** settings. You can check what RH Gluster Storage uses - the values are in the redhat-storage-server rpms at ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/ . They control when the system starts flushing memory to disk and when it blocks a process until all memory is flushed.
Best Regards, Strahil Nikolov
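
To make a few of those points concrete, here is a rough sketch; the device names, RAID stripe numbers and brick paths are all hypothetical, so adjust them to the actual hardware before using anything like this:

# XFS on a HW-RAID-backed LV: tell mkfs.xfs about the stripe geometry
# (sunit/swidth depend on the controller's stripe unit and number of data disks)
mkfs.xfs -f -i size=512 -d sunit=512,swidth=4096 /dev/gluster_vg/brick_data

# /etc/fstab entry for the brick, using the mount options quoted above
/dev/gluster_vg/brick_data /gluster_bricks/data xfs noatime,nodiratime,inode64,nouuid,context="system_u:object_r:glusterd_brick_t:s0" 0 0

# udev rule (e.g. /etc/udev/rules.d/60-scheduler.rules) switching non-rotational
# disks to the 'none' blk-mq scheduler
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"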

Hi Nikolov, Thanks for the very interesting answer :-) I do not use any RAID controller. I was hoping GlusterFS would take care of fault tolerance but apparently it failed. I have one Samsung 1TB SSD drive in each server for VM storage. I see it is of type "multipath". There is an XFS filesystem over standard LVM (not thin). Mount options are: inode64,noatime,nodiratime. SELinux was in permissive mode. I must read more about the things you described, as I have never dived into them. Please let me know if you have any suggestions :-) Thanks a lot! Jarek

One recommendation is to get rid of the multipath for your SSD. Replica 3 volumes are quite resilient and I'm really surprised it happened to you. For the multipath stuff, you can create something like this:
[root@ovirt1 ~]# cat /etc/multipath/conf.d/blacklist.conf
blacklist {
    wwid Crucial_CT256MX100SSD1_14390D52DCF5
}
As you are running multipath already, just run the following to get the wwid of your SSD:
multipath -v4 | grep 'got wwid of'
What were the gluster volume options you were running with? oVirt runs the volume with 'performance.strict-o-direct' and Direct I/O, so you should not lose any data. Best Regards, Strahil Nikolov
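
Putting those steps together, the workflow might look roughly like this; the wwid is just the example value from the mail above, and the reload command can differ between multipath versions, so treat it as a sketch:

# 1. find the wwid multipath assigned to the SSD
multipath -v4 | grep 'got wwid of'

# 2. blacklist that wwid so the device is no longer claimed by multipath
cat > /etc/multipath/conf.d/blacklist.conf <<'EOF'
blacklist {
    wwid Crucial_CT256MX100SSD1_14390D52DCF5
}
EOF

# 3. reload the configuration and flush the now-unused map
systemctl reload multipathd
multipath -F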

Thanks. I will get rid of multipath. I did not set performance.strict-o-direct specifically, I only changed permissions of the volume to vdsm.kvm and applied the virt group. Now I see performance.strict-o-direct was off. Could that be the reason for the data loss? Direct I/O is enabled in oVirt by the gluster mount option "-o direct-io-mode=enable", right? Below is the full list of the volume options.
Option Value ------ ----- cluster.lookup-unhashed on cluster.lookup-optimize on cluster.min-free-disk 10% cluster.min-free-inodes 5% cluster.rebalance-stats off cluster.subvols-per-directory (null) cluster.readdir-optimize off cluster.rsync-hash-regex (null) cluster.extra-hash-regex (null) cluster.dht-xattr-name trusted.glusterfs.dht cluster.randomize-hash-range-by-gfid off cluster.rebal-throttle normal cluster.lock-migration off cluster.force-migration off cluster.local-volume-name (null) cluster.weighted-rebalance on cluster.switch-pattern (null) cluster.entry-change-log on cluster.read-subvolume (null) cluster.read-subvolume-index -1 cluster.read-hash-mode 1 cluster.background-self-heal-count 8 cluster.metadata-self-heal off cluster.data-self-heal off cluster.entry-self-heal off cluster.self-heal-daemon on cluster.heal-timeout 600 cluster.self-heal-window-size 1 cluster.data-change-log on cluster.metadata-change-log on cluster.data-self-heal-algorithm full cluster.eager-lock enable disperse.eager-lock on disperse.other-eager-lock on disperse.eager-lock-timeout 1 disperse.other-eager-lock-timeout 1 cluster.quorum-type auto cluster.quorum-count (null) cluster.choose-local off cluster.self-heal-readdir-size 1KB cluster.post-op-delay-secs 1 cluster.ensure-durability on cluster.consistent-metadata no cluster.heal-wait-queue-length 128 cluster.favorite-child-policy none cluster.full-lock yes diagnostics.latency-measurement off diagnostics.dump-fd-stats off diagnostics.count-fop-hits off diagnostics.brick-log-level INFO diagnostics.client-log-level INFO diagnostics.brick-sys-log-level CRITICAL diagnostics.client-sys-log-level CRITICAL diagnostics.brick-logger (null) diagnostics.client-logger (null) diagnostics.brick-log-format (null) diagnostics.client-log-format (null) diagnostics.brick-log-buf-size 5 diagnostics.client-log-buf-size 5 diagnostics.brick-log-flush-timeout 120 diagnostics.client-log-flush-timeout 120 diagnostics.stats-dump-interval 0 diagnostics.fop-sample-interval 0 diagnostics.stats-dump-format json diagnostics.fop-sample-buf-size 65535 diagnostics.stats-dnscache-ttl-sec 86400 performance.cache-max-file-size 0 performance.cache-min-file-size 0 performance.cache-refresh-timeout 1 performance.cache-priority performance.cache-size 32MB performance.io-thread-count 16 performance.high-prio-threads 16 performance.normal-prio-threads 16 performance.low-prio-threads 32 performance.least-prio-threads 1 performance.enable-least-priority on performance.iot-watchdog-secs (null) performance.iot-cleanup-disconnected-reqsoff performance.iot-pass-through false performance.io-cache-pass-through false performance.cache-size 128MB performance.qr-cache-timeout 1 performance.cache-invalidation false performance.ctime-invalidation false performance.flush-behind on performance.nfs.flush-behind on performance.write-behind-window-size 1MB performance.resync-failed-syncs-after-fsyncoff performance.nfs.write-behind-window-size1MB performance.strict-o-direct off performance.nfs.strict-o-direct off performance.strict-write-ordering off performance.nfs.strict-write-ordering off performance.write-behind-trickling-writeson 
performance.aggregate-size 128KB performance.nfs.write-behind-trickling-writeson performance.lazy-open yes performance.read-after-open yes performance.open-behind-pass-through false performance.read-ahead-page-count 4 performance.read-ahead-pass-through false performance.readdir-ahead-pass-through false performance.md-cache-pass-through false performance.md-cache-timeout 1 performance.cache-swift-metadata true performance.cache-samba-metadata false performance.cache-capability-xattrs true performance.cache-ima-xattrs true performance.md-cache-statfs off performance.xattr-cache-list performance.nl-cache-pass-through false features.encryption off network.frame-timeout 1800 network.ping-timeout 42 network.tcp-window-size (null) client.ssl off network.remote-dio enable client.event-threads 4 client.tcp-user-timeout 0 client.keepalive-time 20 client.keepalive-interval 2 client.keepalive-count 9 network.tcp-window-size (null) network.inode-lru-limit 16384 auth.allow * auth.reject (null) transport.keepalive 1 server.allow-insecure on server.root-squash off server.all-squash off server.anonuid 65534 server.anongid 65534 server.statedump-path /var/run/gluster server.outstanding-rpc-limit 64 server.ssl off auth.ssl-allow * server.manage-gids off server.dynamic-auth on client.send-gids on server.gid-timeout 300 server.own-thread (null) server.event-threads 4 server.tcp-user-timeout 42 server.keepalive-time 20 server.keepalive-interval 2 server.keepalive-count 9 transport.listen-backlog 1024 transport.address-family inet performance.write-behind on performance.read-ahead off performance.readdir-ahead on performance.io-cache off performance.open-behind on performance.quick-read off performance.nl-cache off performance.stat-prefetch on performance.client-io-threads on performance.nfs.write-behind on performance.nfs.read-ahead off performance.nfs.io-cache off performance.nfs.quick-read off performance.nfs.stat-prefetch off performance.nfs.io-threads off performance.force-readdirp true performance.cache-invalidation false performance.global-cache-invalidation true features.uss off features.snapshot-directory .snaps features.show-snapshot-directory off features.tag-namespaces off network.compression off network.compression.window-size -15 network.compression.mem-level 8 network.compression.min-size 0 network.compression.compression-level -1 network.compression.debug false features.default-soft-limit 80% features.soft-timeout 60 features.hard-timeout 5 features.alert-time 86400 features.quota-deem-statfs off geo-replication.indexing off geo-replication.indexing off geo-replication.ignore-pid-check off geo-replication.ignore-pid-check off features.quota off features.inode-quota off features.bitrot disable debug.trace off debug.log-history no debug.log-file no debug.exclude-ops (null) debug.include-ops (null) debug.error-gen off debug.error-failure (null) debug.error-number (null) debug.random-failure off debug.error-fops (null) nfs.disable on features.read-only off features.worm off features.worm-file-level off features.worm-files-deletable on features.default-retention-period 120 features.retention-mode relax features.auto-commit-period 180 storage.linux-aio off storage.batch-fsync-mode reverse-fsync storage.batch-fsync-delay-usec 0 storage.owner-uid 36 storage.owner-gid 36 storage.node-uuid-pathinfo off storage.health-check-interval 30 storage.build-pgfid off storage.gfid2path on storage.gfid2path-separator : storage.reserve 1 storage.health-check-timeout 10 storage.fips-mode-rchecksum off 
storage.force-create-mode 0000 storage.force-directory-mode 0000 storage.create-mask 0777 storage.create-directory-mask 0777 storage.max-hardlinks 100 features.ctime on config.gfproxyd off cluster.server-quorum-type server cluster.server-quorum-ratio 0 changelog.changelog off changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs changelog.encoding ascii changelog.rollover-time 15 changelog.fsync-interval 5 changelog.changelog-barrier-timeout 120 changelog.capture-del-path off features.barrier disable features.barrier-timeout 120 features.trash off features.trash-dir .trashcan features.trash-eliminate-path (null) features.trash-max-filesize 5MB features.trash-internal-op off cluster.enable-shared-storage disable locks.trace off locks.mandatory-locking off cluster.disperse-self-heal-daemon enable cluster.quorum-reads no client.bind-insecure (null) features.shard on features.shard-block-size 64MB features.shard-lru-limit 16384 features.shard-deletion-rate 100 features.scrub-throttle lazy features.scrub-freq biweekly features.scrub false features.expiry-time 120 features.cache-invalidation off features.cache-invalidation-timeout 60 features.leases off features.lease-lock-recall-timeout 60 disperse.background-heals 8 disperse.heal-wait-qlength 128 cluster.heal-timeout 600 dht.force-readdirp on disperse.read-policy gfid-hash cluster.shd-max-threads 8 cluster.shd-wait-qlength 10000 cluster.locking-scheme granular cluster.granular-entry-heal no features.locks-revocation-secs 0 features.locks-revocation-clear-all false features.locks-revocation-max-blocked 0 features.locks-monkey-unlocking false features.locks-notify-contention no features.locks-notify-contention-delay 5 disperse.shd-max-threads 1 disperse.shd-wait-qlength 1024 disperse.cpu-extensions auto disperse.self-heal-window-size 1 cluster.use-compound-fops off performance.parallel-readdir off performance.rda-request-size 131072 performance.rda-low-wmark 4096 performance.rda-high-wmark 128KB performance.rda-cache-limit 10MB performance.nl-cache-positive-entry false performance.nl-cache-limit 10MB performance.nl-cache-timeout 60 cluster.brick-multiplex off cluster.max-bricks-per-process 250 disperse.optimistic-change-log on disperse.stripe-cache 4 cluster.halo-enabled False cluster.halo-shd-max-latency 99999 cluster.halo-nfsd-max-latency 5 cluster.halo-max-latency 5 cluster.halo-max-replicas 99999 cluster.halo-min-replicas 2 features.selinux on cluster.daemon-log-level INFO debug.delay-gen off delay-gen.delay-percentage 10% delay-gen.delay-duration 100000 delay-gen.enable disperse.parallel-writes on features.sdfs off features.cloudsync off features.ctime on ctime.noatime on feature.cloudsync-storetype (null) features.enforce-mandatory-lock off

strict-o-direct just allows the app to define whether direct I/O is needed and yes, that could be a reason for your data loss. The good thing is that the feature is part of the virt group, and there is an "Optimize for Virt" button somewhere in the UI. Still, I prefer the manual approach to building Gluster volumes, as the UI's primary focus is oVirt (quite natural, right?). Best Regards, Strahil Nikolov
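
For anyone preferring to check and fix this by hand instead of using the "Optimize for Virt" button, a minimal sketch (the volume name "data" is assumed, and whether network.remote-dio should then be turned off depends on the oVirt/Gluster version, so verify against the virt group shipped on your hosts):

# see what the volume currently has
gluster volume get data performance.strict-o-direct

# apply the whole virt option group ...
gluster volume set data group virt
# ... or set the single option explicitly
gluster volume set data performance.strict-o-direct on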

Hi Jaroslaw, it's more important to find the root cause of the data loss, as this is definitely not supposed to happen (I have been through several power outages myself without issues). Do you keep the logs? For now, check whether your gluster settings (gluster volume info VOL) match the settings in the virt group (/var/lib/glusterd/groups/virt - or something like that). Best Regards, Strahil Nikolov
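
A quick way to do that comparison, assuming the group file uses the usual one-option-per-line "option=value" format and the volume is called "data" (both assumptions - adjust to the actual setup):

# what oVirt's virt profile would set
cat /var/lib/glusterd/groups/virt

# what the volume actually has, option by option
while IFS='=' read -r opt val; do
    printf '%-45s expected=%-10s actual=' "$opt" "$val"
    gluster volume get data "$opt" | awk -v o="$opt" '$1 == o {print $2}'
done < /var/lib/glusterd/groups/virt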

Hi Strahil, I remember during after creating the volume I applied the virt group to it. Volume info: ---------------- Volume Name: data Type: Replicate Volume ID: 05842cd6-7f16-4329-9ffd-64a0b4366fbe Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: host1storage:/gluster_bricks/data/data Brick2: host2storage:/gluster_bricks/data/data Brick3: host3storage:/gluster_bricks/data/data Options Reconfigured: performance.client-io-threads: on nfs.disable: on transport.address-family: inet storage.owner-gid: 36 storage.owner-uid: 36 performance.quick-read: off performance.read-ahead: off performance.io-cache: off performance.low-prio-threads: 32 network.remote-dio: enable cluster.eager-lock: enable cluster.quorum-type: auto cluster.server-quorum-type: server cluster.data-self-heal-algorithm: full cluster.locking-scheme: granular cluster.shd-max-threads: 8 cluster.shd-wait-qlength: 10000 features.shard: on user.cifs: off cluster.choose-local: off client.event-threads: 4 server.event-threads: 4 I do not have full logs but I have some saved. /var/log/messages: ------------------------- Sep 14 08:36:20 host1 vdsm[4301]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=fd5123cb-9364-448d-b41c-8a48fb1826c5 at 0x7f1244099050> timeout=30.0, duration=0.01 at 0x7f1244099210>#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 315, in __call__#012 self._execute()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 357, in _execute#012 self._vm.updateDriveVolume(drive)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4189, in updateDriveVolume#012 vmDrive.volumeID)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6101, in _getVolumeSize#012 (domainID, volumeID))#012StorageUnavailableError: Unable to get volume size for doma in 88f5972f-58bd-469f-bc77-5bf3b1802291 volume cdf313d7-bed3-4fae-a803-1297cdf8c82f Sep 14 08:37:20 host1 vdsm[4301]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=fd5123cb-9364-448d-b41c-8a48fb1826c5 at 0x7f12442efad0> timeout=30.0, duration=0.00 at 0x7f1244078490>#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 315, in __call__#012 self._execute()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 357, in _execute#012 self._vm.updateDriveVolume(drive)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4189, in updateDriveVolume#012 vmDrive.volumeID)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6101, in _getVolumeSize#012 (domainID, volumeID))#012StorageUnavailableError: Unable to get volume size for doma in 88f5972f-58bd-469f-bc77-5bf3b1802291 volume cdf313d7-bed3-4fae-a803-1297cdf8c82f Sep 14 08:38:20 host1 vdsm[4301]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=fd5123cb-9364-448d-b41c-8a48fb1826c5 at 0x7f12045aa550> timeout=30.0, duration=0.00 at 0x7f12045aaa90>#012Traceback (most recent call last):#012 File 
"/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 315, in __call__#012 self._execute()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 357, in _execute#012 self._vm.updateDriveVolume(drive)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4189, in updateDriveVolume#012 vmDrive.volumeID)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6101, in _getVolumeSize#012 (domainID, volumeID))#012StorageUnavailableError: Unable to get volume size for doma in 88f5972f-58bd-469f-bc77-5bf3b1802291 volume cdf313d7-bed3-4fae-a803-1297cdf8c82f Sep 14 08:39:20 host1 vdsm[4301]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=fd5123cb-9364-448d-b41c-8a48fb1826c5 at 0x7f12441d6f50> timeout=30.0, duration=0.01 at 0x7f1287f189d0>#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 315, in __call__#012 self._execute()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 357, in _execute#012 self._vm.updateDriveVolume(drive)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4189, in updateDriveVolume#012 vmDrive.volumeID)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6101, in _getVolumeSize#012 (domainID, volumeID))#012StorageUnavailableError: Unable to get volume size for doma in 88f5972f-58bd-469f-bc77-5bf3b1802291 volume cdf313d7-bed3-4fae-a803-1297cdf8c82f Sep 14 08:40:01 host1 systemd: Created slice User Slice of root. Sep 14 08:40:01 host1 systemd: Started Session 52513 of user root. Sep 14 08:40:01 host1 systemd: Removed slice User Slice of root. Sep 14 08:40:20 host1 vdsm[4301]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=fd5123cb-9364-448d-b41c-8a48fb1826c5 at 0x7f12442ef290> timeout=30.0, duration=0.00 at 0x7f11e73d5b50>#012Traceback (most recent call last):#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012 File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 315, in __call__#012 self._execute()#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 357, in _execute#012 self._vm.updateDriveVolume(drive)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4189, in updateDriveVolume#012 vmDrive.volumeID)#012 File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6101, in _getVolumeSize#012 (domainID, volumeID))#012StorageUnavailableError: Unable to get volume size for doma in 88f5972f-58bd-469f-bc77-5bf3b1802291 volume cdf313d7-bed3-4fae-a803-1297cdf8c82f Sep 14 08:40:42 host1 systemd: Created slice User Slice of root. Sep 14 08:40:42 host1 systemd: Started Session c69119 of user root. Sep 14 08:40:42 host1 systemd: Removed slice User Slice of root. 
rhev-data-center-mnt-glusterSD-host1storage::_data.log ---------------------------------------------------------------- 2020-09-14 08:40:39.648159] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-2: disconnected from data-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:40:39.648183] W [MSGID: 108001] [afr-common.c:5613:afr_notify] 0-data-replicate-0: Client-quorum is not met [2020-09-14 08:40:50.243632] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-data-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. [2020-09-14 08:40:50.243686] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-2: disconnected from data-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:41:05.147093] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-1: disconnected from data-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:41:05.147139] E [MSGID: 108006] [afr-common.c:5323:__afr_handle_child_down_event] 0-data-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up. [2020-09-14 08:41:05.156727] I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up The message "I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up" repeated 195 times between [2020-09-14 08:41:05.156727] and [2020-09-14 08:41:15.222500] [2020-09-14 08:41:15.288560] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-data-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. [2020-09-14 08:41:15.288608] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-1: disconnected from data-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:41:15.722793] I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up The message "I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up" repeated 170 times between [2020-09-14 08:41:15.722793] and [2020-09-14 08:41:24.230828] [2020-09-14 08:41:24.731222] I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up [2020-09-14 08:41:41.352791] I [socket.c:811:__socket_shutdown] 0-data-client-2: intentional socket shutdown(6) [2020-09-14 08:41:44.363054] I [socket.c:811:__socket_shutdown] 0-data-client-0: intentional socket shutdown(6) [2020-09-14 08:42:42.512364] I [socket.c:811:__socket_shutdown] 0-data-client-1: intentional socket shutdown(6) The message "I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up" repeated 708 times between [2020-09-14 08:41:24.731222] and [2020-09-14 08:43:23.874840] [2020-09-14 08:43:26.411190] I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up ... ... ... 
[2020-09-14 12:19:59.517980] I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up [2020-09-14 12:20:29.093669] W [socket.c:721:__socket_rwv] 0-glusterfs: readv on 192.168.0.101:24007 failed (No data available) [2020-09-14 12:20:29.093695] I [glusterfsd-mgmt.c:2443:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: host1storage.mydomain.com [2020-09-14 12:20:29.093707] I [glusterfsd-mgmt.c:2483:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server host2.mydomain.com [2020-09-14 12:20:39.630645] I [socket.c:811:__socket_shutdown] 0-data-client-2: intentional socket shutdown(7) [2020-09-14 12:20:40.635471] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-data-client-1: changing port to 49152 (from 0) [2020-09-14 12:20:40.639281] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-data-client-1: Connected to data-client-1, attached to remote volume '/gluster_bricks/data/data'. [2020-09-14 12:20:40.639299] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-data-client-1: 5 fds open - Delaying child_up until they are re-opened [2020-09-14 12:20:40.639945] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-data-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP [2020-09-14 12:20:40.639967] I [MSGID: 108005] [afr-common.c:5245:__afr_handle_child_up_event] 0-data-replicate-0: Subvolume 'data-client-1' came back up; going online. [2020-09-14 12:20:42.640257] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-data-client-0: changing port to 49152 (from 0) [2020-09-14 12:20:42.640288] I [socket.c:811:__socket_shutdown] 0-data-client-0: intentional socket shutdown(10) [2020-09-14 12:20:42.643664] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-data-client-0: Connected to data-client-0, attached to remote volume '/gluster_bricks/data/data'. [2020-09-14 12:20:42.643683] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-data-client-0: 5 fds open - Delaying child_up until they are re-opened [2020-09-14 12:20:42.644327] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-data-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP [2020-09-14 12:20:42.644347] I [MSGID: 108002] [afr-common.c:5607:afr_notify] 0-data-replicate-0: Client-quorum is met [2020-09-14 12:20:42.843176] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-data-client-2: changing port to 49152 (from 0) [2020-09-14 12:20:42.846562] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-data-client-2: Connected to data-client-2, attached to remote volume '/gluster_bricks/data/data'. [2020-09-14 12:20:42.846598] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-data-client-2: 5 fds open - Delaying child_up until they are re-opened [2020-09-14 12:20:42.847429] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-data-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP [2020-09-14 12:20:49.640811] I [MSGID: 133022] [shard.c:3674:shard_delete_shards] 0-data-shard: Deleted shards of gfid=18250f19-3820-4a98-9c49-37ba23c08dfd from backend [2020-09-14 12:20:50.244771] E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk] 0-data-client-1: remote operation failed. Path: /88f5972f-58bd-469f-bc77-5bf3b1802291/images (5fdbb512-e924-4945-a633-10820133e5ff) [Permission denied] [2020-09-14 12:20:50.244829] E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk] 0-data-client-2: remote operation failed. 
Path: /88f5972f-58bd-469f-bc77-5bf3b1802291/images (5fdbb512-e924-4945-a633-10820133e5ff) [Permission denied] [2020-09-14 12:20:50.244880] W [MSGID: 114061] [client-common.c:3325:client_pre_readdirp_v2] 0-data-client-1: (5fdbb512-e924-4945-a633-10820133e5ff) remote_fd is -1. EBADFD [File descriptor in bad state] [2020-09-14 12:21:20.364907] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-1: remote operation failed. Path: /88f5972f-58bd-469f-bc77-5bf3b1802291/images/4e79467c-4707-4e7d-8a2a-909b208d4b97 (b0f8bb67-09e5-431b-acea-c03f0280fb34) [Permission denied] [2020-09-14 12:21:20.365181] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-2: remote operation failed. Path: /88f5972f-58bd-469f-bc77-5bf3b1802291/images/4e79467c-4707-4e7d-8a2a-909b208d4b97 (b0f8bb67-09e5-431b-acea-c03f0280fb34) [Permission denied] [2020-09-14 12:21:20.368381] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied] [2020-09-14 12:21:20.368401] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied] [2020-09-14 12:21:20.368462] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-data-replicate-0: no read subvols for /88f5972f-58bd-469f-bc77-5bf3b1802291/images/4e79467c-4707-4e7d-8a2a-909b208d4b97 [2020-09-14 12:21:20.369001] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-1: remote operation failed. Path: /88f5972f-58bd-469f-bc77-5bf3b1802291/images/4e79467c-4707-4e7d-8a2a-909b208d4b97 (00000000-0000-0000-0000-000000000000) [Permission denied] [2020-09-14 12:21:20.369002] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-2: remote operation failed. Path: /88f5972f-58bd-469f-bc77-5bf3b1802291/images/4e79467c-4707-4e7d-8a2a-909b208d4b97 (00000000-0000-0000-0000-000000000000) [Permission denied] [2020-09-14 12:21:20.370979] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied] [2020-09-14 12:21:20.371009] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-data-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied] [2020-09-14 12:21:20.371075] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-data-replicate-0: no read subvols for /88f5972f-58bd-469f-bc77-5bf3b1802291/images/4e79467c-4707-4e7d-8a2a-909b208d4b97 The message "I [MSGID: 108006] [afr-common.c:5669:afr_local_init] 0-data-replicate-0: no subvolumes up" repeated 102 times between [2020-09-14 12:19:59.517980] and [2020-09-14 12:20:39.539124] [2020-09-14 12:22:11.510751] I [MSGID: 133022] [shard.c:3674:shard_delete_shards] 0-data-shard: Deleted shards of gfid=40104663-41d6-4210-b9aa-065e0ba48c1f from backend [2020-09-14 12:22:11.853348] E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk] 0-data-client-2: remote operation failed. 
Path: /88f5972f-58bd-469f-bc77-5bf3b1802291/images (5fdbb512-e924-4945-a633-10820133e5ff) [Permission denied] glusterd.log ---------------- [2020-09-14 08:40:39.750074] E [socket.c:2282:__socket_read_frag] 0-rpc: wrong MSG-TYPE (9) received from 10.0.0.116:30052 [2020-09-14 08:42:36.920865] C [rpcsvc.c:1029:rpcsvc_notify] 0-rpcsvc: got MAP_XID event, which should have not come [2020-09-14 08:42:56.426832] E [socket.c:2282:__socket_read_frag] 0-rpc: wrong MSG-TYPE (3866624) received from 10.0.0.116:53501 [2020-09-14 08:48:50.776839] E [socket.c:2282:__socket_read_frag] 0-rpc: wrong MSG-TYPE (-66911352) received from 10.0.0.116:46153 [2020-09-14 09:00:51.692183] E [socket.c:2282:__socket_read_frag] 0-rpc: wrong MSG-TYPE (7602176) received from 10.0.0.116:64053 [2020-09-14 12:17:50.501625] W [MSGID: 101095] [xlator.c:210:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/6.6/xlator/encryption/crypt.so: cannot open shared object file: No such file or directory [2020-09-14 12:17:50.508662] E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.6/rpc-transport/socket.so: undefined symbol: xlator_api [2020-09-14 12:17:50.510694] W [MSGID: 101095] [xlator.c:210:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/6.6/xlator/nfs/server.so: cannot open shared object file: No such file or directory [2020-09-14 12:17:50.518341] W [MSGID: 101095] [xlator.c:210:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/6.6/xlator/storage/bd.so: cannot open shared object file: No such file or directory [2020-09-14 12:18:08.102606] I [MSGID: 106499] [glusterd-handler.c:4429:__glusterd_handle_status_volume] 0-management: Received status volume req for volume data [2020-09-14 12:18:08.118107] I [MSGID: 106499] [glusterd-handler.c:4429:__glusterd_handle_status_volume] 0-management: Received status volume req for volume engine [2020-09-14 12:18:37.847581] E [MSGID: 106537] [glusterd-volume-ops.c:1763:glusterd_op_stage_start_volume] 0-management: Volume engine already started [2020-09-14 12:18:37.847606] W [MSGID: 106121] [glusterd-mgmt.c:178:gd_mgmt_v3_pre_validate_fn] 0-management: Volume start prevalidation failed. 
[2020-09-14 12:18:37.847618] E [MSGID: 106121] [glusterd-mgmt.c:1079:glusterd_mgmt_v3_pre_validate] 0-management: Pre Validation failed for operation Start on local node [2020-09-14 12:18:37.847625] E [MSGID: 106121] [glusterd-mgmt.c:2457:glusterd_mgmt_v3_initiate_all_phases] 0-management: Pre Validation Failed The message "W [MSGID: 101095] [xlator.c:210:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/6.6/xlator/encryption/crypt.so: cannot open shared object file: No such file or directory" repeated 2 times between [2020-09-14 12:17:50.501625] and [2020-09-14 12:17:50.501663] The message "E [MSGID: 101097] [xlator.c:218:xlator_volopt_dynload] 0-xlator: dlsym(xlator_api) missing: /usr/lib64/glusterfs/6.6/rpc-transport/socket.so: undefined symbol: xlator_api" repeated 7 times between [2020-09-14 12:17:50.508662] and [2020-09-14 12:17:50.508709] The message "W [MSGID: 101095] [xlator.c:210:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/6.6/xlator/nfs/server.so: cannot open shared object file: No such file or directory" repeated 30 times between [2020-09-14 12:17:50.510694] and [2020-09-14 12:17:50.510909] [2020-09-14 12:19:41.946661] I [MSGID: 106487] [glusterd-handler.c:1516:__glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req [2020-09-14 12:19:49.871903] I [MSGID: 106533] [glusterd-volume-ops.c:982:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume engine [2020-09-14 12:20:29.091554] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e65) [0x7fabf0a81e65] -->/usr/sbin/glusterd(glusterfs_sigwaiter+0xe5) [0x5643e96971f5] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x5643e969705b] ) 0-: received signum (15), shutting down [2020-09-14 12:20:29.123582] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 6.6 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO) [2020-09-14 12:20:29.124349] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 33290 [2020-09-14 12:20:29.129801] I [MSGID: 106478] [glusterd.c:1422:init] 0-management: Maximum allowed open file descriptors set to 65536 [2020-09-14 12:20:29.129827] I [MSGID: 106479] [glusterd.c:1478:init] 0-management: Using /var/lib/glusterd as working directory [2020-09-14 12:20:29.129834] I [MSGID: 106479] [glusterd.c:1484:init] 0-management: Using /var/run/gluster as pid file working directory [2020-09-14 12:20:29.133500] I [socket.c:961:__socket_server_bind] 0-socket.management: process started listening on port (24007) [2020-09-14 12:20:29.136005] W [MSGID: 103071] [rdma.c:4472:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device] [2020-09-14 12:20:29.136024] W [MSGID: 103055] [rdma.c:4782:init] 0-rdma.management: Failed to initialize IB Device [2020-09-14 12:20:29.136032] W [rpc-transport.c:363:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed [2020-09-14 12:20:29.136102] W [rpcsvc.c:1985:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed [2020-09-14 12:20:29.136112] E [MSGID: 106244] [glusterd.c:1785:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport [2020-09-14 12:20:29.137180] I [socket.c:904:__socket_server_bind] 0-socket.management: closing (AF_UNIX) reuse check socket 12 [2020-09-14 12:20:29.137478] I [MSGID: 106059] [glusterd.c:1865:init] 0-management: max-port override: 60999 [2020-09-14 12:20:30.969620] I [MSGID: 106513] 
[glusterd-store.c:2394:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 60000 [2020-09-14 12:20:31.061282] I [MSGID: 106544] [glusterd.c:152:glusterd_uuid_init] 0-management: retrieved UUID: 6fe3d6e3-ab45-4004-af59-93c2fc3afc93 [2020-09-14 12:20:31.086867] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 [2020-09-14 12:20:31.087760] I [MSGID: 106498] [glusterd-handler.c:3687:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0 [2020-09-14 12:20:31.087814] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout [2020-09-14 12:20:31.087844] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2020-09-14 12:20:31.091059] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2020-09-14 12:20:31.091052] W [MSGID: 106061] [glusterd-handler.c:3490:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout [2020-09-14 12:20:31.095394] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2020-09-14 12:20:40.073223] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 1c2f1776-4830-4b6f-8e9e-c29d0133a0e9, host: myhost3.mydomain.com, port: 0 [2020-09-14 12:20:40.076519] C [MSGID: 106003] [glusterd-server-quorum.c:348:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume data. Starting local bricks. [2020-09-14 12:20:40.077376] I [glusterd-utils.c:6312:glusterd_brick_start] 0-management: starting a fresh brick process for brick /gluster_bricks/data/data [2020-09-14 12:20:40.080459] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2020-09-14 12:20:40.093965] C [MSGID: 106003] [glusterd-server-quorum.c:348:glusterd_do_volume_quorum_action] 0-management: Server quorum regained for volume engine. Starting local bricks. 
[2020-09-14 12:20:40.094129] I [glusterd-utils.c:6312:glusterd_brick_start] 0-management: starting a fresh brick process for brick /gluster_bricks/engine/engine [2020-09-14 12:20:40.095873] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2020-09-14 12:20:40.112487] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 [2020-09-14 12:20:40.125181] I [MSGID: 106493] [glusterd-rpc-ops.c:468:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: b5a0980f-015b-4720-9b5b-7792ac022b54, host: myhost2.mydomain.com, port: 0 [2020-09-14 12:20:40.163361] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-nfs: setting frame-timeout to 600 [2020-09-14 12:20:40.163464] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped [2020-09-14 12:20:40.163482] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped [2020-09-14 12:20:40.163493] I [MSGID: 106599] [glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2020-09-14 12:20:40.163520] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-glustershd: setting frame-timeout to 600 [2020-09-14 12:20:40.165762] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 28027 [2020-09-14 12:20:40.170812] I [MSGID: 106492] [glusterd-handler.c:2796:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 1c2f1776-4830-4b6f-8e9e-c29d0133a0e9 [2020-09-14 12:20:40.170841] I [MSGID: 106502] [glusterd-handler.c:2837:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2020-09-14 12:20:40.175230] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b5a0980f-015b-4720-9b5b-7792ac022b54 [2020-09-14 12:20:40.175329] I [MSGID: 106492] [glusterd-handler.c:2796:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b5a0980f-015b-4720-9b5b-7792ac022b54 [2020-09-14 12:20:40.175353] I [MSGID: 106502] [glusterd-handler.c:2837:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2020-09-14 12:20:40.183400] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 60000 [2020-09-14 12:20:40.188715] I [MSGID: 106490] [glusterd-handler.c:2611:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: b5a0980f-015b-4720-9b5b-7792ac022b54 [2020-09-14 12:20:40.193866] I [MSGID: 106493] [glusterd-handler.c:3883:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to myhost2.mydomain.com (0), ret: 0, op_ret: 0 [2020-09-14 12:20:40.198118] I [MSGID: 106492] [glusterd-handler.c:2796:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: b5a0980f-015b-4720-9b5b-7792ac022b54 [2020-09-14 12:20:40.198144] I [MSGID: 106502] [glusterd-handler.c:2837:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2020-09-14 12:20:40.200072] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /gluster_bricks/data/data on port 49152 [2020-09-14 12:20:40.200198] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /gluster_bricks/engine/engine on port 49153 [2020-09-14 12:20:40.200245] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: b5a0980f-015b-4720-9b5b-7792ac022b54 
[2020-09-14 12:20:40.875962] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 1c2f1776-4830-4b6f-8e9e-c29d0133a0e9 [2020-09-14 12:20:40.888913] I [MSGID: 106163] [glusterd-handshake.c:1389:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 60000 [2020-09-14 12:20:40.892888] I [MSGID: 106490] [glusterd-handler.c:2611:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 1c2f1776-4830-4b6f-8e9e-c29d0133a0e9 [2020-09-14 12:20:40.896510] I [MSGID: 106493] [glusterd-handler.c:3883:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to myhost3.mydomain.com (0), ret: 0, op_ret: 0 [2020-09-14 12:20:40.928796] I [MSGID: 106492] [glusterd-handler.c:2796:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 1c2f1776-4830-4b6f-8e9e-c29d0133a0e9 [2020-09-14 12:20:40.928820] I [MSGID: 106502] [glusterd-handler.c:2837:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend [2020-09-14 12:20:40.997183] I [MSGID: 106493] [glusterd-rpc-ops.c:681:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 1c2f1776-4830-4b6f-8e9e-c29d0133a0e9 [2020-09-14 12:20:41.165913] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped [2020-09-14 12:20:41.165965] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service [2020-09-14 12:20:42.168619] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600 [2020-09-14 12:20:42.168901] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: quotad already stopped [2020-09-14 12:20:42.168917] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: quotad service is stopped [2020-09-14 12:20:42.168942] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-bitd: setting frame-timeout to 600 [2020-09-14 12:20:42.169151] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped [2020-09-14 12:20:42.169164] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped [2020-09-14 12:20:42.169193] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-scrub: setting frame-timeout to 600 [2020-09-14 12:20:42.169377] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped [2020-09-14 12:20:42.169388] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped [2020-09-14 12:20:42.169420] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2020-09-14 12:20:42.169506] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2020-09-14 12:20:42.169586] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600 [2020-09-14 12:20:42.169670] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600 [2020-09-14 12:20:42.169797] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600 [2020-09-14 12:20:42.169928] I [rpc-clnt.c:1005:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600 [2020-09-14 12:20:42.170136] I [glusterd-utils.c:6225:glusterd_brick_start] 0-management: discovered already-running brick /gluster_bricks/data/data [2020-09-14 12:20:42.170151] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /gluster_bricks/data/data on port 49152 
[2020-09-14 12:20:42.182826] I [glusterd-utils.c:6225:glusterd_brick_start] 0-management: discovered already-running brick /gluster_bricks/engine/engine [2020-09-14 12:20:42.182844] I [MSGID: 106142] [glusterd-pmap.c:290:pmap_registry_bind] 0-pmap: adding brick /gluster_bricks/engine/engine on port 49153 [2020-09-14 12:20:42.207292] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped [2020-09-14 12:20:42.207314] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped [2020-09-14 12:20:42.207324] I [MSGID: 106599] [glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed [2020-09-14 12:20:42.209056] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 33647 [2020-09-14 12:20:42.210310] I [MSGID: 106006] [glusterd-svc-mgmt.c:356:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd. [2020-09-14 12:20:43.209219] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped [2020-09-14 12:20:43.209621] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service [2020-09-14 12:20:44.212440] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: quotad already stopped [2020-09-14 12:20:44.212490] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: quotad service is stopped [2020-09-14 12:20:44.212691] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped [2020-09-14 12:20:44.212712] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped [2020-09-14 12:20:44.212897] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped [2020-09-14 12:20:44.212911] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped [2020-09-14 12:20:44.213829] I [MSGID: 106499] [glusterd-handler.c:4429:__glusterd_handle_status_volume] 0-management: Received status volume req for volume engine [2020-09-14 12:20:46.779903] I [MSGID: 106488] [glusterd-handler.c:1577:__glusterd_handle_cli_get_volume] 0-management: Received get vol req The message "I [MSGID: 106488] [glusterd-handler.c:1577:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 5 times between [2020-09-14 12:20:46.779903] and [2020-09-14 12:20:48.081053] [2020-09-14 12:20:54.612834] I [MSGID: 106533] [glusterd-volume-ops.c:982:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume engine [2020-09-14 12:21:01.461754] I [MSGID: 106533] [glusterd-volume-ops.c:982:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume data [2020-09-14 12:21:37.658372] I [MSGID: 106488] [glusterd-handler.c:1577:__glusterd_handle_cli_get_volume] 0-management: Received get vol req The message "I [MSGID: 106488] [glusterd-handler.c:1577:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 8 times between [2020-09-14 12:21:37.658372] and [2020-09-14 12:22:28.492880] [2020-09-14 12:22:49.552197] I [MSGID: 106488] [glusterd-handler.c:1577:__glusterd_handle_cli_get_volume] 0-management: Received get vol req [2020-09-14 12:23:39.208008] I [MSGID: 106533] [glusterd-volume-ops.c:982:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for 
volume data [2020-09-14 13:23:42.678379] I [MSGID: 106488] [glusterd-handler.c:1577:__glusterd_handle_cli_get_volume] 0-management: Received get vol req [2020-09-14 13:23:43.038221] I [MSGID: 106488] [glusterd-handler.c:1577:__glusterd_handle_cli_get_volume] 0-management: Received get vol req glustersh.log ----------------- [2020-09-14 08:40:09.990714] I [socket.c:811:__socket_shutdown] 0-data-client-0: intentional socket shutdown(5) [2020-09-14 08:40:15.005498] I [socket.c:811:__socket_shutdown] 0-engine-client-0: intentional socket shutdown(5) [2020-09-14 08:40:27.068884] W [socket.c:721:__socket_rwv] 0-engine-client-2: readv on 192.168.0.103:49153 failed (No data available) [2020-09-14 08:40:27.068922] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:40:27.068954] W [MSGID: 108001] [afr-common.c:5613:afr_notify] 0-engine-replicate-0: Client-quorum is not met [2020-09-14 08:40:37.082276] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-engine-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. [2020-09-14 08:40:39.648157] W [socket.c:721:__socket_rwv] 0-data-client-2: readv on 192.168.0.103:49152 failed (No data available) [2020-09-14 08:40:39.648190] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-2: disconnected from data-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:40:39.648217] W [MSGID: 108001] [afr-common.c:5613:afr_notify] 0-data-replicate-0: Client-quorum is not met [2020-09-14 08:40:50.133996] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-data-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. [2020-09-14 08:40:50.134055] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-2: disconnected from data-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:41:05.100136] W [socket.c:721:__socket_rwv] 0-engine-client-1: readv on 192.168.0.102:49153 failed (No data available) [2020-09-14 08:41:05.100176] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-engine-client-1: disconnected from engine-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:41:05.100193] E [MSGID: 108006] [afr-common.c:5323:__afr_handle_child_down_event] 0-engine-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up. [2020-09-14 08:41:05.147135] W [socket.c:721:__socket_rwv] 0-data-client-1: readv on 192.168.0.102:49152 failed (No data available) [2020-09-14 08:41:05.147165] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-1: disconnected from data-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:41:05.147179] E [MSGID: 108006] [afr-common.c:5323:__afr_handle_child_down_event] 0-data-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up. [2020-09-14 08:41:15.279570] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-engine-client-1: failed to get the port number for remote subvolume. 
Please run 'gluster volume status' on server to see if brick process is running. [2020-09-14 08:41:15.279610] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-engine-client-1: disconnected from engine-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:41:15.281851] E [MSGID: 114058] [client-handshake.c:1449:client_query_portmap_cbk] 0-data-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running. [2020-09-14 08:41:15.281891] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-data-client-1: disconnected from data-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:40:37.082339] I [MSGID: 114018] [client.c:2331:client_rpc_notify] 0-engine-client-2: disconnected from engine-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2020-09-14 08:42:16.707442] I [socket.c:811:__socket_shutdown] 0-data-client-0: intentional socket shutdown(5) [2020-09-14 08:42:21.745058] I [socket.c:811:__socket_shutdown] 0-engine-client-0: intentional socket shutdown(5) [2020-09-14 08:42:40.886787] I [socket.c:811:__socket_shutdown] 0-engine-client-2: intentional socket shutdown(5) ... ... [2020-09-14 12:20:29.094083] W [socket.c:721:__socket_rwv] 0-glusterfs: readv on 127.0.0.1:24007 failed (No data available) [2020-09-14 12:20:29.094148] I [glusterfsd-mgmt.c:2443:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost [2020-09-14 12:20:34.927760] I [socket.c:811:__socket_shutdown] 0-engine-client-2: intentional socket shutdown(5) [2020-09-14 12:20:39.959928] I [glusterfsd-mgmt.c:2019:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing [2020-09-14 12:20:40.166029] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e65) [0x7f5f21f62e65] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x557934e2e1f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x557934e2e05b] ) 0-: received signum (15), shutting down [2020-09-14 12:20:41.186725] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 6.6 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/9713eea6455553ca.socket --xlator-option *replicate*.node-uuid=6fe3d6e3-ab45-4004-af59-93c2fc3afc93 --process-name glustershd --client-pid=-6) [2020-09-14 12:20:41.187235] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 33647 [2020-09-14 12:20:41.191605] I [socket.c:904:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9 [2020-09-14 12:20:41.196867] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2020-09-14 12:20:41.196870] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2020-09-14 12:20:42.209282] W [glusterfsd.c:1570:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e65) [0x7f995868ee65] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55963354e1f5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55963354e05b] ) 0-: received signum (15), shutting down [2020-09-14 12:20:43.236232] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 6.6 (args: /usr/sbin/glusterfs -s 
localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/9713eea6455553ca.socket --xlator-option *replicate*.node-uuid=6fe3d6e3-ab45-4004-af59-93c2fc3afc93 --process-name glustershd --client-pid=-6) [2020-09-14 12:20:43.236667] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 33705 [2020-09-14 12:20:43.240743] I [socket.c:904:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9 [2020-09-14 12:20:43.246262] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2020-09-14 12:20:43.246283] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2020-09-14 12:20:44.217875] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2 [2020-09-14 12:20:44.217962] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3 [2020-09-14 12:20:44.221374] I [MSGID: 114020] [client.c:2401:notify] 0-data-client-0: parent translators are ready, attempting connect on transport [2020-09-14 12:20:44.224361] I [MSGID: 114020] [client.c:2401:notify] 0-data-client-1: parent translators are ready, attempting connect on transport [2020-09-14 12:20:44.224637] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-data-client-0: changing port to 49152 (from 0) [2020-09-14 12:20:44.224673] I [socket.c:811:__socket_shutdown] 0-data-client-0: intentional socket shutdown(12) [2020-09-14 12:20:44.227134] I [MSGID: 114020] [client.c:2401:notify] 0-data-client-2: parent translators are ready, attempting connect on transport [2020-09-14 12:20:44.227728] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-data-client-1: changing port to 49152 (from 0) [2020-09-14 12:20:44.227758] I [socket.c:811:__socket_shutdown] 0-data-client-1: intentional socket shutdown(13) [2020-09-14 12:20:44.230217] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-data-client-0: Connected to data-client-0, attached to remote volume '/gluster_bricks/data/data'. [2020-09-14 12:20:44.230242] I [MSGID: 108005] [afr-common.c:5245:__afr_handle_child_up_event] 0-data-replicate-0: Subvolume 'data-client-0' came back up; going online. [2020-09-14 12:20:44.231537] I [MSGID: 114020] [client.c:2401:notify] 0-engine-client-0: parent translators are ready, attempting connect on transport [2020-09-14 12:20:44.232072] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-data-client-2: changing port to 49152 (from 0) [2020-09-14 12:20:44.232112] I [socket.c:811:__socket_shutdown] 0-data-client-2: intentional socket shutdown(15) [2020-09-14 12:20:44.234981] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-data-client-1: Connected to data-client-1, attached to remote volume '/gluster_bricks/data/data'. 
[2020-09-14 12:20:44.235005] I [MSGID: 108002] [afr-common.c:5607:afr_notify] 0-data-replicate-0: Client-quorum is met [2020-09-14 12:20:44.235988] I [MSGID: 114020] [client.c:2401:notify] 0-engine-client-1: parent translators are ready, attempting connect on transport [2020-09-14 12:20:44.236176] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-engine-client-0: changing port to 49153 (from 0) [2020-09-14 12:20:44.236217] I [socket.c:811:__socket_shutdown] 0-engine-client-0: intentional socket shutdown(12) [2020-09-14 12:20:44.239425] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-data-client-2: Connected to data-client-2, attached to remote volume '/gluster_bricks/data/data'. [2020-09-14 12:20:44.240456] I [MSGID: 114020] [client.c:2401:notify] 0-engine-client-2: parent translators are ready, attempting connect on transport [2020-09-14 12:20:44.240960] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-engine-client-1: changing port to 49153 (from 0) [2020-09-14 12:20:44.240987] I [socket.c:811:__socket_shutdown] 0-engine-client-1: intentional socket shutdown(17) [2020-09-14 12:20:44.243360] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-engine-client-0: Connected to engine-client-0, attached to remote volume '/gluster_bricks/engine/engine'. [2020-09-14 12:20:44.243385] I [MSGID: 108005] [afr-common.c:5245:__afr_handle_child_up_event] 0-engine-replicate-0: Subvolume 'engine-client-0' came back up; going online. [2020-09-14 12:20:44.245314] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-engine-client-2: changing port to 49153 (from 0) [2020-09-14 12:20:44.245344] I [socket.c:811:__socket_shutdown] 0-engine-client-2: intentional socket shutdown(15) [2020-09-14 12:20:44.248543] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-engine-client-1: Connected to engine-client-1, attached to remote volume '/gluster_bricks/engine/engine'. [2020-09-14 12:20:44.248564] I [MSGID: 108002] [afr-common.c:5607:afr_notify] 0-engine-replicate-0: Client-quorum is met [2020-09-14 12:20:44.253661] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-engine-client-2: Connected to engine-client-2, attached to remote volume '/gluster_bricks/engine/engine'. [2020-09-14 12:20:44.486491] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on b72e302c-478d-4f90-bdd7-5d542928bcc1. sources=[1] 2 sinks=0 [2020-09-14 12:20:47.661708] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on b0ba8fbc-750e-4a2d-a218-d49500d95d26. sources=[1] 2 sinks=0 [2020-09-14 12:20:53.451995] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on bbfa9185-5383-44f8-a32d-9b1aaa6e43a7. sources=[1] 2 sinks=0 [2020-09-14 12:20:54.148249] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 1f2ee964-e4bc-432e-be43-5e95783cace1. sources=[1] 2 sinks=0 [2020-09-14 12:20:54.148386] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 937a7d82-5b99-487b-9614-3aad7ef2dc97. sources=[1] 2 sinks=0 [2020-09-14 12:20:54.148420] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on b9a47d47-111b-403b-9d05-b741c2878ff6. 
sources=[1] 2 sinks=0 [2020-09-14 12:20:54.443785] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on abe59023-a60b-495e-a1a3-2698dbf747b5. sources=[1] 2 sinks=0 [2020-09-14 12:20:54.647683] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 7d1de36f-d6ff-4a34-91f3-30d69e28cadb. sources=[1] 2 sinks=0 [2020-09-14 12:20:54.796212] I [MSGID: 108026] [afr-self-heal-entry.c:898:afr_selfheal_entry_do] 0-engine-replicate-0: performing entry selfheal on 65fc9ad6-69bb-4d7a-87de-8f286450ea5b [2020-09-14 12:20:54.805989] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-engine-replicate-0: Completed entry selfheal on 65fc9ad6-69bb-4d7a-87de-8f286450ea5b. sources=[0] 1 sinks=2 [2020-09-14 12:20:55.295470] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 7bae68bc-7629-4fa3-8fe4-b13768a98fcc. sources=[1] 2 sinks=0 [2020-09-14 12:21:03.753629] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 7432e014-3990-44fe-9409-9eba4090f547. sources=[1] 2 sinks=0 [2020-09-14 12:21:03.773390] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on cf3abe71-ba89-490a-bb33-5e900cb78967. sources=[1] 2 sinks=0 [2020-09-14 12:21:04.421092] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on f623d6ec-1541-4db2-bc21-35a450d142de. sources=[1] 2 sinks=0 [2020-09-14 12:21:04.503141] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 06874587-6885-4ab3-9cc4-fabb2db99d25. sources=[1] 2 sinks=0 [2020-09-14 12:21:04.505516] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 46c3a7b2-7d5a-432e-857a-dbf332908c89. sources=[1] 2 sinks=0 [2020-09-14 12:21:04.506978] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on f93818dc-0279-4280-856c-151c4eb00f8f. sources=[1] 2 sinks=0 [2020-09-14 12:21:04.507327] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 2fcf8b6a-79a2-4328-bde8-e5e8021c0256. sources=[1] 2 sinks=0 [2020-09-14 12:21:04.507510] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 7f20e494-afc6-4d60-822e-33aa30fe52f7. sources=[1] 2 sinks=0 [2020-09-14 12:21:11.025834] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 4719908b-2613-4976-bdc9-ed10847ea24e. sources=[1] 2 sinks=0 [2020-09-14 12:21:11.692657] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on db070d51-7c16-44f5-8b58-cb923edfc72a. sources=[1] 2 sinks=0 [2020-09-14 12:21:19.500323] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 1f87aefd-684f-4dd3-ad77-a3e47006ef14. sources=[1] 2 sinks=0 [2020-09-14 12:21:19.556305] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 22639cf7-62f5-4198-912f-2b446062ee9b. 
sources=[1] 2 sinks=0 [2020-09-14 12:21:20.147475] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 0a109124-0e71-4912-9475-47be55805c7f. sources=[1] 2 sinks=0 [2020-09-14 12:21:20.369201] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 028b1032-61ab-49e8-a427-b982d22ede87. sources=[1] 2 sinks=0 [2020-09-14 12:21:20.498106] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 7d99c05f-45b7-4c5e-a7a0-29a6eb6d5353. sources=[1] 2 sinks=0 [2020-09-14 12:21:20.627626] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 731016ec-16be-456f-9f94-d545e83fb730. sources=[1] 2 sinks=0 [2020-09-14 12:21:20.932841] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 54a4dd15-f2ed-4cee-b52c-d76c3a14ee1d. sources=[1] 2 sinks=0 [2020-09-14 12:21:22.950551] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on d9f8db65-bbf6-43d2-a58d-02062e68b094. sources=[1] 2 sinks=0 [2020-09-14 12:21:27.591944] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 2a369725-e692-4985-b1e6-c2ca919977f7. sources=[1] 2 sinks=0 [2020-09-14 12:21:27.811709] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 220f5b3f-7f1f-4afa-8e02-7039dfc46c71. sources=[1] 2 sinks=0 [2020-09-14 12:21:27.847105] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on cf06956a-4726-4a62-9800-310b860ad8c0. sources=[1] 2 sinks=0 [2020-09-14 12:21:27.850830] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on c25c8901-a0f6-40fe-92a2-c616e9bad32e. sources=[1] 2 sinks=0 [2020-09-14 12:21:28.051319] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on ad151cc0-f703-4ba8-8d88-9f21b9b808af. sources=[1] 2 sinks=0 [2020-09-14 12:21:28.254506] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 24a52f25-a440-431a-9ed2-9b7bfa640507. sources=[1] 2 sinks=0 [2020-09-14 12:21:28.436180] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 7f56b5c9-632c-4112-9c9e-bf62b8af09ff. sources=[1] 2 sinks=0 [2020-09-14 12:21:30.127830] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 0f96020d-b860-45ab-9e1d-4ca1839e19f6. sources=[1] 2 sinks=0 [2020-09-14 12:21:31.986013] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on b3b7dc3b-8429-4cb6-8f80-4fe7c23715ac. sources=[1] 2 sinks=0 [2020-09-14 12:21:32.928726] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on d0c4461a-0fa2-439c-ab81-8b97c16a9a8b. sources=[1] 2 sinks=0 [2020-09-14 12:21:33.096508] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 5dff0bbe-cfc6-458f-871a-47cfebfa018a. 
sources=[1] 2 sinks=0 [2020-09-14 12:21:33.394394] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on f9705bf7-4213-45ca-a17a-f1ebbf11f9ce. sources=[1] 2 sinks=0 [2020-09-14 12:21:38.089910] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on ebe1cce3-3abd-42d3-9504-9e7893721e4b. sources=[1] 2 sinks=0 [2020-09-14 12:21:43.452533] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on ee204b66-5890-469b-a796-7d877b4bf1ab. sources=[1] 2 sinks=0 [2020-09-14 12:21:43.454253] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on ea047187-b01d-49e2-8d02-c38dbf9e77b5. sources=[1] 2 sinks=0 [2020-09-14 12:21:43.710631] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on ce6556b5-a644-4634-91fa-bc70765dbc5a. sources=[1] 2 sinks=0 [2020-09-14 12:21:44.641002] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 96691acb-b98f-4913-a261-c06e15ca18d1. sources=[1] 2 sinks=0 [2020-09-14 12:21:45.332554] I [MSGID: 108026] [afr-self-heal-common.c:1742:afr_log_selfheal] 0-data-replicate-0: Completed data selfheal on 8a84ab8b-3890-4bf6-bd49-ebd8d9b96ce2. sources=[1] 2 sinks=0
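For anyone digging through similar logs, a quick first check is whether all bricks and the self-heal daemon are actually up and what is still pending heal. A minimal sketch using the standard gluster CLI, with the volume names taken from the logs above:

    # confirm every brick and the self-heal daemon are online
    gluster volume status data
    gluster volume status engine

    # list entries still pending heal, and any split-brain files
    gluster volume heal data info
    gluster volume heal data info split-brain

If the heal counters never drop to zero, or split-brain entries show up, those outputs are worth capturing alongside logs like the ones above.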

A few things to consider. First, what is your RAID situation per host? If you're using mdadm-based soft RAID, you need to make sure your drives support power-loss data protection, which is mostly a feature of enterprise drives. Essentially, it ensures the drive reserves enough energy to flush its write cache to disk on power loss. Most modern drives have a non-trivial amount of built-in write cache, and losing that data on power loss will gladly corrupt files, especially on soft-RAID setups.

If you're using hardware RAID, make sure drive-based write cache is disabled and that a battery or capacitor is connected to the RAID card's cache module.

If you're using ZFS, which isn't really supported, you need a good UPS and to have it set up to shut systems down cleanly; ZFS will not take power outages well. Power-loss data protection is really important there too, but it's not a fix-all for ZFS, since it also caches writes in system RAM quite a bit. A dedicated cache device with power-loss data protection can help mitigate that, but the power issues are the more pressing concern in this situation.

As far as Gluster is concerned, there is not much that can easily corrupt data on power loss. My only other thought is that if your switches are not also battery backed, that would be an issue.

On 2020-10-08 08:15, Jarosław Prokopowski wrote:
Hi Guys,
I had a situation 2 times that due to unexpected power outage something went wrong and VMs on glusterfs where not recoverable. Gluster heal did not help and I could not start the VMs any more. Is there a way to make such setup bulletproof? Does it matter which volume type I choose - raw or qcow2? Or thin provision versus reallocated? Any other advise?
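To illustrate the write-cache point above: on plain SATA/SAS disks behind mdadm, the volatile write cache can usually be queried and, if the drives have no power-loss protection, turned off. A rough sketch (the device name is a placeholder, and exact option support depends on your hdparm/smartmontools versions):

    # query the drive's volatile write cache state (placeholder device)
    hdparm -W /dev/sdX

    # disable it if the drive lacks power-loss protection
    hdparm -W 0 /dev/sdX

    # roughly the same via smartmontools, which also covers SAS drives
    smartctl -g wcache /dev/sdX
    smartctl -s wcache,off /dev/sdX

Note that the hdparm setting is not persistent across reboots or drive power cycles, so it typically has to be reapplied from a udev rule or boot-time script, and disabling the cache costs write performance - which is why battery/flash-backed controller cache, or enterprise drives with power-loss protection, is the nicer answer.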

Thanks Alex. I actually think that the issue was caused by power loss on the switch socket.
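Whether it is the switch socket or the hosts themselves that lose power, the common mitigation is the same: put everything (switches included) on protected power and let the UPS trigger an orderly shutdown before the battery runs out, for example with NUT. A bare-bones sketch, where the UPS section name, user and password are placeholders and the driver depends on the UPS model:

    # /etc/nut/ups.conf - tell NUT how to talk to the UPS (placeholder names)
    [myups]
        driver = usbhid-ups
        port = auto

    # /etc/nut/upsmon.conf - shut this host down when the UPS reports low battery
    MONITOR myups@localhost 1 upsmon_user secret master
    SHUTDOWNCMD "/sbin/shutdown -h +0"

Ideally the shutdown command is wrapped in a script that first shuts the VMs down (or puts the host into maintenance) so the guests and gluster go down cleanly; and as noted above, the switches need to be on protected power too, or the hosts simply watch each other disappear.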
participants (5)
- Alex McWhirter
- Jarosław Prokopowski
- Jayme
- Strahil Nikolov
- WK