Thx for your help, Strahil!
Hmmm, I see DNS resolution fails for the hostnames given without an FQDN. I'll try to fix it.
19.03.2019, 09:43, "Strahil" <hunter86_bg@yahoo.com>:
Hi Alexei,
>> 1.2 All bricks healed (gluster volume heal data info summary) and no split-brain
>
>
>
> gluster volume heal data info
>
> Brick node-msk-gluster203:/opt/gluster/data
> Status: Connected
> Number of entries: 0
>
> Brick node-msk-gluster205:/opt/gluster/data
> <gfid:18c78043-0943-48f8-a4fe-9b23e2ba3404>
> <gfid:b6f7d8e7-1746-471b-a49d-8d824db9fd72>
> <gfid:6db6a49e-2be2-4c4e-93cb-d76c32f8e422>
> <gfid:e39cb2a8-5698-4fd2-b49c-102e5ea0a008>
> <gfid:5fad58f8-4370-46ce-b976-ac22d2f680ee>
> <gfid:7d0b4104-6ad6-433f-9142-7843fd260c70>
> <gfid:706cd1d9-f4c9-4c89-aa4c-42ca91ab827e>
> Status: Connected
> Number of entries: 7
>
> Brick node-msk-gluster201:/opt/gluster/data
> <gfid:18c78043-0943-48f8-a4fe-9b23e2ba3404>
> <gfid:b6f7d8e7-1746-471b-a49d-8d824db9fd72>
> <gfid:6db6a49e-2be2-4c4e-93cb-d76c32f8e422>
> <gfid:e39cb2a8-5698-4fd2-b49c-102e5ea0a008>
> <gfid:5fad58f8-4370-46ce-b976-ac22d2f680ee>
> <gfid:7d0b4104-6ad6-433f-9142-7843fd260c70>
> <gfid:706cd1d9-f4c9-4c89-aa4c-42ca91ab827e>
> Status: Connected
> Number of entries: 7
>
Data needs healing.
Run: gluster volume heal data full
This does not work.
If it still doesn't heal (check again in 5 min), go to /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx_data
And run 'find . -exec stat {} \;' without the quotes.
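For reference, the whole sequence would look roughly like this (run from a node that has the volume mounted; the mount directory name is taken from the vdsm log below, and the stat output itself is not needed, only the lookups it triggers):

gluster volume heal data full
gluster volume heal data info summary
cd /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data
find . -exec stat {} \; > /dev/null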
As I understand it, the oVirt Hosted Engine is running and can be started on all nodes except one.
The oVirt Hosted Engine works and can be run on all nodes without exception.
Hosted Engine volume /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx_engine can be mounted by all nodes without problems.
>>
>> 2. Go to the problematic host and check the mount point is there
>
>
>
> No mount point on problematic node /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data
> If I create a mount point manually, it is deleted after the node is activated.
>
> Other nodes can mount this volume without problems. Only this node has connection problems after the update.
>
> Here is a part of the log at the time of activation of the node:
>
> vdsm log
>
> 2019-03-18 16:46:00,548+0300 INFO (jsonrpc/5) [vds] Setting Hosted Engine HA local maintenance to False (API:1630)
> 2019-03-18 16:46:00,549+0300 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.setHaMaintenanceMode succeeded in 0.00 seconds (__init__:573)
> 2019-03-18 16:46:00,581+0300 INFO (jsonrpc/7) [vdsm.api] START connectStorageServer(domType=7, spUUID=u'5a5cca91-01f8-01af-0297-00000000025f', conList=[{u'id': u'5799806e-7969-45da-b17d-b47a63e6a8e4', u'connection': u'msk-gluster-facility.xxxx:/data', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], options=None) from=::ffff:10.77.253.210,56630, flow_id=81524ed, task_id=5f353993-95de-480d-afea-d32dc94fd146 (api:46)
> 2019-03-18 16:46:00,621+0300 INFO (jsonrpc/7) [storage.StorageServer.MountConnection] Creating directory u'/rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data' (storageServer:167)
> 2019-03-18 16:46:00,622+0300 INFO (jsonrpc/7) [storage.fileUtils] Creating directory: /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data mode: None (fileUtils:197)
> 2019-03-18 16:46:00,622+0300 WARN (jsonrpc/7) [storage.StorageServer.MountConnection] gluster server u'msk-gluster-facility.xxxx' is not in bricks ['node-msk-gluster203', 'node-msk-gluster205', 'node-msk-gluster201'], possibly mounting duplicate servers (storageServer:317)
This seems very strange. As you have hidden the hostname, I'm not sure which one this is.
Check that DNS resolution works from all hosts and that this host's hostname is resolvable.
Name resolution works without problems.
dig msk-gluster-facility.xxxx
;; ANSWER SECTION:
msk-gluster-facility.xxxx. 1786 IN A 10.77.253.205 # <-- node-msk-gluster205.xxxx
msk-gluster-facility.xxxx. 1786 IN A 10.77.253.201 # <-- node-msk-gluster201.xxxx
msk-gluster-facility.xxxx. 1786 IN A 10.77.253.203 # <-- node-msk-gluster203.xxxx
;; Query time: 5 msec
;; SERVER: 10.77.16.155#53(10.77.16.155)
;; WHEN: Tue Mar 19 14:55:10 MSK 2019
;; MSG SIZE rcvd: 110
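The bricks themselves are referenced by their short hostnames (node-msk-gluster201/203/205, as in the heal info above and the mount log below), so those also need to resolve on the problematic node; a quick check would be something like:

for h in node-msk-gluster201 node-msk-gluster203 node-msk-gluster205; do getent hosts "$h"; done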
Also check if it is in the peer list.
msk-gluster-facility.xxxx is just an A record in DNS. It is used in the web UI for mounting Gluster volumes and for Gluster storage HA.
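For reference, the peer list itself can be checked on any of the storage nodes with:

gluster peer status
gluster pool list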
Try to manually mount the gluster volume:
mount -t glusterfs msk-gluster-facility.xxxx:/data /mnt
Well, the mount works from hypervisor node77-202,
but does not work from hypervisor node77-204 (the problematic node).
node77-204:/var/log/glusterfs/mnt.log
[2019-03-19 12:15:11.106226] I [MSGID: 100030] [glusterfsd.c:2511:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.12.15 (args: /usr/sbin/glusterfs --volfile-server=msk-gluster-facility.xxxx --volfile-id=/data /mnt)
[2019-03-19 12:15:11.109577] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
[2019-03-19 12:15:11.129652] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2019-03-19 12:15:11.135384] I [MSGID: 101190] [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2019-03-19 12:15:11.135696] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-data-readdir-ahead: option 'parallel-readdir' is not recognized
[2019-03-19 12:15:11.135993] I [MSGID: 114020] [client.c:2360:notify] 0-data-client-0: parent translators are ready, attempting connect on transport
[2019-03-19 12:15:11.197155] E [MSGID: 101075] [common-utils.c:324:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2019-03-19 12:15:11.197190] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-0: DNS resolution failed on host node-msk-gluster203
[2019-03-19 12:15:11.197268] I [MSGID: 114020] [client.c:2360:notify] 0-data-client-1: parent translators are ready, attempting connect on transport
[2019-03-19 12:15:11.197293] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-data-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-03-19 12:15:11.263720] E [MSGID: 101075] [common-utils.c:324:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2019-03-19 12:15:11.263741] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-1: DNS resolution failed on host node-msk-gluster205
[2019-03-19 12:15:11.263809] I [MSGID: 114020] [client.c:2360:notify] 0-data-client-2: parent translators are ready, attempting connect on transport
[2019-03-19 12:15:11.263812] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-data-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-03-19 12:15:15.142350] E [MSGID: 101075] [common-utils.c:324:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2019-03-19 12:15:15.142400] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-0: DNS resolution failed on host node-msk-gluster203
[2019-03-19 12:15:15.198231] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-1: DNS resolution failed on host node-msk-gluster205
[2019-03-19 12:15:18.221112] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-0: DNS resolution failed on host node-msk-gluster203
[2019-03-19 12:15:18.249776] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-1: DNS resolution failed on host node-msk-gluster205
[2019-03-19 12:15:21.252556] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2019-03-19 12:15:21.252586] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0
The message "E [MSGID: 101075] [common-utils.c:324:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)" repeated 3 times between [2019-03-19 12:15:15.142350] and [2019-03-19 12:15:18.249774]
[2019-03-19 12:15:21.252696] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-data-replicate-0: no subvolumes up
[2019-03-19 12:15:21.252750] E [fuse-bridge.c:4271:fuse_first_lookup] 0-fuse: first lookup on root failed (Transport endpoint is not connected)
[2019-03-19 12:15:21.283951] E [MSGID: 101075] [common-utils.c:324:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2019-03-19 12:15:21.283983] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-0: DNS resolution failed on host node-msk-gluster203
[2019-03-19 12:15:21.304237] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-data-client-2: DNS resolution failed on host node-msk-gluster201
Is this a second FQDN/IP of this server?
If so, gluster accepts that via 'gluster peer probe IP2'.
Hmmm, I see DNS resolution fails for the hostnames given without an FQDN. I'll try to fix it.
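Assuming the short names should map to the same addresses as in the dig output above (adjust if the bricks sit on a separate storage network), the quickest fix on the problematic node is either adding the domain to the 'search' line in /etc/resolv.conf or pinning the names in /etc/hosts, e.g.:

# example /etc/hosts entries, based on the A records above
10.77.253.201  node-msk-gluster201.xxxx  node-msk-gluster201
10.77.253.203  node-msk-gluster203.xxxx  node-msk-gluster203
10.77.253.205  node-msk-gluster205.xxxx  node-msk-gluster205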
>> 2.1. Check permissions (should be vdsm:kvm) and fix with chown -R if needed
>> 2.2. Check the OVF_STORE from the logs that it exists
>
>
> How can I do this?
Go to /rhev/data-center/mnt/glusterSD/host_engine and use find inside the domain UUID directory for files that are not owned by vdsm:kvm.
I usually run 'chown -R vdsm:kvm 823xx-xxxx-yyyy-zzz' and it will fix any ownership problems.
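For reference, a minimal version of that check and fix (the engine mount path is taken from above; the domain UUID is a placeholder):

cd /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx_engine
find . ! -user vdsm -o ! -group kvm   # list anything not owned by vdsm:kvm
chown -R vdsm:kvm <domain-UUID>       # fix ownership of the storage domain directory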
Best Regards,
Strahil Nikolov