Thanks for the answer!
 
 
 
18.03.2019, 14:52, "Strahil Nikolov" <hunter86_bg@yahoo.com>:
 
Hi Alexei,
 
In order to debug it check the following:
 
1. Check gluster:
1.1 Are all bricks up?
 
All peers are up. Gluster version is 3.12.15.
 
[root@node-msk-gluster203 ~]# gluster peer status
Number of Peers: 2
 
Hostname: node-msk-gluster205.xxxx
Uuid: 188d8444-3246-4696-a0a7-2872e0a01067
State: Peer in Cluster (Connected)
 
Hostname: node-msk-gluster201.xxxx
Uuid: 919b0a60-b9b7-4091-a60a-51d43b995285
State: Peer in Cluster (Connected)
 
All bricks on all gluster servers are UP.
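 
The volume layout below is from gluster volume info; the per-brick process state can also be double-checked per volume, e.g. (assuming the standard gluster CLI):
 
# prints port, PID and online state for every brick of the volume
gluster volume status data
gluster volume status engine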
 
Volume Name: data
Type: Replicate
Volume ID: 8fb43ba3-b2e9-4e33-b4c3-b0b03cd8cba3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: node-msk-gluster203.xxxx:/opt/gluster/data
Brick2: node-msk-gluster205.xxxx:/opt/gluster/data
Brick3: node-msk-gluster201.xxxx:/opt/gluster/data (arbiter)
 
Volume Name: engine
Type: Replicate
Volume ID: 5dda8427-c69b-4b96-bcd6-eff3be2e0b5c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp 
Bricks:
Brick1: node-msk-gluster205.xxxx:/opt/gluster/engine
Brick2: node-msk-gluster203.xxxx:/opt/gluster/engine
Brick3: node-msk-gluster201.xxxx:/opt/gluster/engine (arbiter)
 
 
 
1.2 Are all bricks healed (gluster volume heal data info summary), with no split-brain?
 
 
gluster volume heal data info
 
Brick node-msk-gluster203:/opt/gluster/data
Status: Connected
Number of entries: 0
 
Brick node-msk-gluster205:/opt/gluster/data
<gfid:18c78043-0943-48f8-a4fe-9b23e2ba3404>
<gfid:b6f7d8e7-1746-471b-a49d-8d824db9fd72>
<gfid:6db6a49e-2be2-4c4e-93cb-d76c32f8e422>
<gfid:e39cb2a8-5698-4fd2-b49c-102e5ea0a008>
<gfid:5fad58f8-4370-46ce-b976-ac22d2f680ee>
<gfid:7d0b4104-6ad6-433f-9142-7843fd260c70>
<gfid:706cd1d9-f4c9-4c89-aa4c-42ca91ab827e>
Status: Connected
Number of entries: 7
 
Brick node-msk-gluster201:/opt/gluster/data
<gfid:18c78043-0943-48f8-a4fe-9b23e2ba3404>
<gfid:b6f7d8e7-1746-471b-a49d-8d824db9fd72>
<gfid:6db6a49e-2be2-4c4e-93cb-d76c32f8e422>
<gfid:e39cb2a8-5698-4fd2-b49c-102e5ea0a008>
<gfid:5fad58f8-4370-46ce-b976-ac22d2f680ee>
<gfid:7d0b4104-6ad6-433f-9142-7843fd260c70>
<gfid:706cd1d9-f4c9-4c89-aa4c-42ca91ab827e>
Status: Connected
Number of entries: 7
 
gluster volume heal engine info
 
Brick node-msk-gluster205.xxxx:/opt/gluster/engine
Status: Connected
Number of entries: 0
 
Brick node-msk-gluster203.xxxx:/opt/gluster/engine
Status: Connected
Number of entries: 0
 
Brick node-msk-gluster201.xxxx:/opt/gluster/engine
Status: Connected
Number of entries: 0
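 
Side note: since the data volume still shows 7 pending entries on two bricks, it may be worth triggering a heal and explicitly checking for split-brain, e.g. (assuming the standard gluster CLI):
 
# kick off an index heal on the data volume
gluster volume heal data
# should list 0 split-brain entries per brick
gluster volume heal data info split-brain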
 
 
 
 
2. Go to the problematic host and check that the mount point is there.
 
 
There is no mount point on the problematic node: /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data
If I create the mount point directory manually, it is deleted once the node is activated.
 
Other nodes can mount this volume without problems; only this node has had connection problems since the update.
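 
As a quick check it may also help to mount the volume by hand, outside of vdsm, and see whether the mount itself fails (hypothetical scratch directory /mnt/gtest):
 
mkdir -p /mnt/gtest
mount -t glusterfs msk-gluster-facility.xxxx:/data /mnt/gtest
# ... inspect, then clean up
umount /mnt/gtest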
 
Here is part of the log from the time the node was activated:
 
vdsm log
 
2019-03-18 16:46:00,548+0300 INFO  (jsonrpc/5) [vds] Setting Hosted Engine HA local maintenance to False (API:1630)
2019-03-18 16:46:00,549+0300 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.setHaMaintenanceMode succeeded in 0.00 seconds (__init__:573)
2019-03-18 16:46:00,581+0300 INFO  (jsonrpc/7) [vdsm.api] START connectStorageServer(domType=7, spUUID=u'5a5cca91-01f8-01af-0297-00000000025f', conList=[{u'id': u'5799806e-7969-45da-b17d-b47a63e6a8e4', u'connection': u'msk-gluster-facility.xxxx:/data', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], options=None) from=::ffff:10.77.253.210,56630, flow_id=81524ed, task_id=5f353993-95de-480d-afea-d32dc94fd146 (api:46)
2019-03-18 16:46:00,621+0300 INFO  (jsonrpc/7) [storage.StorageServer.MountConnection] Creating directory u'/rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data' (storageServer:167)
2019-03-18 16:46:00,622+0300 INFO  (jsonrpc/7) [storage.fileUtils] Creating directory: /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data mode: None (fileUtils:197)
2019-03-18 16:46:00,622+0300 WARN  (jsonrpc/7) [storage.StorageServer.MountConnection] gluster server u'msk-gluster-facility.xxxx' is not in bricks ['node-msk-gluster203', 'node-msk-gluster205', 'node-msk-gluster201'], possibly mounting duplicate servers (storageServer:317)
2019-03-18 16:46:00,622+0300 INFO  (jsonrpc/7) [storage.Mount] mounting msk-gluster-facility.xxxx:/data at /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data (mount:204)
2019-03-18 16:46:00,809+0300 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2415)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2412, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 179, in connect
    six.reraise(t, v, tb)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 171, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount
    cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (1, ';Running scope as unit run-72797.scope.\nMount failed. Please check the log file for more details.\n')
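 
The MountError above only says "Mount failed. Please check the log file"; the actual reason is usually in the gluster client log for that mount point, e.g. (the file name is derived from the mount path, so the exact name may differ):
 
less /var/log/glusterfs/rhev-data-center-mnt-glusterSD-msk-gluster-facility.xxxx:_data.log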
 
 
2.1. Check permissions (should be vdsm:kvm) and fix with chown -R if needed
2.2. Check from the logs that the OVF_STORE exists
 
How can I do this?
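 
For 2.1, a one-liner like the following should be enough if the ownership really is wrong (run on a node where the domain is mounted):
 
chown -R vdsm:kvm /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data
 
For 2.2, one possible way, assuming the default log location, is to grep the hosted-engine agent log, which records the OVF_STORE volumes it scans:
 
grep -i ovf_store /var/log/ovirt-hosted-engine-ha/agent.log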
 
 
2.3. Check that vdsm can extract the file:
sudo -u vdsm tar -tvf /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data/DOMAIN-UUID/Volume-UUID/Image-ID
 
Regarding the error "ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'":
 
All OVF_STORE files from msk-gluster-facility.xxxx:_data are accessible from the non-problematic nodes:
 
sudo -u vdsm tar -tvf /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx\:_data/7d5de684-58ff-4fbc-905d-3048fc55b2b1/images/05da3b87-ca77-4982-ac91-449c6ffb92da/8d9fee07-0526-4552-9ae6-2d2f050d7443
-rw-r--r-- 0/0             138 2019-03-15 14:13 info.json
-rw-r--r-- 0/0           26445 2019-03-15 14:13 efb12716-bd76-4d76-ab5a-a4782cc185ab.ovf
-rw-r--r-- 0/0           43804 2019-03-15 14:13 f8b046a1-bf32-4811-9a8d-526c1f89bed5.ovf
-rw-r--r-- 0/0            9488 2019-03-15 14:13 d108204e-4fe6-457d-859b-9afbaffd62f0.ovf
-rw-r--r-- 0/0           12753 2019-03-15 14:13 056c96d6-eb0a-4cba-8bb2-66c736bdd2bb.ovf
-rw-r--r-- 0/0           42148 2019-03-15 14:13 a57ce81b-657d-4c78-a224-ff73dd8d6830.ovf
-rw-r--r-- 0/0           12929 2019-03-15 14:13 1c93561a-9167-474c-b983-2a15279164c5.ovf
-rw-r--r-- 0/0              23 2019-03-15 14:13 metadata.json
 
sudo -u vdsm tar -tvf /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx\:_data/7d5de684-58ff-4fbc-905d-3048fc55b2b1/images/bbf6f780-2088-492c-8bca-75e8521c1242/c4472869-893f-44e8-a569-465932efd339
-rw-r--r-- 0/0             138 2017-12-31 00:11 info.json
-rw-r--r-- 0/0           11181 2017-12-31 00:11 056c96d6-eb0a-4cba-8bb2-66c736bdd2bb.ovf
-rw-r--r-- 0/0           11293 2017-12-31 00:11 921b66f7-a686-4537-b0f1-f91ecc0f75e1.ovf
 
 sudo -u vdsm tar -tvf /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx\:_data/7d5de684-58ff-4fbc-905d-3048fc55b2b1/images/d8c5e208-678b-4b2c-8afe-d9bf8e45639d/cdb605f9-2a01-4297-8c14-6b405c58225c
-rw-r--r-- 0/0             138 2019-03-15 14:13 info.json
-rw-r--r-- 0/0           26445 2019-03-15 14:13 efb12716-bd76-4d76-ab5a-a4782cc185ab.ovf
-rw-r--r-- 0/0           43804 2019-03-15 14:13 f8b046a1-bf32-4811-9a8d-526c1f89bed5.ovf
-rw-r--r-- 0/0            9488 2019-03-15 14:13 d108204e-4fe6-457d-859b-9afbaffd62f0.ovf
-rw-r--r-- 0/0           12753 2019-03-15 14:13 056c96d6-eb0a-4cba-8bb2-66c736bdd2bb.ovf
-rw-r--r-- 0/0           42148 2019-03-15 14:13 a57ce81b-657d-4c78-a224-ff73dd8d6830.ovf
-rw-r--r-- 0/0           12929 2019-03-15 14:13 1c93561a-9167-474c-b983-2a15279164c5.ovf
-rw-r--r-- 0/0              23 2019-03-15 14:13 metadata.json
 
 sudo -u vdsm tar -tvf /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx\:_data/7d5de684-58ff-4fbc-905d-3048fc55b2b1/images/c4ebed37-879f-46b5-ac24-78008ee11a55/e1340f68-16d3-45cf-8fd4-e7efed7c158e
-rw-r--r-- 0/0             138 2017-12-31 00:11 info.json
-rw-r--r-- 0/0           11181 2017-12-31 00:11 056c96d6-eb0a-4cba-8bb2-66c736bdd2bb.ovf
-rw-r--r-- 0/0           11293 2017-12-31 00:11 921b66f7-a686-4537-b0f1-f91ecc0f75e1.ovf
 
 
 
3. Configure a virsh alias, as it's quite helpful:
alias virsh='virsh -c qemu:///system?authfile=/etc/ovirt-hosted-engine/virsh_auth.conf'
 
4. If the VM is running, go to the host and get the XML:
virsh dumpxml HostedEngine > /root/HostedEngine.xml
 
Got it. https://yadi.sk/d/MB9BeehjmyxWFQ 
 
4.1. Get the Network:
virsh net-dumpxml vdsm-ovirtmgmt > /root/vdsm-ovirtmgmt.xml
 
Got it.
 
<network>
  <name>vdsm-ovirtmgmt</name>
  <uuid>a618e93f-00e0-4f11-bb7a-4b8de9d56570</uuid>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>
 
4.2 If not, here is mine:
[root@ovirt1 ~]# virsh net-dumpxml vdsm-ovirtmgmt
<network>
  <name>vdsm-ovirtmgmt</name>
  <uuid>7ae538ce-d419-4dae-93b8-3a4d27700227</uuid>
  <forward mode='bridge'/>
  <bridge name='ovirtmgmt'/>
</network>

 
The UUID is not important, as my first recovery was with a different one.
 
5. If your Hosted Engine is down:
 
The Engine is UP and works without problems on all nodes, including the ONE problematic node.
 
5.1 Remove the VM (if it exists anywhere) on all nodes:
virsh undefine HostedEngine
5.2 Verify that the nodes are in global maintenance:
hosted-engine --vm-status
5.3 Define the Engine on only 1 machine
virsh define HostedEngine.xml
virsh net-define vdsm-ovirtmgmt.xml
 
virsh start HostedEngine
 
Note: if it complains about the storage, the link /var/run/vdsm/storage/DOMAIN-UUID/Volume-UUID pointing to your Volume-UUID is missing.
Here is how mine looks:
[root@ovirt1 808423f9-8a5c-40cd-bc9f-2568c85b8c74]# ll /var/run/vdsm/storage/808423f9-8a5c-40cd-bc9f-2568c85b8c74
total 24
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:42 2c74697a-8bd9-4472-8a98-bf624f3462d5 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/2c74697a-8bd9-4472-8a98-bf624f3462d5
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:45 3ec27d6d-921c-4348-b799-f50543b6f919 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/3ec27d6d-921c-4348-b799-f50543b6f919
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 08:28 441abdc8-6cb1-49a4-903f-a1ec0ed88429 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/441abdc8-6cb1-49a4-903f-a1ec0ed88429
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 21:15 8ec7a465-151e-4ac3-92a7-965ecf854501 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/8ec7a465-151e-4ac3-92a7-965ecf854501
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 08:28 94ade632-6ecc-4901-8cec-8e39f3d69cb0 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/94ade632-6ecc-4901-8cec-8e39f3d69cb0
lrwxrwxrwx. 1 vdsm kvm 139 Mar 17 07:42 fe62a281-51e9-4b23-87b3-2deb52357304 -> /rhev/data-center/mnt/glusterSD/ovirt1.localdomain:_engine/808423f9-8a5c-40cd-bc9f-2568c85b8c74/images/fe62a281-51e9-4b23-87b3-2deb52357304

 
 
Once you create your link, start it again.
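 
A sketch of recreating such a link, where DOMAIN-UUID, IMAGE-UUID and SERVER are placeholders for your own values (the same UUIDs that appear in the listing above):
 
mkdir -p /var/run/vdsm/storage/DOMAIN-UUID
ln -s /rhev/data-center/mnt/glusterSD/SERVER:_engine/DOMAIN-UUID/images/IMAGE-UUID \
      /var/run/vdsm/storage/DOMAIN-UUID/IMAGE-UUID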
 
6. Wait until the OVF is fixed (it takes longer than the setting in the engine :) )
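 
The interval in question is the engine's OVF update period; one way to lower it on the engine machine (assuming the OvfUpdateIntervalInMinutes key is available in engine-config):
 
engine-config -s OvfUpdateIntervalInMinutes=5
systemctl restart ovirt-engine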
 
Good Luck!
 
Best Regards,
Strahil Nikolov
 
On Monday, 18 March 2019 at 12:57:30 GMT+2, Николаев Алексей <alexeynikolaev.post@yandex.ru> wrote:
 
 
Hi all!
 
I have a very similar problem after updating one of the two nodes to version 4.3.1. Node77-02 lost its connection to the gluster volume named DATA, but not to the volume with the hosted engine.
 
 
node77-02 /var/log/messages
 
Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': '2ee71105-1810-46eb-9388-cc6caccf9fac', 'volumeID': u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11', 'imageID': u'43b75b50-cad4-411f-8f51-2e99e52f4c77'} failed:#012(code=201, message=Volume does not exist: (u'224e4b80-2744-4d7f-bd9f-43eb8fe6cf11',))
Mar 18 13:40:00 node77-02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
 
The HostedEngine VM works fine on all nodes, but node77-02 fails with this error in the web UI:
 
ConnectStoragePoolVDS failed: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'
 
node77-02 vdsm.log
 
2019-03-18 13:51:46,287+0300 WARN  (jsonrpc/7) [storage.StorageServer.MountConnection] gluster server u'msk-gluster-facility.xxxx' is not in bricks ['node-msk-gluster203', 'node-msk-gluster205', 'node-msk-gluster201'], possibly mounting duplicate servers (storageServer:317)
2019-03-18 13:51:46,287+0300 INFO  (jsonrpc/7) [storage.Mount] mounting msk-gluster-facility.xxxx:/data at /rhev/data-center/mnt/glusterSD/msk-gluster-facility.xxxx:_data (mount:204)
2019-03-18 13:51:46,474+0300 ERROR (jsonrpc/7) [storage.HSM] Could not connect to storageServer (hsm:2415)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2412, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 179, in connect
    six.reraise(t, v, tb)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 171, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount
    cgroup=cgroup)
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
MountError: (1, ';Running scope as unit run-10121.scope.\nMount failed. Please check the log file for more details.\n')
 
------------------------------
 
2019-03-18 13:51:46,830+0300 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='fe81642e-2421-4169-a08b-51467e8f01fe') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in connectStoragePool
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1035, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1097, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 700, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1274, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1495, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: u'spUUID=5a5cca91-01f8-01af-0297-00000000025f, msdUUID=7d5de684-58ff-4fbc-905d-3048fc55b2b1'
 
What is the best practice to recover from this problem?
 
 
 
 
15.03.2019, 13:47, "Strahil Nikolov" <hunter86_bg@yahoo.com>:
 
On Fri, Mar 15, 2019 at 8:12 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Ok,
 
I have managed to recover again, and no issues have been detected this time.
I guess this case is quite rare and nobody else has experienced it.
 
>Hi,
>can you please explain how you fixed it?
 
I set global maintenance again, defined the HostedEngine from the old XML (taken from an old vdsm log), defined the network, and powered it off.
I set the OVF update period to 5 min, but it took several hours until the OVF_STORE volumes were updated. Once this happened, I restarted ovirt-ha-agent and ovirt-ha-broker on both nodes. Then I powered off the HostedEngine and undefined it on ovirt1.
 
Then I set the maintenance mode to 'none' and the VM powered up on ovirt1.
To test a failure, I removed the global maintenance and powered off the HostedEngine from within itself (via ssh). It was brought back up on the other node.
 
To test a failure of ovirt2, I set ovirt1 into local maintenance and then removed it (mode 'none'), shut down the VM again via ssh, and it started again on ovirt1.
 
It seems to be working, as I have since shut down the Engine several times and it has managed to start without issues.
 
I'm not sure this is related, but I had noticed that ovirt2 was out of sync on the vdsm-ovirtmgmt network; it was easily fixed via the UI.
 
 
 
Best Regards,
Strahil Nikolov
 

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/EMPIGC7JHHWZOONGOLYJWOHNXMYDDSHX/