Can't connect vdsm storage: Command StorageDomain.getInfo with args failed: (code=350, message=Error in storage domain action
by asm@pioner.kz
Hi! I am trying to upgrade my hosts and have a problem with it. After upgrading one host I see that it is NonOperational. All was fine with vdsm-4.30.24-1.el7, but after upgrading to the new version vdsm-4.30.40-1.el7.x86_64 (and some other packages) I get errors.
First of all, I see in the oVirt Events: Host srv02 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center Default. Setting Host state to Non-Operational. My Default storage domain, holding the HE VM data, is on NFS storage.
In the messages log of the host:
srv02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
    ).format(t=type, o=options, e=e)
RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '192.168.2.248'}]
Feb 1 15:41:42 srv02 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Trying to restart agent
In the broker log:
MainThread::WARNING::2020-02-01 15:43:35,167::storage_broker::97::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Command StorageDomain.getInfo with args {'storagedomainID': 'bbdddea7-9cd6-41e7-ace5-fb9a6795caa8'} failed:
(code=350, message=Error in storage domain action: (u'sdUUID=bbdddea7-9cd6-41e7-ace5-fb9a6795caa8',))
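For anyone debugging this, the same getInfo call can be reproduced by hand on the host with vdsm-client (the UUID is the one from the broker log above), which makes it easy to re-test after each change:

# ask vdsm for the storage domain info directly
vdsm-client StorageDomain getInfo storagedomainID=bbdddea7-9cd6-41e7-ace5-fb9a6795caa8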
In vdsm.log:
2020-02-01 15:44:19,930+0600 INFO (jsonrpc/0) [vdsm.api] FINISH getStorageDomainInfo error=[Errno 1] Operation not permitted from=::1,57528, task_id=40683f67-d7b0-4105-aab8-6338deb54b00 (api:52)
2020-02-01 15:44:19,930+0600 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='40683f67-d7b0-4105-aab8-6338deb54b00') Unexpected error (task:875)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
return fn(*args, **kargs)
File "<string>", line 2, in getStorageDomainInfo
File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
ret = func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2753, in getStorageDomainInfo
dom = self.validateSdUUID(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 305, in validateSdUUID
sdDom = sdCache.produce(sdUUID=sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
domain.getRealDomain()
File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
return self._cache._realProduce(self._sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
domain = self._findDomain(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
return findMethod(sdUUID)
File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 145, in findDomain
return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 378, in __init__
manifest.sdUUID, manifest.mountpoint)
File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 853, in _detect_block_size
block_size = iop.probe_block_size(mountpoint)
File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 384, in probe_block_size
return self._ioproc.probe_block_size(dir_path)
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 602, in probe_block_size
"probe_block_size", {"dir": dir_path}, self.timeout)
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
raise OSError(errcode, errstr)
OSError: [Errno 1] Operation not permitted
2020-02-01 15:44:19,930+0600 INFO (jsonrpc/0) [storage.TaskManager.Task] (Task='40683f67-d7b0-4105-aab8-6338deb54b00') aborting: Task is aborted: u'[Errno 1] Operation not permitted' - code 100 (task:1181)
2020-02-01 15:44:19,930+0600 ERROR (jsonrpc/0) [storage.Dispatcher] FINISH getStorageDomainInfo error=[Errno 1] Operation not permitted (dispatcher:87)
But I see that this domain is mounted (per the mount command):
storage:/volume3/ovirt-hosted on /rhev/data-center/mnt/storage:_volume3_ovirt-hosted type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,clientaddr=192.168.2.251,local_lock=none,addr=192.168.2.248)
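A quick way to check whether the vdsm user can actually write to that mount (the test file name is just an example):

# run a write test as the vdsm user on the mounted export
sudo -u vdsm touch /rhev/data-center/mnt/storage:_volume3_ovirt-hosted/write_test
sudo -u vdsm rm /rhev/data-center/mnt/storage:_volume3_ovirt-hosted/write_test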
I don't see a storage directory in /var/run/vdsm, and I see many differences compared with the other hosts. Here is the listing of /var/run/vdsm:
bonding-defaults.json
dhclientmon
nets_restored
payload
svdsm.sock
v2v
vhostuser
bonding-name2numeric.json
mom-vdsm.sock
ovirt-imageio-daemon.sock
supervdsmd.lock
trackedInterfaces
vdsmd.lock
What is the problem? Please help.
4 years, 9 months
Re: Can't connect vdsm storage: Command StorageDomain.getInfo with args failed: (code=350, message=Error in storage domain action
by Alexandr Mikhailov
Yes. The problem is with permissions, but not with the permissions on the export directory. Two ways to resolve this:
1) Set 777 permissions on the directory - not the right solution.
2) Set anonuid=36,anongid=36 in the export - this is the right solution; very strange that this is not in any documentation, but it is very important!
I had the right permissions on my export directory, but it was not working before I chmodded it to 777. After I changed the parameters in the exports file, I returned the chmod to 755 and everything works fine now!
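An exports line with this fix would look roughly like this (the path is the one from the mount output in the first message; the other options are just an example - 36:36 is the vdsm user and kvm group):

# /etc/exports on the NFS server
/volume3/ovirt-hosted *(rw,sync,no_subtree_check,anonuid=36,anongid=36)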
Thank you all very much.
4 years, 9 months
Device /dev/sdb excluded by a filter.\n
by Steve Watkins
Since I managed to crash my last attempt at installing by uploading an ISO, I wound up just reloading all the nodes and starting from scratch. Now one node gets "Device /dev/sdb excluded by a filter.\n" and fails when creating the volumes. Can't seem to get past that -- the other machines are set up identically and don't fail, and it worked before when installed, but now...
Any ideas?
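For what it's worth, this message usually means LVM is rejecting the disk, either through a filter in /etc/lvm/lvm.conf or because of leftover signatures from the previous install. A quick check along those lines (note that wipefs -a is destructive and should only be run on a disk you intend to re-use):

# look for leftover partition/LVM/filesystem signatures on the disk
blkid /dev/sdb
lsblk /dev/sdb
# if stale signatures are the culprit, wipe them (this DESTROYS data on /dev/sdb)
wipefs -a /dev/sdb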
4 years, 9 months
oVirt upgrade problems...
by matteo fedeli
Hi all, I have many problems after upgrading my oVirt version. I come from 4.3.5.2.
When I upgraded my three hosts (hyperconverged environment) to 4.3.7, I went through a few days of instability: HA agent down, Gluster problems...
When I rebooted for the umpteenth time (after reinitializing the lockspace, heal, heal full...), everything came up, HA agent and broker included.
Two days later 4.3.8 arrived, so I decided to start the update: america succeeded, and europa came up after various problems (as described before).
On the asia host, after 30 seconds of updating, the engine went down together with all three nodes... Are there any known important bugs?
This is the state of my nodes:
https://pastebin.com/2XsnTyHi
https://pastebin.com/ZeKTdaZ7
https://pastebin.com/ZDUBg4vG
4 years, 9 months
Gluster Heal Issue
by Christian Reiss
Hey folks,
in our production setup with 3 nodes (HCI) we took one host down
(maintenance, stop gluster, power off via ssh/oVirt engine). Once it was
back up, Gluster had 2k healing entries, which went down to 2 in a matter
of 10 minutes.
Those two give me a headache:
[root@node03:~] # gluster vol heal ssd_storage info
Brick node01:/gluster_bricks/ssd_storage/ssd_storage
<gfid:a121e4fb-0984-4e41-94d7-8f0c4f87f4b6>
<gfid:6f8817dc-3d92-46bf-aa65-a5d23f97490e>
Status: Connected
Number of entries: 2
Brick node02:/gluster_bricks/ssd_storage/ssd_storage
Status: Connected
Number of entries: 0
Brick node03:/gluster_bricks/ssd_storage/ssd_storage
<gfid:a121e4fb-0984-4e41-94d7-8f0c4f87f4b6>
<gfid:6f8817dc-3d92-46bf-aa65-a5d23f97490e>
Status: Connected
Number of entries: 2
No paths, only gfids. We took down node02, so it does not have the file:
[root@node01:~] # md5sum
/gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
75c4941683b7eabc223fc9d5f022a77c
/gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
[root@node02:~] # md5sum
/gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
md5sum:
/gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6:
No such file or directory
[root@node03:~] # md5sum
/gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
75c4941683b7eabc223fc9d5f022a77c
/gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
The other two files are md5-identical.
These flags are identical, too:
[root@node01:~] # getfattr -d -m . -e hex
/gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
getfattr: Removing leading '/' from absolute path names
# file:
gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.ssd_storage-client-1=0x0000004f0000000100000000
trusted.gfid=0xa121e4fb09844e4194d78f0c4f87f4b6
trusted.gfid2path.d4cf876a215b173f=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f38366461303238392d663734662d343230302d393238342d3637386537626437363139352e31323030
trusted.glusterfs.mdata=0x010000000000000000000000005e349b1e000000001139aa2a000000005e349b1e000000001139aa2a000000005e34994900000000304a5eb2
getfattr: Removing leading '/' from absolute path names
# file:
gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.ssd_storage-client-1=0x0000004f0000000100000000
trusted.gfid=0xa121e4fb09844e4194d78f0c4f87f4b6
trusted.gfid2path.d4cf876a215b173f=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f38366461303238392d663734662d343230302d393238342d3637386537626437363139352e31323030
trusted.glusterfs.mdata=0x010000000000000000000000005e349b1e000000001139aa2a000000005e349b1e000000001139aa2a000000005e34994900000000304a5eb2
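In case it helps with diagnosis: for regular files, the gfid entry under .glusterfs is a hardlink to the real file, so it can be mapped back to a path on the brick, and gluster can be asked whether it considers the entries split-brain (brick path and volume name are taken from the output above):

# resolve the gfid to its real path on the brick (the two share an inode)
find /gluster_bricks/ssd_storage/ssd_storage -samefile \
  /gluster_bricks/ssd_storage/ssd_storage/.glusterfs/a1/21/a121e4fb-0984-4e41-94d7-8f0c4f87f4b6 \
  -not -path '*/.glusterfs/*'
# ask gluster whether these entries are in actual split-brain
gluster volume heal ssd_storage info split-brain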
Now, I don't dare simply proceed without some advice.
Anyone got a clue on how to resolve this issue? File #2 is identical to
this one, from a problem point of view.
Have a great weekend!
-Chris.
--
with kind regards,
mit freundlichen Gruessen,
Christian Reiss
4 years, 9 months
Ovirt-engine-ha cannot to see live status of Hosted Engine
by asm@pioner.kz
Good day to all.
I have some issues with oVirt 4.2.6, but the main one is this:
I have two CentOS 7 nodes with the same config and the latest oVirt 4.2.6, with the HostedEngine's disk on NFS storage.
Some virtual machines are also working fine.
When the HostedEngine is running on one node (srv02.local), everything is fine.
After migrating it to the other node (srv00.local), I see that the agent cannot check the liveliness of the HostedEngine. After a few minutes the HostedEngine reboots, and after some time I see the same situation again. After migrating it back to srv02.local, all looks OK.
Output of the hosted-engine --vm-status command when the HostedEngine is on the srv00 node:
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : srv02.local
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_unexpected", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : ecc7ad2d
local_conf_timestamp : 78328
Host timestamp : 78328
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=78328 (Tue Sep 18 12:44:18 2018)
host-id=1
score=0
vm_conf_refresh_time=78328 (Tue Sep 18 12:44:18 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
timeout=Fri Jan 2 03:49:58 1970
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : srv00.local
Host ID : 2
Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 1d62b106
local_conf_timestamp : 326288
Host timestamp : 326288
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=326288 (Tue Sep 18 12:44:21 2018)
host-id=2
score=3400
vm_conf_refresh_time=326288 (Tue Sep 18 12:44:21 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
The agent.log from srv00.local:
MainThread::INFO::2018-09-18 12:40:51,749::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:40:52,052::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:01,066::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:01,374::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host srv02.local.pioner.kz (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=78128 (Tue Sep 18 12:40:58 2018)\nhost-id=1\nscore=0\nvm_conf_refresh_time=78128 (Tue Sep 18 12:40:58 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineUnexpectedlyDown\nstopped=False\ntimeout=Fri Jan 2 03:49:58 1970\n', 'hostname': 'srv02.local.pioner.kz', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down_unexpected', 'detail': 'unknown'}, 'score': 0, 'stopped': False, 'maintenance': False, 'crc32': 'e18e3f22', 'local_conf_timestamp': 78128, 'host-ts': 78128}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'failed liveliness check', 'health': 'bad', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 12763.0, 'maintenance': False, 'cpu-load': 0.0364, 'gateway': 1.0, 'storage-domain': True}
MainThread::INFO::2018-09-18 12:41:11,393::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:11,703::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:21,716::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:22,020::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:31,033::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:31,344::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
As we can see, the agent thinks the HostedEngine is just powering up. I cannot do anything about it. I have already reinstalled the srv00 node many times without success.
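For reference, the liveliness check the agent performs is, as far as I know, an HTTP probe of the engine's health page; running it by hand from the host shows what the agent sees (the FQDN below is an example, replace it with your engine's):

# probe the engine health page the way the HA agent does
curl http://engine.example.com/ovirt-engine/services/health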
One time I even had to uninstall the ovirt* and vdsm* packages. Another interesting point: after installing just "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release42.rpm" on this node, I tried to add the node from the engine web interface with the "Deploy" action. The installation was unsuccessful until I installed ovirt-hosted-engine-ha on the node first; I don't see anywhere in the documentation that this is needed before installing new hosts, but I mention it for information and checking. After installing ovirt-hosted-engine-ha, the node was installed with HostedEngine support. But the main issue did not change.
Thanks in advance for your help.
BR,
Alexandr
4 years, 9 months