I didn't report it, as nobody had mentioned it and I thought it was a one-time issue.
I am now quite confident that this is a bug. Are you using the gluster FUSE mounts (the ones
in /rhev...) or libgfapi?
Can you open a case in the bug tracker?
Best Regards,
Strahil Nikolov

On Dec 15, 2019 13:16, Jayme <jaymef(a)gmail.com> wrote:
I compared each file across my nodes and synced them. It seems to have resolved my
issue.
I wonder if there is a problem with the 6.5 to 6.6 upgrade that is causing this? It's
strange that it seems to have happened to more than one person. I was also following the
proper upgrade procedure.
On Sun, Dec 15, 2019 at 3:09 AM <hunter86_bg(a)yahoo.com> wrote:
>
> I don't know. I had the same issues when I migrated my gluster from v6.5 to 6.6
(currently running v7.0).
> Just get the newest file and rsync it to the rest of the bricks. It will solve the
'?????? ??????' problem.
>
> Best Regards,
> Strahil Nikolov
> On Sunday, December 15, 2019, 3:49:27 AM GMT+2, Jayme <jaymef(a)gmail.com> wrote:
>
>
> On that page it says to check open bugs, and the migration bug you mention does not
appear to be on the list. Has it been resolved, or is it just missing from this page?
>
> On Sat, Dec 14, 2019 at 7:53 PM Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
>>
>> Nah... this is not gonna fix your issue and is unnecessary.
>> Just compare the data from all bricks... most probably the 'Last Updated' time is
>> different and the gfid of the file is different.
>> Find the brick that has the freshest data, and replace (move away as a backup and
>> rsync) the file from the last good copy to the other bricks.
>> You can also run a 'full heal'.
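>>
>> For example, roughly something like this -- the volume name 'engine' and the brick path
>> are just taken from your output below, and <path-to-file> is a placeholder for the
>> affected file:
>>
>> # on each node, compare mtime and gfid of the file directly on the brick
>> stat /gluster_bricks/engine/engine/<path-to-file>
>> getfattr -d -m . -e hex /gluster_bricks/engine/engine/<path-to-file>
>>
>> # on the nodes with the stale copy: move it aside as a backup, then pull in the freshest copy
>> mv /gluster_bricks/engine/engine/<path-to-file> /root/<path-to-file>.bak
>> rsync -av good-node:/gluster_bricks/engine/engine/<path-to-file> /gluster_bricks/engine/engine/<path-to-file>
>>
>> # then trigger a full heal on the volume
>> gluster volume heal engine full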
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Saturday, December 14, 2019, 9:18:44 PM GMT+2, Jayme <jaymef(a)gmail.com> wrote:
>>
>>
>> *Update*
>>
>> The situation has improved. All VMs and the engine are running. I'm now left with
>> about two heal entries in each GlusterFS storage volume that will not heal.
>>
>> In all cases the heal entries are related to an OVF_STORE image, and the problem
>> appears to be an issue with the gluster metadata for those OVF_STORE images. When I look
>> at the files shown in the 'gluster volume heal info' output I see question marks on the
>> .meta files, which indicates an attribute/gluster problem (even though there is no
>> split-brain), and I get an input/output error when attempting to do anything with the
>> files.
>>
>> If I look at the files on each host in /gluster_bricks they all look fine. I only
>> see question marks on the meta files when looking at the files in the /rhev mounts.
>>
>> Does anyone know how I can correct the attributes on these OVF_STORE files?
I've tried putting each host in maintenance and re-activating to re-mount gluster
volumes. I've also stopped and started all gluster volumes.
>>
>> I'm thinking I might be able to solve this by shutting down all VMs, placing all
>> hosts in maintenance, and safely restarting the entire cluster... but that may not be
>> necessary?
>>
>> On Fri, Dec 13, 2019 at 12:59 AM Jayme <jaymef(a)gmail.com> wrote:
>>>
>>> I believe I was able to get past this by stopping the engine volume, unmounting the
>>> GlusterFS engine mount on all hosts, and re-starting the volume. I was able to start the
>>> hosted engine on host0.
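>>>
>>> Roughly what I ran (the volume name and the mount path are from my environment; the
>>> unmount was done on every host):
>>>
>>> gluster volume stop engine
>>> umount /rhev/data-center/mnt/glusterSD/orchard0:_engine    # on every host; may need -l if busy
>>> gluster volume start engine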
>>>
>>> I'm still facing a few problems:
>>>
>>> 1. I'm still seeing this issue in each host's logs:
>>>
>>> Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning
for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID':
'00000000-0000-0000-0000-000000000000', 'storagedomainID':
'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID':
u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID':
u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201, message=Volume does
not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))
>>> Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Unable to identify
the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added
your first data domain for regular VMs
>>>
>>>
>>> 2. Most of my gluster volumes still have un-healed entries which I can't seem to
>>> heal. I'm not sure what the answer is here.
>>>
>>> On Fri, Dec 13, 2019 at 12:33 AM Jayme <jaymef(a)gmail.com> wrote:
>>>>
>>>> I was able to get the hosted engine started manually via virsh after re-creating a
>>>> missing symlink in /var/run/vdsm/storage -- I later shut it down and am still having the
>>>> same problem with the HA broker starting. It appears that the problem *might* be a corrupt
>>>> HA metadata file, although gluster is not reporting split-brain on the engine volume.
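>>>>
>>>> For reference, the symlink I re-created lives under /var/run/vdsm/storage/<sd-uuid>/
>>>> and, as far as I understand vdsm's layout, it points at the matching image directory on
>>>> the engine mount -- roughly this (image UUID left as a placeholder):
>>>>
>>>> ln -s /rhev/data-center/mnt/glusterSD/orchard0:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/<image-uuid> \
>>>>       /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/<image-uuid>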
>>>>
>>>> I'm seeing this:
>>>>
>>>> ls -al
/rhev/data-center/mnt/glusterSD/orchard0\:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
>>>> ls: cannot access
/rhev/data-center/mnt/glusterSD/orchard0:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/hosted-engine.metadata:
Input/output error
>>>> total 0
>>>> drwxr-xr-x. 2 vdsm kvm 67 Dec 13 00:30 .
>>>> drwxr-xr-x. 6 vdsm kvm 64 Aug 6 2018 ..
>>>> lrwxrwxrwx. 1 vdsm kvm 132 Dec 13 00:30 hosted-engine.lockspace ->
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
>>>> l?????????? ? ? ? ? ? hosted-engine.metadata
>>>>
>>>> This clearly shows some sort of issue with hosted-engine.metadata on the client
>>>> mount.
>>>>
>>>> On each node in /gluster_bricks I see this:
>>>>
>>>> # ls -al
/gluster_bricks/engine/engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
>>>> total 0
>>>> drwxr-xr-x. 2 vdsm kvm 67 Dec 13 00:31 .
>>>> drwxr-xr-x. 6 vdsm kvm 64 Aug 6 2018 ..
>>>> lrwxrwxrwx. 2 vdsm kvm 132 Dec 13 00:31 hosted-engine.lockspace ->
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
>>>> lrwxrwxrwx. 2 vdsm kvm 132 Dec 12 16:30 hosted-engine.metadata ->
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>>>>
>>>> ls -al
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>>>> -rw-rw----. 1 vdsm kvm 1073741824 Dec 12 16:48
/var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>>>>
>>>>
>>>> I'm not sure how to proceed at this point. Do I have data corruption, a gluster
>>>> split-brain issue, or something else? Maybe I just need to re-generate the metadata for
>>>> the hosted engine?
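>>>>
>>>> If regenerating the metadata is the way to go, I'm guessing at something along these
>>>> lines, with the cluster in global maintenance first (I haven't confirmed this is safe in
>>>> my current state):
>>>>
>>>> hosted-engine --set-maintenance --mode=global
>>>> # on each host, with ovirt-ha-agent stopped on that host:
>>>> hosted-engine --clean-metadata --host-id=<id> --force-clean
>>>> hosted-engine --set-maintenance --mode=none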
>>>>
>>>> On Thu, Dec 12, 2019 at 6:36 PM Jayme <jaymef(a)gmail.com> wrote:
>>>>>
>>>>> I'm running a three-server HCI cluster, up and running on 4.3.7 with no problems.
>>>>> Today I updated to 4.3.8. The engine upgraded fine and was rebooted. The first host
>>>>> updated fine; I rebooted it and let all gluster volumes heal. I put the second host in
>>>>> maintenance, upgraded it successfully, and rebooted. I waited over an hour for the
>>>>> gluster volumes to heal, but the heal process was not completing. I tried restarting the
>>>>> gluster services as well as the host, with no success.
>>>>>
>>>>> I'm in a state right now where there are pending heals on almost
all of my volumes. Nothing is reporting split-brain, but the heals are not completing.
>>>>>
>>>>> All VMs are still currently running except the hosted engine. The hosted engine was
>>>>> running, but on the 2nd host I upgraded I was seeing errors such as:
>>>>>
>>>>> Dec 12 16:34:39 orchard2 journal: ovirt-ha-agent
ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning
for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID':
'00000000-0000-0000-0000-000000000000', 'storagedomainID':
'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID':
u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID':
u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201, message=Volume does
not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))
>>>>>
>>>>> I shut down the engine VM and attempted a manual heal on the engine
volume. I cannot start the engine on any host now. I get:
>>>>>
>>>>> The hosted engine configuration has not been retrieved from shared
storage. Please ensure that ovirt-ha-agent is running and the storage server is
reachable.
>>>>>
>>>>> I'm seeing ovirt-ha-agent crashing on all three nodes:
>>>>>
>>>>> Dec 12 18:30:48 orchard0 python: detected unhandled Python exception
in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
>>>>> Dec 12 18:30:48 orchard0 abrt-server: Duplicate: core backtrace
>>>>> Dec 12 18:30:48 orchard0 abrt-server: DUP_OF_DIR:
/var/tmp/abrt/Python-2019-03-14-21:02:52-44318
>>>>> Dec 12 18:30:48 orchard0 abrt-server: Deleting problem directory
Python-2019-12-12-18:30:48-23193 (dup of Python-2019-03-14-21:02:52-44318)
>>>>> Dec 12 18:30:49 orchard0 vdsm[6087]: ERROR failed to retrieve Hosted
Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup
finished?
>>>>> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service: main
process exited, code=exited, status=1/FAILURE
>>>>> Dec 12 18:30:49 orchard0 systemd: Unit ovirt-ha-broker.service
entered failed state.
>>>>> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service failed.
>>>>> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service holdoff
time over, scheduling restart.
>>>>> Dec 12 18:30:49 orchard0 systemd: Cannot add dependency job for unit
lvm2-lvmetad.socket, ignoring: Unit is masked.
>>>>> Dec 12 18:30:49 orchard0 systemd: Stopped oVirt Hosted Engine High
Availability Communications Broker.
>>>>>
>>>>>
>>>>> Here is what 'gluster volume heal engine info' looks like; it's similar on the other
>>>>> volumes as well (although more heals are pending on some of those):
>>>>>
>>>>> gluster volume heal engine info
>>>>> Brick gluster0:/gluster_bricks/engine/engine
>>>>>
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
>>>>>
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
>>>>>
/d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
>>>>> Status: Connected
>>>>> Number of entries: 4
>>>>>
>>>>> Brick gluster1:/gluster_bricks/engine/engine
>>>>>
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
>>>>>
/d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
>>>>>
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
>>>>> Status: Connected
>>>>> Number of entries: 4
>>>>>
>>>>> Brick gluster2:/gluster_bricks/engine/engine
>>>>> Status: Connected
>>>>> Number of entries: 0
>>>>>
>>>>> I don't see much in vdsm.log, and the gluster logs look fairly normal to me; I'm not
>>>>> seeing any obvious errors in them.
>>>>>
>>>>> As far as I can tell the underlying storage is fine. Why are my gluster volumes not
>>>>> healing, and why is the self-hosted engine failing to start?
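>>>>>
>>>>> For what it's worth, these are roughly the checks I've been running against the
>>>>> engine volume (same story on the other volumes):
>>>>>
>>>>> # confirm the self-heal daemon is online on every node
>>>>> gluster volume status engine
>>>>> # manually trigger index and full heals
>>>>> gluster volume heal engine
>>>>> gluster volume heal engine full
>>>>> # the entries shown above still don't clear
>>>>> gluster volume heal engine info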
>>>>>
>>>>> More agent and broker logs:
>>>>>
>>>>> ==> agent.log <==
>>>>> MainThread::ERROR::2019-12-12
18:36:09,056::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Failed to start necessary monitors
>>>>> MainThread::ERROR::2019-12-12
18:36:09,058::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback
(most recent call last):
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
131, in _run_agent
>>>>> return action(he)
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
55, in action_proper
>>>>> return he.start_monitoring()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 432, in start_monitoring
>>>>> self._initialize_broker()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 556, in _initialize_broker
>>>>> m.get('options', {}))
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 89, in start_monitor
>>>>> ).format(t=type, o=options, e=e)
>>>>> RequestError: brokerlink - failed to start monitor via
ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network',
options: {'tcp_t_address': None, 'network_test': None,
'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>
>>>>> MainThread::ERROR::2019-12-12
18:36:09,058::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to
restart agent
>>>>> MainThread::ERROR::2019-12-12
18:36:19,619::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Failed to start necessary monitors
>>>>> MainThread::ERROR::2019-12-12
18:36:19,619::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback
(most recent call last):
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
131, in _run_agent
>>>>> return action(he)
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
55, in action_proper
>>>>> return he.start_monitoring()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 432, in start_monitoring
>>>>> self._initialize_broker()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 556, in _initialize_broker
>>>>> m.get('options', {}))
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 89, in start_monitor
>>>>> ).format(t=type, o=options, e=e)
>>>>> RequestError: brokerlink - failed to start monitor via
ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network',
options: {'tcp_t_address': None, 'network_test': None,
'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>
>>>>> MainThread::ERROR::2019-12-12
18:36:19,619::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to
restart agent
>>>>> MainThread::ERROR::2019-12-12
18:36:30,568::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Failed to start necessary monitors
>>>>> MainThread::ERROR::2019-12-12
18:36:30,570::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback
(most recent call last):
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
131, in _run_agent
>>>>> return action(he)
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
55, in action_proper
>>>>> return he.start_monitoring()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 432, in start_monitoring
>>>>> self._initialize_broker()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 556, in _initialize_broker
>>>>> m.get('options', {}))
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 89, in start_monitor
>>>>> ).format(t=type, o=options, e=e)
>>>>> RequestError: brokerlink - failed to start monitor via
ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network',
options: {'tcp_t_address': None, 'network_test': None,
'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>
>>>>> MainThread::ERROR::2019-12-12
18:36:30,570::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to
restart agent
>>>>> MainThread::ERROR::2019-12-12
18:36:41,581::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Failed to start necessary monitors
>>>>> MainThread::ERROR::2019-12-12
18:36:41,583::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback
(most recent call last):
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
131, in _run_agent
>>>>> return action(he)
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
55, in action_proper
>>>>> return he.start_monitoring()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 432, in start_monitoring
>>>>> self._initialize_broker()
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 556, in _initialize_broker
>>>>> m.get('options', {}))
>>>>> File
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 89, in start_monitor
>>>>> ).format(t=type, o=options, e=e)
>>>>> RequestError: brokerlink - failed to start monitor via
ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network',
options: {'tcp_t_address': None, 'network_test': None,
'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>
>>>>> MainThread::ERROR::2019-12-12
18:36:41,583::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to
restart agent
>>>>>
>>>>>