
On 07/22/2014 07:21 AM, Itamar Heim wrote:
On 07/22/2014 04:28 AM, Vijay Bellur wrote:
On 07/21/2014 05:09 AM, Pranith Kumar Karampuri wrote:
On 07/21/2014 02:08 PM, Jiri Moskovcak wrote:
On 07/19/2014 08:58 AM, Pranith Kumar Karampuri wrote:
On 07/19/2014 11:25 AM, Andrew Lau wrote:
On Sat, Jul 19, 2014 at 12:03 AM, Pranith Kumar Karampuri <pkarampu@redhat.com> wrote:
On 07/18/2014 05:43 PM, Andrew Lau wrote:

On Fri, Jul 18, 2014 at 10:06 PM, Vijay Bellur <vbellur@redhat.com> wrote:

[Adding gluster-devel]

On 07/18/2014 05:20 PM, Andrew Lau wrote:

Hi all,

As most of you have got hints from previous messages, hosted engine won't work on gluster. A quote from BZ1097639:

"Using hosted engine with Gluster backed storage is currently something we really warn against.

I think this bug should be closed or re-targeted at documentation, because there is nothing we can do here. Hosted engine assumes that all writes are atomic and (immediately) available for all hosts in the cluster. Gluster violates those assumptions."

I tried going through BZ1097639 but could not find much detail with respect to gluster there.

A few questions around the problem:

1. Can somebody please explain in detail the scenario that causes the problem?

2. Is hosted engine performing synchronous writes to ensure that writes are durable?

Also, if there is any documentation that details the hosted engine architecture, that would help in enhancing our understanding of its interactions with gluster.

Now my question: does this theory prevent a scenario of perhaps something like a gluster replicated volume being mounted as a glusterfs filesystem and then re-exported as the native kernel NFS share for the hosted-engine to consume? It could then be possible to chuck ctdb in there to provide a last-resort failover solution. I have tried it myself and suggested it to two people who are running a similar setup; they are now using the native kernel NFS server for hosted-engine and haven't reported as many issues. Curious, could anyone validate my theory on this?

If we obtain more details on the use case and gluster logs from the failed scenarios, we should be able to understand the problem better. That could be the first step in validating your theory or evolving further recommendations :).

I'm not sure how useful this is, but Jiri Moskovcak tracked this down in an off-list message.

Message Quote:

==

We were able to track it down to this (thanks Andrew for providing the testing setup):

-b686-4363-bb7e-dba99e5789b6/ha_agent service_type=hosted-engine'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
    response = "success " + self._dispatch(data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
    .get_all_stats_for_service_type(**options)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
    d = self.get_raw_stats_for_service_type(storage_dir, service_type)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
    f = os.open(path, direct_flag | os.O_RDONLY)
OSError: [Errno 116] Stale file handle: '/rhev/data-center/mnt/localhost:_mnt_hosted-engine/c898fd2a-b686-4363-bb7e-dba99e5789b6/ha_agent/hosted-engine.metadata'

Andrew/Jiri,
Would it be possible to post gluster logs of both the mount and the bricks on the bz? I can take a look at it once. If I gather nothing then probably I will ask for your help in re-creating the issue.
Pranith
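For reference, the call that fails in storage_broker.py above is a direct-I/O open of the shared metadata file followed by a read. A minimal sketch of that access pattern, with a simple retry when the handle has gone stale, might look like the following. This is illustrative only, not the actual ovirt-hosted-engine-ha code; the path is taken from the traceback above and assumes the NFS/GlusterFS mount is present.

import errno
import os

# Path copied from the traceback above; assumes the NFS/GlusterFS mount exists.
METADATA_PATH = ("/rhev/data-center/mnt/localhost:_mnt_hosted-engine/"
                 "c898fd2a-b686-4363-bb7e-dba99e5789b6/ha_agent/"
                 "hosted-engine.metadata")

# O_DIRECT is Linux-only; alignment requirements for O_DIRECT vary by
# filesystem, and this sketch assumes the target mount accepts unaligned reads.
DIRECT_FLAG = getattr(os, "O_DIRECT", 0)


def read_metadata(path, size=4096, retries=1):
    """Open the metadata file with direct I/O and read one block.

    If the file was replaced on the server (new inode/gfid), a cached handle
    can yield ESTALE; re-opening by path forces a fresh lookup.
    """
    for attempt in range(retries + 1):
        try:
            fd = os.open(path, DIRECT_FLAG | os.O_RDONLY)
            try:
                return os.read(fd, size)
            finally:
                os.close(fd)
        except OSError as e:
            if e.errno == errno.ESTALE and attempt < retries:
                continue  # the next os.open() repeats the path lookup
            raise

Whether a retry like this would be appropriate for the broker is a separate question; it is shown only to make the ESTALE path concrete.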
Unfortunately, I don't have the logs for that setup anymore. I'll try to replicate it when I get a chance. If I understand the comment from the BZ, I don't think it's a gluster bug per se, more just how gluster does its replication.
Hi Andrew, thanks for that. I couldn't come to any conclusions because no logs were available. It is unlikely that self-heal is involved, because there were no bricks going down/up according to the bug description.
Hi, I've never had such a setup. I guessed it was a problem with gluster based on "OSError: [Errno 116] Stale file handle:", which happens when a file opened by an application on the client gets removed on the server. I'm pretty sure we (hosted-engine) don't remove that file, so I think it's some gluster magic moving the data around...

Hi, without bricks going down/up or new bricks being added, data is not moved around by gluster unless a file operation triggers it. So I am still not sure why this happened.
Does hosted engine perform deletion & re-creation of the file <uuid>/ha_agent/hosted-engine.metadata in some operational sequence? In such a case, if this file is accessed via a stale gfid, ESTALE is possible.
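A small illustration of why that matters (hypothetical file name and mount path, not the hosted-engine code): replacing the file gives the path a new inode, and therefore a new gfid on gluster, so clients still holding the old handle can see ESTALE on their next access; rewriting the existing file in place keeps the handle valid.

import os
import tempfile

# Assumed NFS/GlusterFS mount and a hypothetical file name, for illustration.
MOUNT = "/rhev/data-center/mnt/localhost:_mnt_hosted-engine"
TARGET = os.path.join(MOUNT, "example.metadata")


def replace_file(path, data):
    """Write a temp file and rename it over the target.

    The path now refers to a new inode (new gfid on gluster); other clients
    that cached the old handle may get ESTALE on their next access.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(tmp, path)


def update_in_place(path, data, offset=0):
    """Overwrite the existing file without changing its inode/gfid.

    Handles held or cached by other clients remain valid.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)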
I see references to 2 hosted engines being operational in the bug report, and that makes me wonder if this is a likely scenario.
I am also curious to understand why NFS was chosen as the access method to the gluster volume. Isn't FUSE-based access a possibility here?
It is, but it wasn't enabled in the setup due to multiple reports at the time around gluster robustness with sanlock. IIUC, with replica 3 we should be in a much better place and can re-enable it (probably also validating it with replica 3?).
Yes, replica 3 would provide better protection against split-brains.

Thanks,
Vijay