On 01/31/2013 10:27 AM, Shu Ming wrote:
Royce,
That is good troubleshooting. See my comments about the IOError in
serve_client() below.
Royce Lv:
>
> Hi,
> I reproduced this issue, and I believe it's a Python bug.
> 1. How to reproduce:
> With the test case attached, put it under /usr/share/vdsm/tests/,
> run #./run_tests.sh superVdsmTests.py
> and the issue will be reproduced.
> 2. Log analysis:
> We notice a strange pattern in this log: connectStorageServer is
> called twice; the first supervdsm call succeeds, the second fails
> because of validateAccess().
> That is because the first validateAccess() call returns normally and
> leaves a child process behind. When the second validateAccess() call
> arrives and the multiprocessing manager is receiving the method
> message, that is just when the first child exits and SIGCHLD arrives.
> The signal interrupts the multiprocessing recv() system call; Python's
> managers.py should handle EINTR and retry recv(), as we do in vdsm,
> but it does not, so the second call raises an error.
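>
> For reference, the retry-on-EINTR pattern I mean looks roughly like
> this (a minimal sketch of the idea, not the actual vdsm helper):
>
>     import errno
>
>     def retrying_recv(conn):
>         """Keep calling conn.recv() until it completes without being
>         interrupted by a signal (e.g. SIGCHLD from an exiting child)."""
>         while True:
>             try:
>                 return conn.recv()
>             except IOError as e:
>                 if e.errno != errno.EINTR:
>                     raise
>                 # interrupted system call: retry instead of failing the request
>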
> >Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
> >Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
> >Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
> >Traceback (most recent call last):
> >  File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
> >    conObj.connect()
> >  File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
> >    return self._mountCon.connect()
> >  File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
> >    fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
> >  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
> >    (os.R_OK | os.X_OK))
> >  File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
> >    return callMethod()
> >  File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
> >    **kwargs)
> >  File "<string>", line 2, in validateAccess
> >  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
> >    raise convert_to_error(kind, result)
> The vdsm side receives a RemoteError because the supervdsm server's
> multiprocessing manager raises an error with kind='#TRACEBACK'.
> >RemoteError:
> The upper part is the traceback from the client side; the following
> part is from the server side:
> >---------------------------------------------------------------------------
> >Traceback (most recent call last):
> > File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in
serve_client
> > request = recv()
> >IOError: [Errno 4] Interrupted system call
> >---------------------------------------------------------------------------
>
> Corresponding Python source code: managers.py (server side):
>     def serve_client(self, conn):
>         '''
>         Handle requests from the proxies in a particular process/thread
>         '''
>         util.debug('starting server thread to service %r',
>                    threading.current_thread().name)
>         recv = conn.recv
>         send = conn.send
>         id_to_obj = self.id_to_obj
>         while not self.stop:
>             try:
>                 methodname = obj = None
>                 request = recv()  <------ this line was interrupted by SIGCHLD
>                 ident, methodname, args, kwds = request
>                 obj, exposed, gettypeid = id_to_obj[ident]
>                 if methodname not in exposed:
>                     raise AttributeError(
>                         'method %r of %r object is not in exposed=%r' %
>                         (methodname, type(obj), exposed)
>                         )
>                 function = getattr(obj, methodname)
>                 try:
>                     res = function(*args, **kwds)
>                 except Exception, e:
>                     msg = ('#ERROR', e)
>                 else:
>                     typeid = gettypeid and gettypeid.get(methodname, None)
>                     if typeid:
>                         rident, rexposed = self.create(conn, typeid, res)
>                         token = Token(typeid, self.address, rident)
>                         msg = ('#PROXY', (rexposed, token))
>                     else:
>                         msg = ('#RETURN', res)
>             except AttributeError:
>                 if methodname is None:
>                     msg = ('#TRACEBACK', format_exc())
>                 else:
>                     try:
>                         fallback_func = self.fallback_mapping[methodname]
>                         result = fallback_func(
>                             self, conn, ident, obj, *args, **kwds
>                             )
>                         msg = ('#RETURN', result)
>                     except Exception:
>                         msg = ('#TRACEBACK', format_exc())
>             except EOFError:
>                 util.debug('got EOF -- exiting thread serving %r',
>                            threading.current_thread().name)
>                 sys.exit(0)
>             except Exception:  <------ IOError/EINTR is not handled here; should retry recv()
>                 msg = ('#TRACEBACK', format_exc())
I am not sure I understand your point here.
Whenever this line is interrupted by a SIGCHLD,
    request = recv()  <------ this line was interrupted by SIGCHLD
the code flow should fall into
    except Exception:  <------ IOError/EINTR is not handled here; should retry recv()
        msg = ('#TRACEBACK', format_exc())
However, the big while loop is not stopped, so it should reach this line again.
Yep,
I'm not saying this error makes the BaseManager server totally
non-functional. The loop does not stop, but this call returns an error
message to vdsm and makes it raise RemoteError; see line 273:
    272        try:
    273            send(msg)
    274        except Exception as e:
(from http://hg.python.org/cpython/file/2c0197c95ec6/Lib/multiprocessing/manage...)
This makes validateAccess() fail every other time.
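
For context, this is roughly what the proxy side does with the server's
reply (a paraphrase of multiprocessing/managers.py, abbreviated; not the
verbatim stdlib source):

    from multiprocessing.managers import RemoteError

    def call_method(conn, request):
        """Sketch of the proxy side: one interrupted recv() on the server
        turns into a '#TRACEBACK' reply here, which becomes a RemoteError."""
        conn.send(request)
        kind, result = conn.recv()
        if kind == '#RETURN':
            return result
        if kind == '#TRACEBACK':
            raise RemoteError(result)  # this is what vdsm ends up seeing
        raise ValueError('unrecognised reply kind %r' % (kind,))
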
            try:
                methodname = obj = None
                request = recv()  <------ this line was interrupted by SIGCHLD
Then the interrupted recv() will be called again and again until a
recv() call is not interrupted. What is wrong here?
>
>
> 3. Actions we will take:
> (1) As a workaround we can first remove the zombie reaper from the
> supervdsm server (see the sketch below).
> (2) I'll check whether a later Python version has a fix for this.
> (3) Yaniv is working on changing the vdsm/svdsm communication channel
> to a pipe and handling it ourselves; I believe we'll get rid of this
> once that is handled properly.
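>
> To illustrate the idea behind (1) (an illustrative sketch only, not the
> actual zombie reaper code): children can be reaped by polling waitpid()
> from a thread instead of installing a SIGCHLD handler, so blocking calls
> such as conn.recv() are never interrupted:
>
>     import os
>     import threading
>     import time
>
>     def reap_children_forever(interval=2.0):
>         """Poll for exited children instead of reacting to SIGCHLD."""
>         while True:
>             try:
>                 pid, _ = os.waitpid(-1, os.WNOHANG)
>                 if pid == 0:      # children exist, but none has exited yet
>                     time.sleep(interval)
>             except OSError:       # ECHILD: no child processes at all
>                 time.sleep(interval)
>
>     # run it in a daemon thread:
>     # t = threading.Thread(target=reap_children_forever)
>     # t.daemon = True
>     # t.start()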
>
>
> On 01/25/2013 06:00 AM, Dead Horse wrote:
>> Tried some manual edits to the SD states in the database. The net result
>> was that I was able to get a node active. However, when reconstruction of
>> the master storage domain kicked in, it was unable to do so. It was also
>> not able to recognize the other SDs, with failure modes similar to the
>> unrecognized master above. I'm guessing the newer VDSM version broke
>> things pretty badly. So, as this is a test harness and the SD data is not
>> worth saving, I just wiped all the SDs, ran engine-cleanup, and started
>> fresh; all is well now.
>>
>> - DHC
>>
>>
>> On Thu, Jan 24, 2013 at 11:53 AM, Dead Horse <deadhorseconsulting@gmail.com> wrote:
>>
>> This test harness setup consists of two servers tied to NFS
>> storage via IB (NFS mounts are via IPoIB; NFS over RDMA is
>> disabled). All storage domains are NFS. The issue does occur
>> with both servers when attempting to bring them out of
>> maintenance mode, with the end result being non-operational due
>> to storage attach failure.
>>
>> The current issue is that with a working older commit the
>> master storage domain is "stuck" in the "locked" state, and I see a
>> secondary issue wherein VDSM cannot seem to find or contact the
>> master storage domain even though it is there. I can mount
>> the master storage domain manually, and all content appears
>> to be accounted for on either host.
>>
>> Here is the current contents of the master storage domain metadata:
>> CLASS=Data
>> DESCRIPTION=orgrimmar
>> IOOPTIMEOUTSEC=1
>> LEASERETRIES=3
>> LEASETIMESEC=5
>> LOCKPOLICY=
>> LOCKRENEWALINTERVALSEC=5
>> MASTER_VERSION=417
>> POOL_DESCRIPTION=Azeroth
>>
POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
>> POOL_SPM_ID=1
>> POOL_SPM_LVER=6
>> POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
>> REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
>> ROLE=Master
>> SDUUID=67534cca-1327-462a-b455-a04464084b31
>> TYPE=NFS
>> VERSION=3
>> _SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721
>>
>> So now, with the older working commit, I get the
>> "StoragePoolMasterNotFound: Cannot find master domain" error
>> (prior details above from when I worked backwards to that commit).
>>
>> This is odd as the nodes can definitely reach the master storage
>> domain:
>>
>> showmount from one of the el6.3 nodes:
>> [root@kezan ~]# showmount -e 192.168.0.1
>> Export list for 192.168.0.1:
>> /ovirt/orgrimmar 192.168.0.0/16
>>
>> mount/ls from one of the nodes:
>> [root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
>> [root@kezan ~]# ls -al
>> /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
>> total 1100
>> drwxr-xr-x 2 vdsm kvm 4096 Jan 24 11:44 .
>> drwxr-xr-x 5 vdsm kvm 4096 Oct 19 16:16 ..
>> -rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids
>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 inbox
>> -rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases
>> -rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata
>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 outbox
>>
>>
>> - DHC
>>
>>
>>
>> On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <ybronhei@redhat.com> wrote:
>>
>> On 01/24/2013 12:44 AM, Dead Horse wrote:
>>
>> I narrowed down on the commit where the originally reported issue
>> crept in: commit fc3a44f71d2ef202cff18d7203b9e4165b546621
>>
>> Building and testing with this commit or subsequent commits yields the original issue.
>>
>> Interesting.. it might be related to this commit, and we're
>> trying to reproduce it.
>>
>> Did you try removing that code and running again? Does it work
>> without the addition of zombieReaper?
>> Does the connectivity to the storage work well? When you run
>> 'ls' on the mounted folder, do you see the files without a
>> long delay? It might be related to too long a timeout when
>> validating access to this mount..
>> We are working on that.. any additional info can help
>>
>> Thanks.
>>
>>
>> - DHC
>>
>>
>> On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse <deadhorseconsulting@gmail.com> wrote:
>>
>> Indeed, reverting back to an older vdsm clears up the above
>> issue. However, now the issue I see is:
>> Thread-18::ERROR::2013-01-23 15:50:42,885::task::833::TaskManager.Task::(_setError) Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error
>>
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/storage/task.py", line 840, in _run
>>     return fn(*args, **kargs)
>>   File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
>>     res = f(*args, **kwargs)
>>   File "/usr/share/vdsm/storage/hsm.py", line 923, in connectStoragePool
>>     masterVersion, options)
>>   File "/usr/share/vdsm/storage/hsm.py", line 970, in _connectStoragePool
>>     res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
>>   File "/usr/share/vdsm/storage/sp.py", line 643, in connect
>>     self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
>>   File "/usr/share/vdsm/storage/sp.py", line 1167, in __rebuild
>>     self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
>>   File "/usr/share/vdsm/storage/sp.py", line 1506, in getMasterDomain
>>     raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
>> StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280, msdUUID=67534cca-1327-462a-b455-a04464084b31'
>> Thread-18::DEBUG::2013-01-23 15:50:42,887::task::852::TaskManager.Task::(_run) Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run: 08709e68-bcbc-40d8-843a-d69d4df40ac6 ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2, 'f90a0d1c-06ca-11e2-a05b-00151712f280', '67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task
>>
>> This is with vdsm built from commit 25a2d8572ad32352227c98a86631300fbd6523c1
>> - DHC
>>
>>
>> On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse <deadhorseconsulting@gmail.com> wrote:
>>
>> VDSM was built from:
>> commit 166138e37e75767b32227746bb671b1dab9cdd5e
>>
>> Attached is the full vdsm log
>>
>> I should also note that, from the engine's perspective, it sees the
>> master storage domain as locked and the others as unknown.
>>
>>
>> On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <danken@redhat.com> wrote:
>>
>> On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:
>>
>> Any ideas on this one? (from VDSM log):
>> Thread-25::DEBUG::2013-01-22 15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client [3.57.111.30]::call getCapabilities with () {}
>> Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed) cannot read ib0 speed
>> Traceback (most recent call last):
>>   File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 155, in speed
>>     s = int(file('/sys/class/net/%s/speed' % dev).read())
>> IOError: [Errno 22] Invalid argument
>>
>> Causes VDSM to fail to attach storage
>>
>>
>> I doubt that this is the cause of the
>> failure, as vdsm has always
>> reported "0" for ib devices, and still is.
>>
>> It happens only when you call getCapabilities.. so it is not related
>> to the flow, and it can't affect the storage.
>> Dan: I guess this is not the issue, but why the IOError?
>>
>>
>> Does a former version work with your Engine?
>> Could you share more of your vdsm.log? I suppose the culprit lies in
>> one of the storage-related commands, not in statistics retrieval.
>>
>>
>> Engine side sees:
>> ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper] (QuartzScheduler_Worker-96) [553ef26e] The connection with details 192.168.0.1:/ovirt/ds failed because of error code 100 and error message is: general exception
>> 2013-01-22 15:35:30,160 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-96) [1ab78378] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
>> 2013-01-22 15:35:30,200 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-96) [1ab78378] START, SetVdsStatusVDSCommand(HostName = kezan, HostId = 8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
>> 2013-01-22 15:35:30,211 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand, log id: 4af5c4cd
>> 2013-01-22 15:35:30,242 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value: storagepoolname
>>
>> Engine = latest master
>> VDSM = latest master
>>
>>
>> Since "latest master" is an unstable
>> reference by definition, I'm sure
>> that History would thank you if you post the
>> exact version (git hash?)
>> of the code.
>>
>> node = el6
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Yaniv Bronhaim.
>> RedHat, Israel
>> 09-7692289
>> 054-7744187
>>
>>
>>
>
>
>
>
--
---
舒明 Shu Ming
Open Virtualization Engineering; CSTL, IBM Corp.
Tel: 86-10-82451626  Tieline: 9051626
E-mail: shuming@cn.ibm.com or shuming@linux.vnet.ibm.com
Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian District, Beijing 100193, PRC