----- Original Message -----
From: "Adam Litke" <alitke(a)redhat.com>
To: "Nir Soffer" <nsoffer(a)redhat.com>
Cc: devel(a)ovirt.org, "Francesco Romani" <fromani(a)redhat.com>,
"Federico Simoncelli" <fsimonce(a)redhat.com>, "Dan
Kenigsberg" <danken(a)redhat.com>
Sent: Friday, November 21, 2014 4:46:13 PM
Subject: Re: contEIOVMs regression?
On 20/11/14 17:37 -0500, Nir Soffer wrote:
>----- Original Message -----
>> From: "Adam Litke" <alitke(a)redhat.com>
>> To: devel(a)ovirt.org
>> Cc: "Nir Soffer" <nsoffer(a)redhat.com>, "Francesco
Romani"
>> <fromani(a)redhat.com>, "Federico Simoncelli"
>> <fsimonce(a)redhat.com>, "Dan Kenigsberg"
<danken(a)redhat.com>
>> Sent: Thursday, November 20, 2014 9:15:33 PM
>> Subject: contEIOVMs regression?
>>
>> Hi list,
>>
>> I am taking a look at Bug 1157421 [1] which describes a situation
>> where VM's that are paused with an -EIO error are not automatically
>> resumed after the problem with storage has been corrected. I have
>> some patches [2] on gerrit that resolve the problem. Since this
>> appears to be a regression I am looking at a non-intrusive way to fix
>> it in the 3.5 branch. There is some disagreement on the proper way to
>> fix this so I am hoping we can arrive at a solution through an open
>> discussion.
>>
>> The main issue at hand is with the Event/Callback mechanism we use to
>> call clientIF.contEIOVMs. According to my experiments and this online
>> discussion [3] weakref does not work for instance methods such as
>> clientIF.contEIOVMs. Our Event class uses weakref to prevent it from
>> holding references to registered callback functions.
>
>Why making event system more correct is required to tix [1]?
>
>>
>> I see two easy ways to fix the regression:
>
>I don't follow, what is the regression?
Assuming that at some point contEIOVMs actually worked and was able to
automatically resume VMs, then we have a regression because given the
weakref problems I am describing herein there is no way that it is
working now. The only way we don't have a regression is if this code
has never worked to begin with.
The current code is master do work - when I fixed this last time, the problem
was that we did not register the callback before starting the monitors, and
that the monitors did not issue a state change on the first time a monitor
check the domain state.
I verified that contEIOVMs is called and that it does try to continue vms.
If this does not break now (with current code), please open a bug.
>> 1) Treat clientIF as a singleton class (which it is) and make
>> contEIOVMs a module-level method which gets the clientIF instance
>> and calls it's bound contEIOVMs method. See my patches [2] for the
>> code behind this idea.
>
>This is the wrong direction. There is only one place using that horrible
>getInstance(), and it also could just create the single instance that we
>need. We should remove getInstance() instead of using it in new code.
>
>>
>> 2) Allow Event to maintain a strong reference on the bound
>> clientIF.contEIOVMs method. This will allow the current code to work
>> as designed but will change the Event implementation to accomodate
>> this specific use case. Since no one else appears to be using this
>> code, it should have no functional impact.
>
>The code is already holding a strong reference now, no change is
>needed :-)
I disagree. From vdsm/storage/misc.py:
class Event(object):
...
def register(self, func, oneshot=False):
with self._syncRoot:
self._registrar[id(func)] = (weakref.ref(func), oneshot)
...
# ^^^ "He's dead Jim"
The function is converted into a weak reference. Since, in this case,
the function is an instance method, the reference is immediately dead
on arrival. I have verified this with debugging statements in my
environment.
So you suggest that taking a weakref to an instance method returns
a dead reference?
I thought that the problem is instance method keep hard reference to
the instance, so the weakref is useless.
Nir