Re: [ovirt-devel] contEIOVMs regression?

21 Nov 2014

      ----- Original Message -----
...
From: "Adam Litke" <alitke@redhat.com>
To: "Nir Soffer" <nsoffer@redhat.com>
Cc: devel@ovirt.org, "Francesco Romani" <fromani@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan
Kenigsberg" <danken@redhat.com>
Sent: Friday, November 21, 2014 4:46:13 PM
Subject: Re: contEIOVMs regression?
On 20/11/14 17:37 -0500, Nir Soffer wrote:
...
----- Original Message -----
...
From: "Adam Litke" <alitke@redhat.com>
To: devel@ovirt.org
Cc: "Nir Soffer" <nsoffer@redhat.com>, "Francesco Romani"
<fromani@redhat.com>, "Federico Simoncelli"
<fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com>
Sent: Thursday, November 20, 2014 9:15:33 PM
Subject: contEIOVMs regression?
Hi list,
I am taking a look at Bug 1157421 [1] which describes a situation
where VM's that are paused with an -EIO error are not automatically
resumed after the problem with storage has been corrected.  I have
some patches [2] on gerrit that resolve the problem.  Since this
appears to be a regression I am looking at a non-intrusive way to fix
it in the 3.5 branch.  There is some disagreement on the proper way to
fix this so I am hoping we can arrive at a solution through an open
discussion.
The main issue at hand is with the Event/Callback mechanism we use to
call clientIF.contEIOVMs.  According to my experiments and this online
discussion [3] weakref does not work for instance methods such as
clientIF.contEIOVMs.  Our Event class uses weakref to prevent it from
holding references to registered callback functions.
Why making event system more correct is required to tix [1]?
...
I see two easy ways to fix the regression:
I don't follow, what is the regression?
Assuming that at some point contEIOVMs actually worked and was able to
automatically resume VMs, then we have a regression because given the
weakref problems I am describing herein there is no way that it is
working now.  The only way we don't have a regression is if this code
has never worked to begin with.
The current code is master do work - when I fixed this last time, the problem
was that we did not register the callback before starting the monitors, and 
that the monitors did not issue a state change on the first time a monitor
check the domain state.

I verified that contEIOVMs is called and that it does try to continue vms.

If this does not break now (with current code), please open a bug.
...
...
...
1) Treat clientIF as a singleton class (which it is) and make
contEIOVMs a module-level method which gets the clientIF instance
and calls it's bound contEIOVMs method.  See my patches [2] for the
code behind this idea.
This is the wrong direction. There is only one place using that horrible
getInstance(), and it also could just create the single instance that we
need. We should remove getInstance() instead of using it in new code.
...
2) Allow Event to maintain a strong reference on the bound
clientIF.contEIOVMs method.  This will allow the current code to work
as designed but will change the Event implementation to accomodate
this specific use case.  Since no one else appears to be using this
code, it should have no functional impact.
The code is already holding a strong reference now, no change is
needed :-)
I disagree.  From vdsm/storage/misc.py:
class Event(object):
...
    def register(self, func, oneshot=False):
        with self._syncRoot:
            self._registrar[id(func)] = (weakref.ref(func), oneshot)
...
            # ^^^ "He's dead Jim"
The function is converted into a weak reference.  Since, in this case,
the function is an instance method, the reference is immediately dead
on arrival.  I have verified this with debugging statements in my
environment.
So you suggest that taking a weakref to an instance method returns
a dead reference?

I thought that the problem is instance method keep hard reference to
the instance, so the weakref is useless.

Nir