Hi list,
I am taking a look at Bug 1157421 [1] which describes a situation
where VM's that are paused with an -EIO error are not automatically
resumed after the problem with storage has been corrected. I have
some patches [2] on gerrit that resolve the problem. Since this
appears to be a regression I am looking at a non-intrusive way to fix
it in the 3.5 branch. There is some disagreement on the proper way to
fix this so I am hoping we can arrive at a solution through an open
discussion.
The main issue at hand is with the Event/Callback mechanism we use to
call clientIF.contEIOVMs. According to my experiments and this online
discussion [3] weakref does not work for instance methods such as
clientIF.contEIOVMs. Our Event class uses weakref to prevent it from
holding references to registered callback functions.
I see two easy ways to fix the regression:
1) Treat clientIF as a singleton class (which it is) and make
contEIOVMs a module-level method which gets the clientIF instance
and calls it's bound contEIOVMs method. See my patches [2] for the
code behind this idea.
2) Allow Event to maintain a strong reference on the bound
clientIF.contEIOVMs method. This will allow the current code to work
as designed but will change the Event implementation to accomodate
this specific use case. Since no one else appears to be using this
code, it should have no functional impact.
Are there any other ideas I'm missing? I am aware of plans to
refactor this code for 3.6 but I am more interested in a short-term,
practical solution to address the current regression.
Thanks for offering your insight on this problem.
[1]
https://bugzilla.redhat.com/show_bug.cgi?id=1157421
[2]
http://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:...
--
Adam Litke