Re: [ovirt-devel] [VDSM] cleaning statistics retrieval

10 Apr 2014

      On Apr 10, 2014, at 12:25 , Dan Kenigsberg <danken@redhat.com> wrote:
...
On Wed, Apr 09, 2014 at 08:40:24AM -0400, Francesco Romani wrote:
...
----- Original Message -----
...
From: "Dan Kenigsberg" <danken@redhat.com>
To: "Francesco Romani" <fromani@redhat.com>
Cc: "vdsm-devel" <vdsm-devel@lists.fedorahosted.org>, devel@ovirt.org
Sent: Tuesday, April 8, 2014 6:57:15 PM
Subject: Re: cleaning statistics retrieval
On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
...
Hello VDSM developers,
Please use devel@ovirt, and mention "vdsm" at the subject. This thread
in particular involves Engine as well.
Right, I forgot. Sorry about that.
...
...
I'd like to discuss and explain the plans for cleaning up Vm.getStats()
in vdsm/virt/vm.py, and how it affects a bug we have:
https://bugzilla.redhat.com/show_bug.cgi?id=1073478
Let's start from the bug.
To make a long story short, there is a (small) race in VDSM, probably
introduced by commit
8fedf8bde3c28edb07add23c3e9b72681cb48e49
The race can actually be triggered only if the VM is shutting down, so it
is not that bad.
Fixing the reported issue in the specific case can be done with a trivial
if, and that it what I did
in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
Could you explain how an AttributeError there moved the VM to Down?
This should actually be this bug of engine https://bugzilla.redhat.com/show_bug.cgi?id=1072282
if GetVmStats fails for whatever reason, engine thinks the VM is down.
Could you rename the Vdsm bug to exprese the in-vdsm problem, and make
clear that it confuses older Engines to think the Vm is Down?
...
...
...
And this is the core of the issue.
My initial idea is to either return success and a complete, well formed
statistics set, or return an error.
However current engine seems to not cope properly with this, and we cannot
break backward compatibility.
Would you be more precise? If getAllVmStats returns an errCode for one
of the Vms, what happens?
Of course.
For GetAllVmStats, AFAIK, but please correct me if I am wrong, because I'm not really expert on engine side,
engine simply does not expects anything different from a list
of either a RunningVmStats or an ExitedVmStats.
So not sure (will verify just after this mail) if engine can cope with mixed content,
meaning [ stats, errCode, stats, stats ... ]
For GetVmStats, like Michal said, the engine expects the call to succeed otherwise
it puts the VM into the Unknown state.
...
...
Looks like the only way to go is to always return success and to add a
field to describe the content of the
statistics (partial, complete...). THis is admittedly a far cry from the
ideal solution, but it is dictated
by the need to preserve the compatibility with current/old engines.
I don't think that I understand your suggestion, but it does not sound
right to send a partial dictionary and to "appologize" about its
paritiality elsewhere. The dictionary may be "partial" for engine-4.0
yet "complete" for engine-3.5. It's not for Vdsm to grade its own
output.
I see your point (that's one of the reasons I'm not happy about this solution).
Please see below for the detauls.
...
...
please note that I'm not really happy about this solution, but, given the
constraint, I don't see better alternatives.
Please explain the benefits of describing the partial content, as I do
not see them.
The root issue here is the getStats must always succeed, because the engine doesn't
expect otherwise and thus we guarantee this to cope with old engines;
but inside VDSM, getStats can actually fail in a lot of places
(like in this case getBalloonInfo)
So, in VDSM we can end up with a partial response, and now the question
is: what to do with this partial response? And if there are optional fields
in the response, how can the engine distinguish between
* full response, but with an optional field missing
and
* partial response (because of an exception inside VDSM),
 without some field, incidentally including an optional one
...
?
The VDSM 'grading' was an hint from VDSM to help engine to distinguish
between those cases.
Even if we agree this grading idea is bad, the core issue remains open:
what to do if we end up with a partial response?
For example, let's say we handle the getBalloonInfo exception (http://gerrit.ovirt.org/#/c/26539/),
the stats object to be returned will lack
* the (mandatory, expected) balloon stats
* the (optional) migration progress - ok, bad example because this makes sense only during migrations,
 but other optional fields may be added later and the issue remains
Again, anyone feel free to correct me if I misunderstood something about engine
(or VDSM <=> engine communication) and to suggest better alternatives :\
Currently, we have way too many try-except-Exception clauses in our
code. They swallow everything: from expected libvirt errors to
unexpected syntax errors. We should eliminate them, not add more.
Mandatory stuff must be reported, so if we fail extracting them
we'd better explode and raise an error. Optional stuff are optional, so
we could drop them from the output, in order to report the mandatory
ones.
http://gerrit.ovirt.org/#/c/26539/2/vdsm/virt/vm.py currently suggest
that Vdsm lie when it fails to extract the current ballon size, and say
that it's 0. I'd prefer to drop balloon_cur from the return value of
_getBalloonInfo or drop balloonInfo from the reported stats.
it does make a lot of sense….but my only worry is the backward compatibility.
E.g. for the balloon the current engine code,AFAICT, thinks that the device is not present when we exclude balloonInfo, whereas reporting 0 in balloon_cur still means the balloon is still there
I'm not sure if we can ever find out all possible regressions all the way to 3.0

Then we are "just" speaking about the 4.0 solution.
...
My only question is the granularity of the definition: is balloonInfo
atomic, or can it be reported without balloon_cur? Should it? I'd prefer
to have the "atoms" as big as possible, to limit the number of
combinations - if self._dom is None, don't report balloonInfo at all,
and so if the libvirt connection timed out.
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [VDSM] cleaning statistics retrieval

Michal Skrivanek