[ovirt-devel] [VDSM] cleaning statistics retrieval
Dan Kenigsberg
danken at redhat.com
Thu Apr 10 10:25:33 UTC 2014
On Wed, Apr 09, 2014 at 08:40:24AM -0400, Francesco Romani wrote:
> ----- Original Message -----
> > From: "Dan Kenigsberg" <danken at redhat.com>
> > To: "Francesco Romani" <fromani at redhat.com>
> > Cc: "vdsm-devel" <vdsm-devel at lists.fedorahosted.org>, devel at ovirt.org
> > Sent: Tuesday, April 8, 2014 6:57:15 PM
> > Subject: Re: cleaning statistics retrieval
> >
> > On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
> > > Hello VDSM developers,
> >
> > Please use devel at ovirt, and mention "vdsm" at the subject. This thread
> > in particular involves Engine as well.
>
> Right, I forgot. Sorry about that.
>
> > > I'd like to discuss and explain the plans for cleaning up Vm.getStats()
> > > in vdsm/virt/vm.py, and how it affects a bug we have:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1073478
> > >
> > > Let's start from the bug.
> > >
> > > To make a long story short, there is a (small) race in VDSM, probably
> > > introduced by commit
> > > 8fedf8bde3c28edb07add23c3e9b72681cb48e49
> > > The race can actually be triggered only if the VM is shutting down, so it
> > > is not that bad.
> > >
> > > Fixing the reported issue in the specific case can be done with a trivial
> > > if, and that it what I did
> > > in my initial http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
> >
> > Could you explain how an AttributeError there moved the VM to Down?
>
> This should actually be this bug of engine https://bugzilla.redhat.com/show_bug.cgi?id=1072282
> if GetVmStats fails for whatever reason, engine thinks the VM is down.
Could you rename the Vdsm bug to exprese the in-vdsm problem, and make
clear that it confuses older Engines to think the Vm is Down?
>
> > > And this is the core of the issue.
> > > My initial idea is to either return success and a complete, well formed
> > > statistics set, or return an error.
> > > However current engine seems to not cope properly with this, and we cannot
> > > break backward compatibility.
> > Would you be more precise? If getAllVmStats returns an errCode for one
> > of the Vms, what happens?
>
> Of course.
>
> For GetAllVmStats, AFAIK, but please correct me if I am wrong, because I'm not really expert on engine side,
> engine simply does not expects anything different from a list
> of either a RunningVmStats or an ExitedVmStats.
>
> So not sure (will verify just after this mail) if engine can cope with mixed content,
> meaning [ stats, errCode, stats, stats ... ]
>
> For GetVmStats, like Michal said, the engine expects the call to succeed otherwise
> it puts the VM into the Unknown state.
>
> >
> > >
> > > Looks like the only way to go is to always return success and to add a
> > > field to describe the content of the
> > > statistics (partial, complete...). THis is admittedly a far cry from the
> > > ideal solution, but it is dictated
> > > by the need to preserve the compatibility with current/old engines.
> >
> > I don't think that I understand your suggestion, but it does not sound
> > right to send a partial dictionary and to "appologize" about its
> > paritiality elsewhere. The dictionary may be "partial" for engine-4.0
> > yet "complete" for engine-3.5. It's not for Vdsm to grade its own
> > output.
>
> I see your point (that's one of the reasons I'm not happy about this solution).
> Please see below for the detauls.
>
> > > please note that I'm not really happy about this solution, but, given the
> > > constraint, I don't see better alternatives.
> >
> > Please explain the benefits of describing the partial content, as I do
> > not see them.
>
> The root issue here is the getStats must always succeed, because the engine doesn't
> expect otherwise and thus we guarantee this to cope with old engines;
> but inside VDSM, getStats can actually fail in a lot of places
> (like in this case getBalloonInfo)
>
> So, in VDSM we can end up with a partial response, and now the question
> is: what to do with this partial response? And if there are optional fields
> in the response, how can the engine distinguish between
>
> * full response, but with an optional field missing
>
> and
>
> * partial response (because of an exception inside VDSM),
> without some field, incidentally including an optional one
>
> ?
>
> The VDSM 'grading' was an hint from VDSM to help engine to distinguish
> between those cases.
>
> Even if we agree this grading idea is bad, the core issue remains open:
> what to do if we end up with a partial response?
> For example, let's say we handle the getBalloonInfo exception (http://gerrit.ovirt.org/#/c/26539/),
> the stats object to be returned will lack
>
> * the (mandatory, expected) balloon stats
> * the (optional) migration progress - ok, bad example because this makes sense only during migrations,
> but other optional fields may be added later and the issue remains
>
> Again, anyone feel free to correct me if I misunderstood something about engine
> (or VDSM <=> engine communication) and to suggest better alternatives :\
Currently, we have way too many try-except-Exception clauses in our
code. They swallow everything: from expected libvirt errors to
unexpected syntax errors. We should eliminate them, not add more.
Mandatory stuff must be reported, so if we fail extracting them
we'd better explode and raise an error. Optional stuff are optional, so
we could drop them from the output, in order to report the mandatory
ones.
http://gerrit.ovirt.org/#/c/26539/2/vdsm/virt/vm.py currently suggest
that Vdsm lie when it fails to extract the current ballon size, and say
that it's 0. I'd prefer to drop balloon_cur from the return value of
_getBalloonInfo or drop balloonInfo from the reported stats.
My only question is the granularity of the definition: is balloonInfo
atomic, or can it be reported without balloon_cur? Should it? I'd prefer
to have the "atoms" as big as possible, to limit the number of
combinations - if self._dom is None, don't report balloonInfo at all,
and so if the libvirt connection timed out.
More information about the Devel
mailing list