On Apr 9, 2014, at 14:21 , Dan Kenigsberg <danken(a)redhat.com> wrote:
> On Wed, Apr 09, 2014 at 01:11:22PM +0200, Michal Skrivanek wrote:
>>
>> On Apr 8, 2014, at 18:57 , Dan Kenigsberg <danken(a)redhat.com> wrote:
>>
>>> On Tue, Apr 08, 2014 at 06:52:50AM -0400, Francesco Romani wrote:
>>>> Hello VDSM developers,
>>>
>>> Please use devel@ovirt, and mention "vdsm" at the subject. This
thread
>>> in particular involves Engine as well.
>>>
>>>>
>>>> I'd like to discuss and explain the plans for cleaning up
Vm.getStats()
>>>> in vdsm/virt/vm.py, and how it affects a bug we have:
https://bugzilla.redhat.com/show_bug.cgi?id=1073478
>>>>
>>>> Let's start from the bug.
>>>>
>>>> To make a long story short, there is a (small) race in VDSM, probably
introduced by commit
>>>> 8fedf8bde3c28edb07add23c3e9b72681cb48e49
>>>> The race can actually be triggered only if the VM is shutting down, so
it is not that bad.
>>>>
>>>> Fixing the reported issue in the specific case can be done with a
trivial if, and that it what I did
>>>> in my initial
http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm
>>>
>>> Could you explain how an AttributeError there moved the VM to Down?
>>>
>>>>
>>>> However, this is just a bandaid and a proper solution is needed. This is
the reason why
>>>> the following versions of
http://gerrit.ovirt.org/#/c/25803 changed
direction toward a more comprehensive
>>>> approach.
>>>>
>>>> And this is the core of the issue.
>>>> My initial idea is to either return success and a complete, well formed
statistics set, or return an error.
>>>> However current engine seems to not cope properly with this, and we
cannot break backward compatibility.
>>>
>>> Would you be more precise? If getAllVmStats returns an errCode for one
>>> of the Vms, what happens?
>>
>> VM moves to Unknown after some time
>>
>>>
>>>>
>>>> Looks like the only way to go is to always return success and to add a
field to describe the content of the
>>>> statistics (partial, complete...). THis is admittedly a far cry from the
ideal solution, but it is dictated
>>>> by the need to preserve the compatibility with current/old engines.
>>>
>>> I don't think that I understand your suggestion, but it does not sound
>>> right to send a partial dictionary and to "appologize" about its
>>> paritiality elsewhere. The dictionary may be "partial" for
engine-4.0
>>> yet "complete" for engine-3.5. It's not for Vdsm to grade its
own
>>> output.
>>>
>>>>
>>>> Moreover, I would like to take the chance and cleanup/refactor the
statistics collection. I already started
>>>> adding the test infrastructure:
http://gerrit.ovirt.org/#/c/26536/
>>>>
>>>> To summarize, what I suggest to do is:
>>>>
>>>> * fix
https://bugzilla.redhat.com/show_bug.cgi?id=1073478 using a simple
ugly fix like the original
>>>>
http://gerrit.ovirt.org/#/c/25803/1/vdsm/vm.py,cm (which I'll
resubmit as new change)
>>>> * refactor and clean getStats() and friends
>>>> * on the cleaner base, properly fix the statistics collection by let
getStats() always succeed and return
>>>> possibly partial stats, with a new field describing the content
>>>>
>>>> please note that I'm not really happy about this solution, but,
given the constraint, I don't see better
>>>> alternatives.
>>>
>>> Please explain the benefits of describing the partial content, as I do
>>> not see them.
>>
>> the main reason for me is that currently if we fail to fetch some parts of the
stats we still return what we could get. E.g.
>> try:
>> stats.update(self.guestAgent.getGuestInfo())
>> except Exception:
>> return stats
>>
>> currently in engine we can only tell it's partial by not seeing the missing
bits…
>> it makes sense since for lifecycle decisions we care for the VM status, but we
don't necessarily care for other stuff
>
> Engine knows that the data is partial, since it sees only parts of it.
> What is the benefit of Vdsm declaring it as partial?
there are a lot of variable fields so it's not that easy to say if it's partial
or not. It does make sense to let "someone" know that there is an issue and
you're getting partial data, yet you don't want to make any automatic actions...
But Vdsm cannot make this decision. Soon, Vdsm is to report the host's
boot time. Now assume that Vdsm fails to do so. Is the stats "partial"?
It's partial for engine-3.5, but it's complete for engine-3.4.
Vdsm should tell as much of the truth that it knows.
We could extend the "alerts" mechanism to report non-lethal errors in
getVmStats (we currently have it only in for storage domain status),
where Engine is told what's missing and why. I'm not sure if this is
really needed, though.