[Engine-devel] Proposal VDSM <=> Engine Data Statistics Retrieval Optimization
Dan Kenigsberg
danken at redhat.com
Thu Mar 7 22:11:20 UTC 2013
On Thu, Mar 07, 2013 at 12:25:54PM +0100, Vinzenz Feenstra wrote:
> Please find the prettier version on the wiki:
> http://www.ovirt.org/Proposal_VDSM_-_Engine_Data_Statistics_Retrieval
>
>
> Proposal VDSM - Engine Data Statistics Retrieval
>
>
> VDSM <=> Engine data retrieval optimization
>
>
> Motivation:
>
> Currently the RHEVM engine is polling the a lot of data from VDSM
> every 15 seconds. This should be optimized and the amount of data
> requested should be more specific.
It feels like a good idea, but do you have numbers? How much traffic
would be saved? Remember the added computation incurred on each host -
there's always a price to pay.
>
> For each VM the data currently contains much more information than
> actually needed which blows up the size of the XML content quite
> big. We could optimize this by splitting the reply on the getVmStats
> based on the request of the engine into sections. For this reason
> Omer Frenkel and me have split up the data into parts based on their
> usage.
>
> This data can and usually does change during the lifetime of the VM.
>
>
> Rarely Changed:
>
> This data is change not very frequent and it should be enough to
> update this only once in a while. Most commonly this data changes
> after changes made in the UI or after a migration of the VM to
> another Host.
>
> *Status* = Running
Status does not change much, but when it does, it is important to report
that quickly.
> *acpiEnable* = true
> *vmType* = kvm
> *guestName* = W864GUESTAGENTT
> *displayType* = qxl
> *guestOs* = Win 8
> *kvmEnable* = true #/*this should be constant and never changed*/
> *pauseCode* = NOERR
> *monitorResponse* = 0
> *session* = Locked # unused
> *netIfaces* = [{'name': 'Realtek RTL8139C+ Fast Ethernet NIC', 'inet6': ['fe80::490c:92bb:bbcc:9f87'], 'inet': ['10.34.60.148'], 'hw': '00:1a:4a:22:3c:db'}]
> *appsList* = ['RHEV-Tools 3.2.4', 'RHEV-Agent64 3.2.3', 'RHEV-Serial64 3.2.3', 'RHEV-Network64 3.2.2', 'RHEV-Network64 3.2.3', 'RHEV-Block64 3.2.3', 'RHEV-Balloon64 3.2.3', 'RHEV-Balloon64 3.2.2', 'RHEV-Agent64 3.2.2', 'RHEV-USB 3.2.3', 'RHEV-Block64 3.2.2', 'RHEV-Serial64 3.2.2']
> *pid* = 11314
> *guestIPs* = 10.34.60.148 # duplicated info
> *displayIp* = 0
> *displayPort* = 5902
> *displaySecurePort* = 5903
> *username* = user at W864GUESTAGENTT
> *clientIp* =
> *lastLogin* = 1361976900.67
>
>
> Often Changed:
>
> This data is changed quite often however it is not necessary to
> update this data every 15 seconds. As this is cumulative data and
> reflects the current status, and it does not need to be snapshotted
> every 15 seconds to retrieve statistics. The data can be retrieved
> in much more generous time slices. (e.g. Every 5 minutes)
>
> *network* = {'vnet1': {'macAddr': '00:1a:4a:22:3c:db', 'rxDropped': '0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '100', 'name': 'vnet1'}}
> *disksUsage* = [{'path': 'c:\\', 'total': '64055406592', 'fs': 'NTFS', 'used': '19223846912'}, {'path': 'd:\\', 'total': '3490912256', 'fs': 'UDF', 'used': '3490912256'}]
> *timeOffset* = 14422
> *elapsedTime* = 68591
> *hash* = 2335461227228498964
> *statsAge* = 0.09 # unused
>
>
> Often Changed but unused
>
> This data does not seem to be used in the engine at all. It is *not*
> even used in the data warehouse.
>
> *memoryStats* = {'swap_out': '0', 'majflt': '0', 'mem_free': '1466884', 'swap_in': '0', 'pageflt': '0', 'mem_total': '2096736', 'mem_unused': '1466884'}
> *balloonInfo* = {'balloon_max': 2097152, 'balloon_cur': 2097152}
> *disks* = {'vda': {'readLatency': '0', 'apparentsize': '64424509440', 'writeLatency': '1754496', 'imageID': '28abb923-7b89-4638-84f8-1700f0b76482', 'flushLatency': '156549', 'readRate': '0.00', 'truesize': '18855059456', 'writeRate': '952.05'}, 'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}
I am pretty sure that {read,write,flush}Latency is collected and
reported by Engine. `git grep writeLatency` reinforces my vague memory.
>
>
> Very frequent uppdates needed by webadmin portal:
>
> This data is mostly needed for the webadmin portal and might be
> required to be updated quite often. An exception here is the
> statsAge field, which seems to be unused by the Engine. This data
> could be requested every 15 seconds to keep things as they are now.
>
> *cpuSys* = 2.32
> *cpuUser* = 1.34
> *memUsage* = 30
>
>
> Proposed Solution for VDSM & Engine:
>
> We will introduce new optional parameters to getVmStats,
> getAllVmStats and list to allow a finer grained specification of
> data which should be included.
>
> *Parameter:* *statsType*=/*<string>*/ (getVmStats, getAllVmStats
> only) *Allowed values:*
>
> * full (default to keep backwards compatibility)
> * app-list (Just send the application list)
> * rare (include everything from rarely changed to very frequent)
> * often (include everything from often changed to very frequent)
> * frequent (only send the very frequently changed items)
I think that a nice way to think of this, is that Engine ask for a set
of keys it is interested about. Asking for getVmStats(keys=[displayType,
netIfaces]) would return only the requrested values of the VM. "full",
"rare", "often" and "frequent" are simply pre-defined sets of key names.
A side effect of this pov is that we can avoid the vague name
"statsType".
>
>
> *Parameter:* *clientId*=*<string>* The client id is specified by the
> client and should be unique however constantly used.
>
> *Parameter:* *diff*=*<boolean>* In combination with the clientId
> VDSM will send only differences to the previous request from the
> named clientId. (if diff=true)
The semantics of "diff" is not completely defined: how about complex
structures like that of "network"? It is most likely to be reported
every time.
Since this requires a caching mechanism on vdsm side, Engine must expect
that the cache may be evicted in any moment, and that a full list is
received.
>
>
> Additional Change:
>
> Besides the introduction of the new parameters for list, getVmStats
> and getAllVmStats it might make sense to include a hash for the
> appList into the rarely changed section of the response which would
> allow to identify changes and avoid having to sent the complete
> appList every so often and only if the hash known to the client is
> outdated.
>
> *Note:* The appList (Application List) reported by the guest agent
> could be fully implemented on request only, as long as the guest
> agent installed supports this. As there seems to be a request to
> have the complete list of installed applications on all guests this
> data could be quite extensive and a huge list. On the other hand
> this data is only rarely visible and therefore it should not be
> requested all the time and only on demand.
>
>
> Improvement of the Guest Agent:
>
> As part of the proposed solution it is necessary to improve the
> guest agent as well.
Improving the agent may be a good idea, but I do not see the necessity
in it. It's also important to improve the horrible multithreaded
vdsm/libvirt statistics acquisition, but just as unrelated to the core
of this feature.
> For the full application list there should be
> implemented a caching system which will be fully reactive and should
> not poll the application list for example all the time. The guest
> can create a prepared data file containing all data in the JSON
> format (as used for the communication with VDSM via VIO) and just
> have to read that file from disk and directly sends it to VDSM.
> However it is quite possible that this list is to big and it might
> have to be chunked into pieces. (Multiple messages, which would have
> to be supported by VDSM then as well) The solution for this is to
> make VDSM request this data and it will retrieve the data necessary
> on request only.
More information about the Engine-devel
mailing list