[Engine-devel] [vdsm] Proposal VDSM <=> Engine Data Statistics Retrieval Optimization

Fri Mar 8 02:30:50 UTC 2013

On 03/08/2013 06:11 AM, Dan Kenigsberg wrote:
> On Thu, Mar 07, 2013 at 12:25:54PM +0100, Vinzenz Feenstra wrote:
>> Please find the prettier version on the wiki:
>> http://www.ovirt.org/Proposal_VDSM_-_Engine_Data_Statistics_Retrieval
>>
>>
>>   Proposal VDSM - Engine Data Statistics Retrieval
>>
>>
>>     VDSM <=> Engine data retrieval optimization
>>
>>
>>       Motivation:
>>
>> Currently the RHEVM engine is polling a lot of data from VDSM
>> every 15 seconds. This should be optimized and the amount of data
>> requested should be more specific.
> It feels like a good idea, but do you have numbers? How much traffic
> would be saved? Remember the added computation incurred on each host -
> there's always a price to pay.
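One rough way to get such numbers is to serialize a full reply and a reduced one and compare the sizes. The sketch below is only an illustration: it uses Python's standard xmlrpc.client marshaller, the helper name reply_size is made up, and the sample dict is a heavily abbreviated stand-in for a real getVmStats reply, so real figures have to come from a real host.

```python
import xmlrpc.client


def reply_size(stats):
    """Size in bytes of an XML-RPC method response carrying `stats`."""
    xml = xmlrpc.client.dumps((stats,), methodresponse=True)
    return len(xml.encode("utf-8"))


# Abbreviated sample values taken from the proposal; not a complete reply.
full_reply = {
    "cpuSys": "2.32",
    "cpuUser": "1.34",
    "memUsage": "30",
    "displayType": "qxl",
    "guestOs": "Win 8",
    "appsList": ["RHEV-Tools 3.2.4", "RHEV-Agent64 3.2.3"],
}
frequent_reply = {k: full_reply[k] for k in ("cpuSys", "cpuUser", "memUsage")}

if __name__ == "__main__":
    print("full:", reply_size(full_reply), "bytes")
    print("frequent:", reply_size(frequent_reply), "bytes")
```

This measures payload size only; it says nothing about the extra computation on the host.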
>
>> For each VM the data currently contains much more information than
>> actually needed, which inflates the size of the XML content
>> considerably. We could optimize this by splitting the getVmStats
>> reply into sections, based on what the engine requests. For this
>> reason Omer Frenkel and I have split up the data into parts based on
>> their usage.
>>
>> This data can and usually does change during the lifetime of the VM.
>>
>>
>>         Rarely Changed:
>>
>> This data does not change very frequently, and it should be enough to
>> update it only once in a while. Most commonly this data changes
>> after changes made in the UI or after a migration of the VM to
>> another Host.
>>
>>     *Status*  = Running
> Status does not change much, but when it does, it is important to report
> that quickly.
For this kind of data, it is suitable to use an event report, which 
should be available in the jsonrpc API.
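A minimal sketch of what I mean, assuming some notification channel exists. The class name StatusWatcher and the notify callback are hypothetical; nothing here is an existing VDSM or jsonrpc interface.

```python
class StatusWatcher:
    """Report a VM status only when it differs from the last one seen."""

    def __init__(self, notify):
        # notify(vm_id, status) is a hypothetical callback, e.g. one that
        # sends an event to the engine.
        self._notify = notify
        self._status = {}  # vm_id -> last reported status

    def update(self, vm_id, status):
        """Record the status; return True if an event was emitted."""
        if self._status.get(vm_id) == status:
            return False
        self._status[vm_id] = status
        self._notify(vm_id, status)
        return True
```

The first status seen for a VM is reported as well, so the engine does not have to poll for the initial state.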
>
>>     *acpiEnable*  = true
>>     *vmType*  = kvm
>>     *guestName*  = W864GUESTAGENTT
>>     *displayType*  = qxl
>>     *guestOs*  = Win 8
>>     *kvmEnable*  = true #/*this should be constant and never changed*/
>>     *pauseCode*  = NOERR
>>     *monitorResponse*  = 0
>>     *session*  = Locked # unused
>>     *netIfaces*  = [{'name': 'Realtek RTL8139C+ Fast Ethernet NIC', 'inet6':  ['fe80::490c:92bb:bbcc:9f87'], 'inet': ['10.34.60.148'], 'hw': '00:1a:4a:22:3c:db'}]
>>     *appsList*  = ['RHEV-Tools 3.2.4', 'RHEV-Agent64 3.2.3', 'RHEV-Serial64 3.2.3', 'RHEV-Network64 3.2.2', 'RHEV-Network64 3.2.3', 'RHEV-Block64 3.2.3', 'RHEV-Balloon64 3.2.3', 'RHEV-Balloon64 3.2.2', 'RHEV-Agent64 3.2.2', 'RHEV-USB 3.2.3', 'RHEV-Block64 3.2.2', 'RHEV-Serial64 3.2.2']
>>     *pid*  = 11314
>>     *guestIPs*  = 10.34.60.148 # duplicated info
>>     *displayIp*  = 0
>>     *displayPort*  = 5902
>>     *displaySecurePort*  = 5903
>>     *username*  = user at W864GUESTAGENTT
>>     *clientIp*  =
>>     *lastLogin*  = 1361976900.67
>>
>>
>>         Often Changed:
>>
>> This data changes quite often, but it is not necessary to update it
>> every 15 seconds. As this is cumulative data that reflects the
>> current status, it does not need to be snapshotted every 15 seconds
>> to retrieve statistics. The data can be retrieved in much more
>> generous time slices (e.g. every 5 minutes).
>>
>>     *network*  = {'vnet1': {'macAddr': '00:1a:4a:22:3c:db', 'rxDropped': '0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '100', 'name': 'vnet1'}}
>>     *disksUsage*  = [{'path': 'c:\\', 'total': '64055406592', 'fs': 'NTFS', 'used': '19223846912'}, {'path': 'd:\\', 'total': '3490912256', 'fs': 'UDF', 'used': '3490912256'}]
>>     *timeOffset*  = 14422
>>     *elapsedTime*  = 68591
>>     *hash*  = 2335461227228498964
>>     *statsAge*  = 0.09 # unused
>>
>>
>>         Often Changed but unused
>>
>> This data does not seem to be used in the engine at all. It is *not*
>> even used in the data warehouse.
>>
>>     *memoryStats*  = {'swap_out': '0', 'majflt': '0', 'mem_free': '1466884', 'swap_in': '0', 'pageflt': '0', 'mem_total': '2096736', 'mem_unused': '1466884'}
>>     *balloonInfo*  = {'balloon_max': 2097152, 'balloon_cur': 2097152}
>>     *disks*  = {'vda': {'readLatency': '0', 'apparentsize': '64424509440', 'writeLatency': '1754496', 	'imageID': '28abb923-7b89-4638-84f8-1700f0b76482', 'flushLatency': '156549',  'readRate': '0.00', 'truesize': '18855059456', 'writeRate': '952.05'}, 'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}
> I am pretty sure that {read,write,flush}Latency is collected and
> reported by Engine. `git grep writeLatency` reinforces my vague memory.
>>
>>         Very frequent updates needed by webadmin portal:
>>
>> This data is mostly needed for the webadmin portal and might be
>> required to be updated quite often. An exception here is the
>> statsAge field, which seems to be unused by the Engine. This data
>> could be requested every 15 seconds to keep things as they are now.
>>
>>     *cpuSys*  = 2.32
>>     *cpuUser*  = 1.34
>>     *memUsage*  = 30
>>
>>
>>     Proposed Solution for VDSM & Engine:
>>
>> We will introduce new optional parameters to getVmStats,
>> getAllVmStats and list to allow a finer grained specification of
>> data which should be included.
>>
>> *Parameter:* *statsType*=/*<string>*/ (getVmStats, getAllVmStats
>> only) *Allowed values:*
>>
>>   * full (default to keep backwards compatibility)
>>   * app-list (Just send the application list)
>>   * rare (include everything from rarely changed to very frequent)
>>   * often (include everything from often changed to very frequent)
>>   * frequent (only send the very frequently changed items)
> I think that a nice way to think of this is that Engine asks for a set
> of keys it is interested in. Asking for getVmStats(keys=[displayType,
> netIfaces]) would return only the requested values of the VM.
+1. It could split the information according to different functions,
not just by change frequency.
> "full",
> "rare", "often" and "frequent" are simply pre-defined sets of key names.
>
> A side effect of this pov is that we can avoid the vague name
> "statsType".
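The key-set view could look roughly like the sketch below. It is an illustration under assumptions: the function filter_stats and the PREDEFINED_SETS table are made-up names, the key groups are abbreviated from the sections of the proposal, and the real verb signature is still open.

```python
# Hypothetical, abbreviated key groups following the proposal's sections.
FREQUENT_KEYS = frozenset(["cpuSys", "cpuUser", "memUsage"])
OFTEN_KEYS = FREQUENT_KEYS | frozenset(
    ["network", "disksUsage", "timeOffset", "elapsedTime", "hash"])
RARE_KEYS = OFTEN_KEYS | frozenset(
    ["displayType", "netIfaces", "appsList", "pid"])

PREDEFINED_SETS = {
    "frequent": FREQUENT_KEYS,
    "often": OFTEN_KEYS,
    "rare": RARE_KEYS,
}


def filter_stats(stats, keys=None):
    """Return only the requested keys of a VM stats dict.

    keys=None keeps the full reply (backwards compatible), a string
    selects a predefined set, and any other iterable is used as given.
    """
    if keys is None:
        return dict(stats)
    if isinstance(keys, str):
        keys = PREDEFINED_SETS[keys]
    wanted = set(keys)
    return {k: v for k, v in stats.items() if k in wanted}
```

With this shape, filter_stats(stats, ["displayType", "netIfaces"]) and filter_stats(stats, "frequent") go through the same code path, and unknown keys are silently skipped.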
>
>>
>> *Parameter:* *clientId*=*<string>* The client id is specified by the
>> client and should be unique but used consistently across requests.
>>
>> *Parameter:* *diff*=*<boolean>* In combination with the clientId
>> VDSM will send only differences to the previous request from the
>> named clientId. (if diff=true)
> The semantics of "diff" are not completely defined: how about complex
> structures like that of "network"? It is most likely to be reported
> every time.
>
> Since this requires a caching mechanism on vdsm side, Engine must expect
> that the cache may be evicted at any moment, and that a full list is
> received.
Every data collector should be responsible for invalidating/updating the cache.
That could reduce the time needed to calculate the diff.
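To make the discussion concrete, here is a sketch of a per-clientId diff cache. The class DiffCache and the reply layout ("full", "stats", "removed") are assumptions for illustration, not a proposed wire format. Top-level values are compared as a whole, so a nested structure such as "network" is sent again in full whenever anything inside it changes.

```python
_MISSING = object()  # sentinel for keys absent from the previous reply


class DiffCache:
    """Remember the last reply per clientId and report only changed keys."""

    def __init__(self):
        self._last = {}  # clientId -> stats dict sent last time

    def evict(self, client_id):
        """Drop the cached state; the next reply for this client is full."""
        self._last.pop(client_id, None)

    def reply(self, client_id, stats):
        previous = self._last.get(client_id)
        self._last[client_id] = dict(stats)
        if previous is None:
            # Unknown client or evicted cache: fall back to a full reply.
            return {"full": True, "stats": dict(stats)}
        changed = {k: v for k, v in stats.items()
                   if previous.get(k, _MISSING) != v}
        removed = sorted(k for k in previous if k not in stats)
        return {"full": False, "stats": changed, "removed": removed}
```

The explicit "full" flag is one way to let Engine notice that the cache was evicted and that it received a complete list instead of a diff.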
>>
>>       Additional Change:
>>
>> Besides the introduction of the new parameters for list, getVmStats
>> and getAllVmStats it might make sense to include a hash for the
>> appList into the rarely changed section of the response. This would
>> allow the client to identify changes, so the complete appList would
>> have to be sent only if the hash known to the client is outdated.
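A possible shape for the appList hash, as a sketch only. The function names, the "appsListHash" key and the choice of SHA-1 over a sorted JSON encoding are all assumptions; the proposal does not specify how the hash is computed.

```python
import hashlib
import json


def apps_list_hash(apps):
    """Order-independent hash of the application list (assumed scheme)."""
    canonical = json.dumps(sorted(apps), separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def rare_section(apps, known_hash=None):
    """Always include the hash; include the list only if the client's
    known hash is missing or outdated."""
    current = apps_list_hash(apps)
    section = {"appsListHash": current}
    if known_hash != current:
        section["appsList"] = list(apps)
    return section
```

Sorting before hashing means a guest agent that reports the same applications in a different order does not trigger a resend.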
>>
>> *Note:* The appList (Application List) reported by the guest agent
>> could be fully implemented on request only, as long as the guest
>> agent installed supports this. As there seems to be a request to
>> have the complete list of installed applications on all guests this
>> data could be quite extensive and a huge list. On the other hand
>> this data is only rarely visible and therefore it should not be
>> requested all the time and only on demand.
>>
>>
>>       Improvement of the Guest Agent:
>>
>> As part of the proposed solution it is necessary to improve the
>> guest agent as well.
> Improving the agent may be a good idea, but I do not see the necessity
> in it. It's also important to improve the horrible multithreaded
> vdsm/libvirt statistics acquisition, but just as unrelated to the core
> of this feature.
>
>> For the full application list there should be
>> implemented a caching system which will be fully reactive and should
>> not, for example, poll the application list all the time. The guest
>> can create a prepared data file containing all data in the JSON
>> format (as used for the communication with VDSM via VIO) and then
>> just has to read that file from disk and send it directly to VDSM.
>> However it is quite possible that this list is too big and it might
>> have to be chunked into pieces. (Multiple messages, which would have
>> to be supported by VDSM then as well.) The solution for this is to
>> make VDSM request this data, so that it is retrieved on request
>> only.
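The chunking mentioned above could be sketched as follows. The message layout ("seq", "total", "data") and both function names are hypothetical; the actual guest agent protocol would need its own definition of multi-part messages.

```python
import json


def split_into_chunks(payload, max_chars):
    """Serialize payload to JSON and split it into numbered pieces."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    text = json.dumps(payload)  # ASCII-only by default, safe to slice
    parts = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return [{"seq": n, "total": len(parts), "data": part}
            for n, part in enumerate(parts)]


def reassemble(chunks):
    """Rebuild the payload on the receiving side; order does not matter."""
    ordered = sorted(chunks, key=lambda c: c["seq"])
    if not ordered or len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    return json.loads("".join(c["data"] for c in ordered))
```

Carrying "total" in every piece lets the receiver detect a lost message instead of parsing a truncated list.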
> _______________________________________________
> vdsm-devel mailing list
> vdsm-devel at lists.fedorahosted.org
> https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel