[Engine-devel] NUMA support action items

Chegu Vinod chegu_vinod at hp.com
Thu Apr 3 14:21:49 UTC 2014


On 4/3/2014 7:11 AM, Gilad Chaplik wrote:
> ----- Original Message -----
>> From: "Chegu Vinod" <chegu_vinod at hp.com>
>> To: "Xiao-Lei Shi (Bruce, HP Servers-PSC-CQ)" <xiao-lei.shi at hp.com>
>> Cc: "Einav Cohen" <ecohen at redhat.com>, "Shang-Chun Liang (David Liang, HPservers-Core-OE-PSC)"
>> <shangchun.liang at hp.com>, "Chuan Liao (Jason Liao, HPservers-Core-OE-PSC)" <chuan.liao at hp.com>, msivak at redhat.com,
>> "Da-huai Tang (Gary, MCXS-CQ)" <da-huai.tang at hp.com>, "Malini Rao" <mrao at redhat.com>, "Eldan Hildesheim"
>> <ehildesh at redhat.com>, "Doron Fediuck" <dfediuck at redhat.com>, sherold at redhat.com, "Alexander Wels"
>> <awels at redhat.com>, "Gilad Chaplik" <gchaplik at redhat.com>
>> Sent: Thursday, April 3, 2014 3:28:03 PM
>> Subject: RE: NUMA support action items
>>
>> Hi Bruce,
>>
>> The virtual NUMA layout in the guest is a very simple one (not multi-level
>> etc). It is generated by qemu+seabios... and there is no relationship with
>> the host NUMA node distances etc.  Let us not worry about gathering Virtual
>> NUMA node distances for now.
>>
>> Vinod
>>
> CC'ing devel list as well.
>
> Having said that, I don't see a reason not to prepare the infrastructure for it now (if it's free) for future versions (the guest agent will collect vNUMA data at some point).

If you think having this Virtual NUMA topology (along with the virtual 
numa node *distance* info.) really helps some future use cases then pl. 
go ahead...

Vinod



>
> Thanks,
> Gilad.
>
>> -----Original Message-----
>> From: Shi, Xiao-Lei (Bruce, HP Servers-PSC-CQ)
>> Sent: Thursday, April 03, 2014 12:41 AM
>> To: Vinod, Chegu
>> Cc: Einav Cohen; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC);
>> Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); msivak at redhat.com; Tang,
>> Da-huai (Gary, MCXS-CQ); Malini Rao; Eldan Hildesheim; Doron Fediuck;
>> sherold at redhat.com; Alexander Wels; Gilad Chaplik
>> Subject: RE: NUMA support action items
>>
>> Hi Vinod,
>>
>> Is it meaningful for us to collect the distance information of vm numa node
>> (maybe in future, not now)?
>> In my understanding, vm numa topology is a simulation of numa topology, since
>> the vcpus are just threads, I don't know how the vm numa node distances are
>> calculated in vm. Is there any relationship between the vNode distances and
>> host node distances?
>>
>> Thanks & Best Regards
>> Shi, Xiao-Lei (Bruce)
>>
>> Hewlett-Packard Co., Ltd.
>> HP Servers Core Platform Software China Telephone +86 23 65683093 Mobile +86
>> 18696583447 Email xiao-lei.shi at hp.com
>>
>>
>> -----Original Message-----
>> From: Vinod, Chegu
>> Sent: Thursday, April 03, 2014 7:18 AM
>> To: Gilad Chaplik
>> Cc: Einav Cohen; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC);
>> Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); msivak at redhat.com; Shi,
>> Xiao-Lei (Bruce, HP Servers-PSC-CQ); Tang, Da-huai (Gary, MCXS-CQ); Malini
>> Rao; Eldan Hildesheim; Doron Fediuck; sherold at redhat.com; Alexander Wels
>> Subject: RE: NUMA support action items
>>
>> Not sure what the correct way to do this is....but here is a suggestion.
>>
>> Let a given host server diagram shown be very generic...i.e. show the N
>> sockets/nodes numbered from 0 thru N-1.  Show the amount of memory and the
>> list of CPUs in each of those sockets/nodes.
>> Draw a generic Interconnect fabric [box] in between which all the sockets
>> connect to....
>>
>> Ideally ... Under that host diagram we could show the NUMA node distances in
>> text format (as you know this is derived from "numactl -H" and then
>> conveyed from VDSM -> oVirt engine etc).
>> That distance info. will tell the user what the distance between a pair of
>> sockets/nodes is (and they can then do what they wish after that :)).
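
A minimal sketch of how the distance table might be extracted from "numactl -H" style text. The function name parse_node_distances and the SAMPLE string are illustrative assumptions, not VDSM or oVirt engine code; the sample reuses the 4-socket table quoted further down in this thread.

```python
def parse_node_distances(text):
    """Return {node: {other_node: distance}} from `numactl -H` style text."""
    lines = text.splitlines()
    # Locate the "node distances:" marker; the column header row follows it.
    start = next(i for i, line in enumerate(lines)
                 if line.strip() == "node distances:")
    header = lines[start + 1].split()            # ["node", "0", "1", ...]
    columns = [int(c) for c in header[1:]]
    table = {}
    for line in lines[start + 2:]:
        row_label, sep, rest = line.partition(":")
        if not sep or not row_label.strip().isdigit():
            break                                # past the end of the matrix
        values = [int(v) for v in rest.split()]
        table[int(row_label)] = dict(zip(columns, values))
    return table


# Hypothetical sample: only the distance section of the command output.
SAMPLE = """\
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
"""

distances = parse_node_distances(SAMPLE)
# Ratio of remote to local distance for node 0 -> node 1.
print(distances[0][1] / distances[0][0])
```

The nested dictionary keeps node ids explicit, so the same structure could be handed to a UI layer as text or used to group nodes graphically.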
>>
>> Vinod
>>
>> -----Original Message-----
>> From: Gilad Chaplik [mailto:gchaplik at redhat.com]
>> Sent: Wednesday, April 02, 2014 4:09 PM
>> To: Vinod, Chegu
>> Cc: Einav Cohen; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC);
>> Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); msivak at redhat.com; Shi,
>> Xiao-Lei (Bruce, HP Servers-PSC-CQ); Tang, Da-huai (Gary, MCXS-CQ); Malini
>> Rao; Eldan Hildesheim; Doron Fediuck; sherold at redhat.com; Alexander Wels
>> Subject: Re: NUMA support action items
>>
>> Thank you Vinod for the very detailed explanation.
>> GUI-wise, do you want to show those numbers? Maybe for the first phase it is
>> enough to show them via the API?
>>
>> A thought: according to your example there could be up to 2 distances, so
>> maybe the 'closer' nodes can be in the same column or something; I mean to try
>> and illustrate it graphically rather than with numbers (we have enough of those
>> :)).
>>
>> Thanks,
>> Gilad.
>>
>> ----- Original Message -----
>>> From: "Chegu Vinod" <chegu_vinod at hp.com>
>>> To: "Einav Cohen" <ecohen at redhat.com>
>>> Cc: "Gilad Chaplik" <gchaplik at redhat.com>, "Shang-Chun Liang (David Liang,
>>> HPservers-Core-OE-PSC)"
>>> <shangchun.liang at hp.com>, "Chuan Liao (Jason Liao,
>>> HPservers-Core-OE-PSC)" <chuan.liao at hp.com>, msivak at redhat.com, "Xiao-Lei
>>> Shi (Bruce, HP Servers-PSC-CQ)" <xiao-lei.shi at hp.com>, "Da-huai Tang
>>> (Gary, MCXS-CQ)"
>>> <da-huai.tang at hp.com>, "Malini Rao" <mrao at redhat.com>, "Eldan Hildesheim"
>>> <ehildesh at redhat.com>, "Doron Fediuck"
>>> <dfediuck at redhat.com>, sherold at redhat.com, "Alexander Wels"
>>> <awels at redhat.com>
>>> Sent: Saturday, March 29, 2014 8:15:56 AM
>>> Subject: Re: NUMA support action items
>>>
>>> On 3/27/2014 10:42 AM, Einav Cohen wrote:
>>>> Hi Vinod, thank you very much for that extra information.
>>>>
>>>> unfortunately, we are not familiar with what are levels of NUMA
>>>> (local socket/node, buddy socket/node, remote socket/
>>>> node) and/or what "distance" is - I assume that these are
>>>> definitions that are related to the physical layout of the
>>>> sockets/cores/nodes/RAM and/or to their physical proximity to each
>>>> other, but we will need more detailed explanations if this would
>>>> need to be incorporated into the UX design.
>>>>
>>>> Will you be able to explain it to us / refer us to some material on
>>>> that?
>>> Sorry for the delay in response (I was in a conference).
>>>
>>> Not sure if the following high-level explanation would help (I will look
>>> for some references in the meantime..or perhaps you can ask someone
>>> like Joe Mario in Shak's performance group to explain it to you).
>>>
>>> In the smaller NUMA servers each socket is directly connected (i.e.
>>> single "hop" away) to any other socket in the server.. This is typical
>>> of all 2 socket Intel servers and a vast majority of 4 socket Intel
>>> servers.
>>>
>>> In some larger NUMA servers a socket could either be directly
>>> connected (single "hop" away) to another socket (or) may have to go
>>> through an interconnect fabric (like a crossbar fabric agent chip,
>>> etc.) to get to another socket in the system (i.e. several "hops"
>>> away).  The sockets that are directly connected (i.e. single "hop"
>>> away) are the buddy sockets...and those that aren't are the remote
>>> sockets.  Some call this type of a server as having a multi-level NUMA
>>> topology...
>>>
>>> The way to decipher all of this is by looking at the NUMA node
>>> distance table (I had included a sample of that in the slides that I sent
>>> earlier).
>>>
>>> For e.g. in the example 4 socket server..where all sockets are just
>>> one hop away the node distances are as follows
>>>
>>> node distances:
>>> node   0   1   2   3
>>>   0:  10  21  21  21
>>>   1:  21  10  21  21
>>>   2:  21  21  10  21
>>>   3:  21  21  21  10
>>>
>>> Going from node 0 to nodes 1-3 (or for that matter any pair of nodes)
>>> the node distance is the same, i.e. 2.1x the local latency (21 vs. 10).
>>>
>>> In another example of a different (larger 8 socket server) the node
>>> distances looked something like this :
>>>
>>> node distances:
>>> node   0   1   2   3   4   5   6   7
>>>   0:  10  16  30  30  30  30  30  30
>>>   1:  16  10  30  30  30  30  30  30
>>>   2:  30  30  10  16  30  30  30  30
>>>   3:  30  30  16  10  30  30  30  30
>>>   4:  30  30  30  30  10  16  30  30
>>>   5:  30  30  30  30  16  10  30  30
>>>   6:  30  30  30  30  30  30  10  16
>>>   7:  30  30  30  30  30  30  16  10
>>>
>>> Going from node 0 to node 1 (buddy), which is just one hop away, had a
>>> node distance of 1.6x... but going from node 0 to nodes 2-7 meant
>>> going through the interconnect fabric and it was expensive i.e. 3x.
>>> The nodes 2-7 are the remote nodes for node 0.
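
The reading of the distance table described above can be sketched as follows. This is only an illustration under stated assumptions: the function name numa_levels and the list-of-lists layout are hypothetical, and the matrix is the 8-socket example from this message. Each node's distances are divided by its local distance (the diagonal entry) and the other nodes are grouped by the resulting ratio; the number of groups is the number of NUMA levels beyond local.

```python
def numa_levels(distances, node):
    """Group the other nodes by relative distance from `node`, nearest first.

    `distances` is a square list of lists (row = source node).
    Returns a list of (ratio_to_local, [node ids]) tuples.
    """
    local = distances[node][node]
    groups = {}
    for other, dist in enumerate(distances[node]):
        if other != node:
            groups.setdefault(dist / local, []).append(other)
    return sorted(groups.items())


# The 8-socket example table from this thread.
EIGHT_SOCKET = [
    [10, 16, 30, 30, 30, 30, 30, 30],
    [16, 10, 30, 30, 30, 30, 30, 30],
    [30, 30, 10, 16, 30, 30, 30, 30],
    [30, 30, 16, 10, 30, 30, 30, 30],
    [30, 30, 30, 30, 10, 16, 30, 30],
    [30, 30, 30, 30, 16, 10, 30, 30],
    [30, 30, 30, 30, 30, 30, 10, 16],
    [30, 30, 30, 30, 30, 30, 16, 10],
]

levels = numa_levels(EIGHT_SOCKET, 0)
for ratio, nodes in levels:
    print(f"{ratio}x: nodes {nodes}")
# Two groups for node 0: the buddy (node 1) and the remote nodes.
```

Applied to the 4-socket table, the same function yields a single group, which matches the "one level of NUMA" case.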
>>>
>>> HTH
>>> Vinod
>>>
>>>> Many thanks in advance.
>>>>
>>>> ----
>>>> Regards,
>>>> Einav
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Chegu Vinod" <chegu_vinod at hp.com>
>>>>> To: "Gilad Chaplik" <gchaplik at redhat.com>, "Shang-Chun Liang (David
>>>>> Liang, HPservers-Core-OE-PSC)"
>>>>> <shangchun.liang at hp.com>, "Chuan Liao (Jason Liao,
>>>>> HPservers-Core-OE-PSC)"
>>>>> <chuan.liao at hp.com>, msivak at redhat.com, "Xiao-Lei Shi (Bruce, HP
>>>>> Servers-PSC-CQ)" <xiao-lei.shi at hp.com>, "Da-huai Tang (Gary,
>>>>> MCXS-CQ)"
>>>>> <da-huai.tang at hp.com>, "Malini Rao" <mrao at redhat.com>, "Eldan
>>>>> Hildesheim"
>>>>> <ehildesh at redhat.com>
>>>>> Cc: "Doron Fediuck" <dfediuck at redhat.com>, "Einav Cohen"
>>>>> <ecohen at redhat.com>, sherold at redhat.com, "Alexander Wels"
>>>>> <awels at redhat.com>
>>>>> Sent: Thursday, March 27, 2014 12:00:51 AM
>>>>> Subject: RE: NUMA support action items
>>>>>
>>>>> Thanks for sharing the UX info.
>>>>>
>>>>> There is one thing that I forgot to mention in today's morning
>>>>> meeting...
>>>>>
>>>>> There are hosts that will have one level of NUMA (i.e. local
>>>>> socket/node and then remote socket/node). Most <= 4 socket hosts
>>>>> belong to this category.  (I consider these the sweet spot servers.)
>>>>>
>>>>> When it comes to larger hosts with 8 sockets and more...there can
>>>>> be some hosts with multiple levels of NUMA (i.e. local
>>>>> socket/node, buddy socket/node, and then remote socket/node).
>>>>>
>>>>> Pl. see attached.... (the 8 socket prototype system is an HP
>>>>> platform...and it's actually only showing half of the system...the
>>>>> actual system is 16 sockets but has a similar NUMA topology). The
>>>>> NUMA node distances of a given host will provide information about the
>>>>> # of levels of NUMA ...
>>>>>
>>>>> Something to keep in mind when you folks choose to display the
>>>>> host NUMA topology in the UX.
>>>>>
>>>>> Thanks
>>>>> Vinod
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gilad Chaplik [mailto:gchaplik at redhat.com]
>>>>> Sent: Wednesday, March 26, 2014 9:26 AM
>>>>> To: Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC); Liao,
>>>>> Chuan (Jason Liao, HPservers-Core-OE-PSC); msivak at redhat.com; Shi,
>>>>> Xiao-Lei (Bruce, HP Servers-PSC-CQ); Vinod, Chegu; Tang, Da-huai
>>>>> (Gary, MCXS-CQ); Malini Rao; Eldan Hildesheim
>>>>> Cc: Doron Fediuck; Einav Cohen; sherold at redhat.com; Alexander Wels
>>>>> Subject: NUMA support action items
>>>>>
>>>>> Hi All,
>>>>>
>>>>> First of all I'd like to thank Malini and Eldan for their great
>>>>> work, I'm sure we'll have a cool UI thanks to them, and Vinod for great
>>>>> insights.
>>>>>
>>>>> Keep on with the great work :-)
>>>>>
>>>>> Action items (as I see it) for the next couple of weeks (in parentheses,
>>>>> the owner):
>>>>>
>>>>> 0) Resolve community design comments, and finish design phase
>>>>> including sketches (All).
>>>>> 1) Finish UX design and sketches (Malini and Eldan, all to assist).
>>>>> * focus on VM dialog (biggest gap as I see it).
>>>>> * 'default host' topology view, where we don't pin a host.
>>>>> * NUMA in cluster level.
>>>>> 2) Engine Core API, merge BE patch [1], and prepare patches for
>>>>> other APIs (commands (VdcActionType), queries (VdcQueryType),
>>>>> including parameter classes).
>>>>> note that the actual implementation can be mock-ups of fake NUMA
>>>>> entities, in order to start GUI/RESTful development in parallel (HP
>>>>> development team).
>>>>> 3) Test VDSM API (vdsClient) including very basic benchmarks and
>>>>> publish a report (HP development team).
>>>>> 4) VDSM - engine core integration (HP development team, Martin and
>>>>> Gilad to assist).
>>>>> 5) DB scripts and stored procedures - post maintainer (Eli M) acking the
>>>>> design (HP development team, Gilad to assist).
>>>>> 6) RESTful API impl - post maintainer (Juan H) acking the design
>>>>> (HP development team, Gilad to assist).
>>>>> 7) GUI programmatic design and starting implementation - in order
>>>>> to start it ASAP, the engine's API should be available ASAP see
>>>>> action item #2 (Gilad, assistance from Einav's UX team).
>>>>> 8) MOM and KSM integration, continue current thread and reach
>>>>> conclusions (HP development team, Martin to assist).
>>>>>
>>>>> You are more than welcome to comment :-) nothing's carved in stone.
>>>>> if I forgot someone, please reply to all and CC him.
>>>>>
>>>>> Thanks,
>>>>> Gilad.
>>>>>
>>>>> [1] http://gerrit.ovirt.org/#/c/23702/
>>>>>
>>>> .
>>>>
>>>
> .
>



