[Engine-devel] Design wiki page for trusted compute pools integration with oVirt has been updated

Sun Apr 28 13:24:06 UTC 2013

Hi Jimmy,
When the engine starts, it will start 'learning' the current state
of known hosts. So if we want to secure all hosts in 'up' state,
you will need to change their status to 'unassigned', but only if
cluster.trusted == true, in:
org.ovirt.engine.core.vdsbroker.ResourceManager.AddVds(VDS, boolean)

This means the host will be picked up by:
org.ovirt.engine.core.vdsbroker.VirtMonitoringStrategy.isMonitoringNeeded(VDS)

and the attestation query should then get the results and update the trust
level and, accordingly, the VDS status.
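
As a rough illustration of the flow above, here is a minimal Java sketch. All
types and method names in it (HostStatus, TrustLevel, TrustedStartupSketch,
initialStatus, statusAfterAttestation) are hypothetical stand-ins and not the
real engine classes. It also assumes that a trusted answer leaves the status
to normal monitoring, which is my reading and not something confirmed here.

```java
// Illustrative stand-ins only; these are NOT the real oVirt engine types.
enum HostStatus { UP, UNASSIGNED, NON_OPERATIONAL }

enum TrustLevel { TRUSTED, UNTRUSTED }

final class TrustedStartupSketch {
    private TrustedStartupSketch() {}

    /** Status to use when the engine loads a known host at startup. */
    static HostStatus initialStatus(HostStatus persisted, boolean clusterTrusted) {
        // Only 'up' hosts in a trusted cluster are reset, so that they are
        // attested again before they can be used.
        if (clusterTrusted && persisted == HostStatus.UP) {
            return HostStatus.UNASSIGNED;
        }
        return persisted;
    }

    /** Status once the attestation answer for a host has arrived. */
    static HostStatus statusAfterAttestation(HostStatus current, TrustLevel level) {
        // Untrusted hosts become non-operational, as discussed in this thread.
        if (level == TrustLevel.UNTRUSTED) {
            return HostStatus.NON_OPERATIONAL;
        }
        // Assumption: a trusted answer leaves the status to normal monitoring.
        return current;
    }
}
```

Hosts in clusters without cluster.trusted keep their persisted status, so they
do not wait for the attestation service.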

Note that we assume the trust level takes precedence over other functionality,
as this flow will cause storage connections to reinitialize as well. Basically
it means that booting the engine will take longer, but this is the price of
security.
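
For the 2-phase query discussed in the quoted thread below (a short query for
something like 10 hosts, then a longer one for the rest), the split itself
could look roughly like the following Java sketch. The class and method names
are made up for illustration, hosts are represented by plain name strings, and
the actual call to the attestation server is left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the engine code base.
final class TwoPhaseAttestationSketch {
    private TwoPhaseAttestationSketch() {}

    /**
     * Splits the hosts into two batches: the first holds at most
     * shortQuerySize hosts (the fast query), the second holds the rest.
     */
    static List<List<String>> split(List<String> hostNames, int shortQuerySize) {
        if (shortQuerySize < 0) {
            throw new IllegalArgumentException("shortQuerySize must be >= 0");
        }
        int cut = Math.min(shortQuerySize, hostNames.size());
        List<List<String>> phases = new ArrayList<>();
        // Copy the sublists so later changes to hostNames do not affect them.
        phases.add(new ArrayList<>(hostNames.subList(0, cut)));
        phases.add(new ArrayList<>(hostNames.subList(cut, hostNames.size())));
        return phases;
    }
}
```

With 300 hosts and a short query size of 10, this yields one batch of 10 hosts
and one of 290. The short query size would be the admin-configurable value.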

----- Original Message -----
> From: "Gang Wei" <gang.wei at intel.com>
> To: "Itamar Heim" <iheim at redhat.com>, "Doron Fediuck" <dfediuck at redhat.com>
> Cc: "Oved Ourfalli" <ovedo at redhat.com>, engine-devel at ovirt.org
> Sent: Sunday, April 28, 2013 2:06:24 PM
> Subject: Re: [Engine-devel] Design wiki page for trusted compute pools integration with oVirt has been updated
> 
> I like the ideas of 2-phase aggregated attestation & cluster-by-cluster
> order.
> 
> But I want to understand the process more clearly.
> 
> Without TCP, how does the engine handle the states of existing hosts during
> engine boot? Will the engine put all existing hosts in non-operational state,
> perform some check via VDSM, and then turn them operational?
> Putting a host in non-operational state will cause VM migration, right?
> 
> Or is there a global state in the engine to indicate whether the user is
> allowed to create VMs?
> 
> Thanks
> Jimmy
> 
> Itamar Heim wrote on 2013-04-28:
> > On 04/28/2013 11:34 AM, Doron Fediuck wrote:
> >> Hi Dave,
> >> 
> >> Just to make sure I fully understand, I'll repeat your basic arguments;
> >> 
> >> 1. It takes time to query a big number of hosts (hundreds).
> >> 
> >> 2. When backend is booting, a user may start a VM on a host which was
> >> hacked during the downtime of the engine.
> >> 
> >> If the above is your concern, it shouldn't be.
> >> The reason is that no host will become operational before you get a
> >> response from the attestation server and allow it to become
> >> operational. So a user cannot start a new VM on a non-operational host.
> > 
> > I'd do the queries in groups of "cluster", so cluster-by-cluster they get
> > unblocked. A cluster without attestation service shouldn't block on this,
> > of course.
> > 
> >> 
> >> What this means is that your thread may need to update the user by
> >> sending a periodic event that a large-scale attestation operation is
> >> in progress. Other than that, maybe your thread can work in smaller
> >> groups if it gets better results? I.e., instead of one query for 300
> >> hosts, maybe you can run 3 serialized queries for 100 hosts each?
> >> If this does not help, maybe you can run a short query for something
> >> like 10 hosts, which should get an answer relatively fast. Then you
> >> can issue a query for the other 290 hosts, which will take longer.
> >> This way the system may get 10 hosts to work with quite fast, and
> >> later on the other 290 hosts will join... So this can actually be
> >> made configurable as a 2-phase process: a short query and a longer
> >> one. The admin can choose the short query size based on their setup,
> >> and the longer query can pick up all the other hosts.
> >> What do you think?
> >> 
> >> Doron
> >> 
> >> ----- Original Message -----
> >>> From: "Wei D Chen" <wei.d.chen at intel.com>
> >>> To: "Doron Fediuck" <dfediuck at redhat.com>
> >>> Cc: "Oved Ourfalli" <ovedo at redhat.com>, engine-devel at ovirt.org
> >>> Sent: Saturday, April 27, 2013 9:36:44 AM
> >>> Subject: Re: [Engine-devel] Design wiki page for trusted compute pools
> >>> integration with oVirt has been updated
> >>> 
> >>> Hi,
> >>> 
> >>> Our current plan is to add a new thread on the engine side that
> >>> attests all hosts once (via an aggregated query to the attestation
> >>> server) whenever the engine reboots. There is still one potential
> >>> issue under extreme conditions: with hundreds of nodes in a
> >>> datacenter, attesting all hosts may still take a couple of minutes,
> >>> so a hacked, untrusted node may be treated as a trusted host until
> >>> its latest status is received. The worst case in a datacenter with
> >>> hundreds of nodes is:
> >>> 
> >>> 1. The engine goes down for some reason and boots up again; some
> >>>    trusted nodes may be hacked and rebooted during this period.
> >>> 2. Our thread runs to get the status (trusted/untrusted) of all
> >>>    nodes, which may take a couple of minutes in a large datacenter.
> >>> 3. A user creates VMs on these hacked nodes, believing they are
> >>>    trusted VMs launched on trusted nodes.
> >>> 4. Our thread gets the correct status of these untrusted nodes and
> >>>    sets them as non-operational.
> >>> 5. All of the "trusted" VMs running on these untrusted nodes are
> >>>    expected to migrate to other trusted nodes.
> >>> 
> >>> So, the question is: in a trusted cluster with hundreds of nodes, some
> >>> VMs expected to be created on trusted nodes may actually be created
> >>> on untrusted nodes instead, and this window may last a couple of
> >>> minutes (the worst case, in my view, is 10 minutes with 1000 nodes).
> >>> Is this acceptable from your point of view? Or do you have any other
> >>> suggestion?
> >>> 
> >>> 
> >>> Best Regards,
> >>> Dave Chen
> >>> 
> >>> 
> >>> Doron Fediuck wrote on 2013-04-21:
> >>>> 
> >>>> 
> >>>> 
> >>>> ----- Original Message -----
> >>>>> From: "Wei D Chen" <wei.d.chen at intel.com>
> >>>>> To: "Ofri Masad" <omasad at redhat.com>
> >>>>> Cc: "Oved Ourfalli" <ovedo at redhat.com>, engine-devel at ovirt.org
> >>>>> Sent: Sunday, April 21, 2013 4:00:55 PM
> >>>>> Subject: Re: [Engine-devel] Design wiki page for trusted compute pools
> >>>>> integration with oVirt has been updated
> >>>>> 
> >>>>> Ofri,
> >>>>> 
> >>>>> Absolutely right, an aggregated query gives a significant time
> >>>>> improvement compared to separate queries. I agree with an aggregated
> >>>>> query on engine startup. Is it possible to invoke the attestation
> >>>>> service in the engine's initialization code block instead of a
> >>>>> "quartz job"? Is there any class similar to "InitVdsOnUpCommand" for
> >>>>> the engine's initialization?
> >>>>> 
> >>>>> Best Regards,
> >>>>> Dave Chen
> >>>>> 
> >>>> org.ovirt.engine.core.bll.Backend.Initialize()
> >>>> 
> >>>> Note you cannot block this method while waiting for results. Instead
> >>>> I suggest you fire a one-time background request from this method.
> >>>> 
> >>>> 
> >>>> Ofri Masad wrote on 2013-04-21:
> >>>>> 
> >>>>> Dave,
> >>>>> 
> >>>>> If I'm not mistaken, there is a big difference between separate
> >>>>> queries to the attestation server and an aggregated one?
> >>>>> Is that true?
> >>>>> 
> >>>>> Thanks,
> >>>>> Ofri
> >>>>> 
> >>>>> ----- Original Message -----
> >>>>>> From: "Itamar Heim" <iheim at redhat.com>
> >>>>>> To: "Ofri Masad" <omasad at redhat.com>
> >>>>>> Cc: "Oved Ourfalli" <ovedo at redhat.com>, "Wei D Chen"
> >>>>>> <wei.d.chen at intel.com>, engine-devel at ovirt.org
> >>>>>> Sent: Sunday, April 21, 2013 10:20:17 AM
> >>>>>> Subject: Re: [Engine-devel] Design wiki page for trusted compute
> >>>>>> pools integration with oVirt has been updated
> >>>>>> 
> >>>>>> On 04/21/2013 10:13 AM, Ofri Masad wrote:
> >>>>>>> Hi,
> >>>>>>> One more thing we need to think about for the second approach -
> >>>>>>> aggregated query. On engine start we need to determine the trust
> >>>>>>> state of all the hosts. Sending a separate query for each host
> >>>>>>> will overload the attestation host and the network. An initial
> >>>>>>> aggregated query needs to be sent when the engine starts.
> >>>>>>> The same thing can happen after a management network failure and
> >>>>>>> so on. Maybe we can run a quartz job every x minutes, checking if
> >>>>>>> a large part of the hosts in the cluster (like 30%) are untrusted -
> >>>>>>> in that case run the aggregated query.
> >>>>>> 
> >>>>>> are we sure this optimization is needed?
> >>>>>> how heavy/latent is the call to the attestation service?
> >>>>>> 
> >>>>> _______________________________________________
> >>>>> Engine-devel mailing list
> >>>>> Engine-devel at ovirt.org
> >>>>> http://lists.ovirt.org/mailman/listinfo/engine-devel
> >>>>> 