status update: consuming the BLOCK_THRESHOLD event from libvirt (rhbz#1181665)

Hi all,

With libvirt 3.2.0 onwards, it seems we now have the tools to solve https://bugzilla.redhat.com/show_bug.cgi?id=1181665 and eventually get rid of the disk polling we do. This change is expected to have a huge impact on performance, so I'm working on it.

I had plans for a comprehensive refactoring in this area, but it looks like a solution backportable to 4.1.z is appealing, so I started with this first, saving the refactoring (which I still very much want) for later.

So, quick summary: libvirt >= 3.2.0 allows setting a threshold on any node in the backing chain of each drive of a VM (https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainSetBlockThreshold), and fires an event exactly once when that threshold is crossed. The event needs to be explicitly rearmed afterwards. This is exactly what we need to get rid of polling in the steady state, so far so good.

The problem is: we can't use this for some important flows we have, which involve the usage of disks not (yet) attached to a given VM. Possibly affected flows:

- live storage migration: we use flags = (libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW | libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT | VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB), meaning that Vdsm is in charge of handling the volume
- snapshots: we use snapFlags = (libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT | libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA) (same meaning as above)
- live merge: should be OK (according to a glance at the source and a chat with Adam).

So it looks like we will need to bridge this gap.

We can still use the BLOCK_THRESHOLD event for the steady state, and avoid polling in the vast majority of the cases. With "steady state" I mean that the VM is running, with no administration operation (snapshot, live merge, live storage migration...) in progress. I think it is fair to assume that VMs are in this state the vast majority of the time.

For the very important cases in which we cannot depend on events, we can fall back to polling, but in a smarter way: instead of polling everything every 2s, let's poll just the drives involved in the ongoing operations. Those should be a small fraction of the total number of drives, and polled for a far shorter time than today, so polling should be practical.

Since the event fires once, we will need to rearm it only if the operation is ongoing, and only just before starting it (both conditions easy to check). We can disable the polling on completion, or on error. This per se is easy, but we will need a careful review of the flows, and perhaps some safety nets in place. Anyway, should we fail to disable the polling, we will "just" have some overhead.

On recovery, we will need to make sure to rearm all the relevant events, but we can just plug into the recovery we must do already, so this should be easy as well.

So it seems to me this could fly and we can actually have the performance benefits of events. However, due to the fact that we need to review some existing and delicate flows, I think we should still keep the current polling code around for the next release. I believe the best route is:

1. offer the new event-based code for 4.2, keep the polling around. Default to events for performance
2. remove the polling completely in 4.3

I'm currently working on the patches here: https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:watermark-event-minimal

Even though the basics are in place, I don't think they are ready for review yet. Comments welcome, as usual.
-- Francesco Romani Senior SW Eng., Virtualization R&D Red Hat IRC: fromani github: @fromanirh
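
For illustration, a minimal sketch of driving this mechanism through the libvirt Python bindings; the VM name "a-vm", the device name "vda" and the 1 GiB threshold are invented for the example:

    import threading
    import libvirt

    def on_block_threshold(conn, dom, dev, path, threshold, excess, opaque):
        # Fired at most once per armed threshold; libvirt will not report
        # it again for this device until setBlockThreshold() is called anew.
        print("VM %s, device %s: threshold %d crossed by %d bytes"
              % (dom.name(), dev, threshold, excess))

    # The default event loop must be registered before opening the
    # connection, and something has to keep iterating it.
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open("qemu:///system")

    def run_event_loop():
        while True:
            libvirt.virEventRunDefaultImpl()

    t = threading.Thread(target=run_event_loop)
    t.daemon = True
    t.start()

    conn.domainEventRegisterAny(
        None,  # None = deliver the event for all domains
        libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_THRESHOLD,
        on_block_threshold,
        None)

    # Arm: fire once when the allocation of 'vda' crosses 1 GiB.
    dom = conn.lookupByName("a-vm")
    dom.setBlockThreshold("vda", 1 * 1024 ** 3, 0)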

On 19 July 2017 at 12:44, Francesco Romani <fromani@redhat.com> wrote:
2. remove the polling completely in 4.3
This is all way beyond my level of understanding but, I wonder: is there a danger of events failing to be delivered? Would there be a benefit to keeping the polling around but reducing the frequency to something like once an hour? -- Barak Korren RHV DevOps team , RHCE, RHCi Red Hat EMEA redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted

On 07/19/2017 12:32 PM, Barak Korren wrote:
On 19 July 2017 at 12:44, Francesco Romani <fromani@redhat.com> wrote:
2. remove the polling completely in 4.3
This is all way beyond my level of understanding but, I wonder: is there a danger of events failing to be delivered?
Yes, there is always this risk, because of bugs :) Besides that, we must pay attention to recovery. Everything else is "just" a libvirt bug.
Would there be a benefit to keeping the polling around but reducing the frequency to something like once an hour?
The whole point of such frequent polling is to avoid VMs getting paused because of disk exhaustion while they run on thin-provisioned storage. So infrequent polling is of little help here; better to disable it entirely. Bests, -- Francesco Romani Senior SW Eng., Virtualization R&D Red Hat IRC: fromani github: @fromanirh

On Wed, Jul 19, 2017 at 12:46 PM Francesco Romani <fromani@redhat.com> wrote:
Hi all,
With libvirt 3.2.0 onwards, it seems we now have the tools to solve https://bugzilla.redhat.com/show_bug.cgi?id=1181665
and eventually get rid of the disk polling we do. This change is expected to have a huge impact on performance, so I'm working on it.
I had plans for a comprehensive refactoring in this area, but it looks like a solution backportable to 4.1.z is appealing, so I
started with this first, saving the refactoring (which I still very much want) for later.
So, quick summary: libvirt >= 3.2.0 allows setting a threshold on any node in the backing chain of each drive of a VM
( https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainSetBlockThreshold ), and fires an event exactly once
when that threshold is crossed. The event needs to be explicitly rearmed afterwards.
This is exactly what we need to get rid of polling in the steady state, so far so good.
The problem is: we can't use this for some important flows we have, which involve the usage of disks not (yet) attached to a given VM.
Possibly affected flows:
- live storage migration:
we use flags = (libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW | libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT | VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB)
meaning that Vdsm is in charge of handling the volume
- snapshots:
we use snapFlags = (libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT | libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA)
(same meaning as above)
- live merge: should be OK (according to a glance at the source and a chat with Adam).
So it looks like we will need to bridge this gap.
So we can still use the BLOCK_THRESHOLD event for steady state, and avoid polling in the vast majority of the cases.
With "steady state" I mean that the VM is running, with no administration (snapshot, live merge, live storage migration...) operation in progress.
I think it is fair to assume that VMs are in this state the vast majority of the time. For the very important cases on which we cannot depend on events, we can fall back to polling, but in a smarter way:
instead of polling everything every 2s, let's just poll just the drives involved in the ongoing operations.
Those should be far less of the total amount of drives, and for a far shorter time than today, so polling should be practical.
Since the event fires once, we will need to rearm it only if the operation is ongoing, and only just before to start it (both conditions easy to check) We can disable the polling on completion, or on error. This per se is easy, but we will need a careful review of the flows, and perhaps some safety nets in place.
Consider fusing polling and events into a single pipeline of events so they can be used together. If a poll triggers an event (with a distinguished origin) then all the handling is done in one place, and it should be easy to stop or start polling, or remove them totally.
Anyway, should we fail to disable the polling, we will "just" have some overhead.
On recovery, we will need to make sure to rearm all the relevant events, but we can just plug into the recovery we must do already, so this should be easy as well.
What is needed in order to 'rearm' it? Is there an API to get the state of the event subscription? If we lost an event, how do we know to rearm it? Is rearming idempotent? Remind me, do we extend a disk if the VM paused with an out-of-space event? How will we handle 2 subsequent events if we didn't extend between them? (expecting the extend to be an async operation)
So it seems to me this could fly and we can actually have the
performance benefits of events.
However, due to the fact that we need to review some existing and
delicate flows, I think we should still keep the current polling code around for the next release.
+1
I believe the best route is:
1. offer the new event-based code for 4.2, keep the polling around. Default to events for performance
2. remove the polling completely in 4.3
I still wonder if removing it totally is good. The absence of the events should be supervised somehow - as today, a failure to poll getstats of a domain results in the VM going unresponsive. Not the most accurate state, but at least it gives some visibility. So polling should cover us where events fail (similar to the engine's VM monitoring).
I'm currently working on the patches here:
https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:watermark-event-minimal
Even though the basics are in place, I don't think they are ready for review yet.
Comments welcome, as usual.
-- Francesco Romani Senior SW Eng., Virtualization R&D Red Hat IRC: fromani github: @fromanirh

On 23 Jul 2017, at 11:58, Roy Golan <rgolan@redhat.com> wrote:
I still wonder if removing it totally is good. The absence of the events should be supervised somehow - as today, a failure to poll getstats of a domain results in the VM going unresponsive. Not the most accurate state, but at least it gives some visibility. So polling should cover us where events fail (similar to the engine's VM monitoring).
It is a different case. With disk extensions there is always the fallback of actually hitting 100%, pausing the VM, and triggering the extend anyway. So I do not think there is a need for another mechanism when an event is missed (e.g. due to a Vdsm restart).

TL;DR: NEWS! First two patches (https://gerrit.ovirt.org/#/c/79386/5 and https://gerrit.ovirt.org/#/c/79264/14) are now review worthy!

On 07/23/2017 10:58 AM, Roy Golan wrote:
[...] So we can still use the BLOCK_THRESHOLD event for steady state, and avoid polling in the vast majority of the cases.
With "steady state" I mean that the VM is running, with no administration (snapshot, live merge, live storage migration...) operation in progress.
I think it is fair to assume that VMs are in this state the vast majority of the time. For the very important cases on which we cannot depend on events, we can fall back to polling, but in a smarter way:
instead of polling everything every 2s, let's just poll just the drives involved in the ongoing operations.
Those should be far less of the total amount of drives, and for a far shorter time than today, so polling should be practical.
Since the event fires once, we will need to rearm it only if the operation is ongoing, and only just before to start it (both conditions easy to check) We can disable the polling on completion, or on error. This per se is easy, but we will need a careful review of the flows, and perhaps some safety nets in place.
Consider fusing polling and events into a single pipeline of events so they can be used together. If a poll triggers an event (with distinguished origin) then it all the handling is done in one place and it should be easy to stop or start polling, or remove them totally.
Yes, this is the final design I have in mind. I have plans to refactor Vdsm master to make it look like that. It will play nice with the refactorings that the storage team has planned. Let's see if the virt refactorings are needed just to have the block threshold events, or if we can postpone them.
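
A rough sketch of that fused pipeline, with invented names (handle_block_threshold() is a stand-in for whatever the real per-drive extension entry point ends up being):

    import queue

    events = queue.Queue()

    def on_libvirt_threshold(conn, dom, dev, path, threshold, excess, opaque):
        # genuine libvirt event
        events.put(("event", dom.UUIDString(), dev))

    def poll_drives_in_operation(ongoing_drives):
        # smart poller: only drives involved in ongoing operations inject
        # synthetic events, tagged with a distinguished origin
        for vm_uuid, dev in ongoing_drives:
            events.put(("poll", vm_uuid, dev))

    def handle_block_threshold(vm_uuid, dev, origin):
        # placeholder for the real handler: look up the drive, try to extend
        print("extension check for %s/%s (origin=%s)" % (vm_uuid, dev, origin))

    def dispatch_forever():
        # one consumer for both origins: starting, stopping, or removing
        # the poller never touches the handling code
        while True:
            origin, vm_uuid, dev = events.get()
            handle_block_threshold(vm_uuid, dev, origin)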
On recovery, we will need to make sure to rearm all the relevant events, but we can just plug into the recovery we must do already, so this should be easy as well.
What is needed in order to 'rearm' it? Is there an API to get the state of the event subscription? If we lost an event, how do we know to rearm it? Is rearming idempotent?
QEMU supports a single threshold per block device (= node of the backing chain), so rearming a threshold just means setting a new threshold, overwriting the old one. To rearm it, we need to get the highest allocation of the block devices and set the threshold accordingly. If we do that among the first things of recovery, there should be little risk, if any. To know if we need to do that, we "just" need to inspect all block devices at recovery. It doesn't come for free, but I believe it is a fair price.
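
For illustration, a sketch of what that recovery-time rearming could look like; the 512 MiB headroom policy, the VM name and the drive list are invented for the example, not what the patches do:

    import libvirt

    HEADROOM = 512 * 1024 ** 2  # invented policy: rearm 512 MiB below the end

    def rearm_all_thresholds(dom, chunked_devs):
        # Inspect every chunked block device of a recovered VM and
        # overwrite whatever threshold (if any) was set before the restart.
        for dev in chunked_devs:
            capacity, allocation, physical = dom.blockInfo(dev)
            # keep the threshold above the current allocation so it can
            # still fire, but leave some headroom before the physical end
            threshold = max(allocation + 1, physical - HEADROOM)
            dom.setBlockThreshold(dev, threshold, 0)

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("a-vm")      # invented VM name
    rearm_all_thresholds(dom, ["vda"])   # invented chunked drive list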
Remind me, do we extend a disk if the VM paused with an out-of-space event?
Yes we do. We examine the last pause reason in recovery, and we do the extension in that case.
How will we handle 2 subsequent events if we didn't extend between them? (expecting the extend to be an async operation)
At the QEMU level, the event cannot fire twice; it must be rearmed after every firing. In general, should the virt code receive two events before the extension completes... I don't know yet :) Perhaps we can start by just handling the first event; I don't think we can easily queue extension requests (and I'm not sure we should).
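
One possible shape for "just handle the first event", sketched with an invented request_extension helper that is assumed to call back on completion:

    import threading

    class ThresholdDebouncer(object):
        # Illustrative sketch: drop threshold events for a drive while an
        # asynchronous extension for that same drive is still in flight.

        def __init__(self, request_extension):
            self._request_extension = request_extension  # assumed async helper
            self._lock = threading.Lock()
            self._in_flight = set()

        def on_threshold(self, vm_uuid, dev):
            key = (vm_uuid, dev)
            with self._lock:
                if key in self._in_flight:
                    return  # extension already requested: drop, don't queue
                self._in_flight.add(key)
            self._request_extension(vm_uuid, dev, done=self._on_extended)

        def _on_extended(self, vm_uuid, dev):
            with self._lock:
                self._in_flight.discard((vm_uuid, dev))
            # this is also the natural place to rearm the threshold,
            # once the new size is known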
I believe the best route is:
1. offer the new event-based code for 4.2, keep the polling around. Default to events for performance
2. remove the polling completely in 4.3
I still wonder if removing it totally is good. The absence of the events should be supervised somehow - as today, a failure to poll getstats of a domain results in the VM going unresponsive. Not the most accurate state, but at least it gives some visibility. So polling should cover us where events fail (similar to the engine's VM monitoring).
I don't have strong opinions about polling removal, as long as it is disabled by default. Actually, I like having fallbacks and safety nets in place. However, the libvirt event support is here to stay, and as time goes by, it should only get better (feature-wise and reliability-wise).
I'm currently working on the patches here: https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:watermark-event-minimal
Even though the basics are in place, I don't think they are ready for review yet.
First two patches (https://gerrit.ovirt.org/#/c/79386/5 and https://gerrit.ovirt.org/#/c/79264/14) are now review worthy!
-- Francesco Romani Senior SW Eng., Virtualization R&D Red Hat IRC: fromani github: @fromanirh

On 07/24/2017 10:05 AM, Francesco Romani wrote:
TL;DR: NEWS!
First two patches (https://gerrit.ovirt.org/#/c/79386/5 and https://gerrit.ovirt.org/#/c/79264/14) are now review worthy!
Please review the above two patches: they provide the minimal first step toward consuming the watermark event. We can build on them, and add/refactor what is needed to completely support the event.
https://gerrit.ovirt.org/#/c/79386/ - refactors Vm.extendDrivesIfNeeded to have one method which tries to extend a single drive. We will use this patch both in the existing poll-based flow and in the new event-based flow. Relatively big, but also quite simple and safe.
https://gerrit.ovirt.org/#/c/79264/ - adds the basic work to register and consume the event. This is needed for the steady state, and seems to work OK in my initial testing.
Bests,
-- Francesco Romani Senior SW Eng., Virtualization R&D Red Hat IRC: fromani github: @fromanirh
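
To give an idea of the shape this split enables, a hypothetical outline (method and helper names are guesses for illustration, not the actual patch):

    class Vm(object):
        def __init__(self, drives):
            self.drives = drives  # e.g. [{"name": "vda", "chunked": True}]

        def extendDrivesIfNeeded(self):
            # existing poll-based flow: scan every chunked drive
            for drive in self._chunked_drives():
                self.extendDriveIfNeeded(drive)

        def onBlockThreshold(self, dev):
            # new event-based flow: extend only the drive named by the event
            drive = self._drive_by_name(dev)
            if drive is not None:
                self.extendDriveIfNeeded(drive)

        def extendDriveIfNeeded(self, drive):
            # shared per-drive logic: check the allocation and, if too
            # little free space is left, ask storage for an extension
            pass

        def _chunked_drives(self):
            return [d for d in self.drives if d["chunked"]]

        def _drive_by_name(self, dev):
            for d in self._chunked_drives():
                if d["name"] == dev:
                    return d
            return None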
participants (4)
- Barak Korren
- Francesco Romani
- Michal Skrivanek
- Roy Golan