Cumulative VM network usage

Hello,

The need to monitor cumulative VM network usage has come up several times in the past; while this should be handled as part of (https://bugzilla.redhat.com/show_bug.cgi?id=1063343), in the meantime I've written a small Python script that monitors those statistics, attached here.

The script periodically polls the engine via the RESTful API and dumps the up-to-date total usage into a file. The output is a multi-level map/dictionary in JSON format, where:
* The top-level keys are VM names.
* Under each VM, the next-level keys are vNIC names.
* Under each vNIC, there are keys for total 'rx' (received) and 'tx' (transmitted), where the values are in bytes.

The script is built to run forever. It may be stopped at any time, but while it's not running, VM network usage data will "be lost". When it's re-run, it'll go back to accumulating data on top of its previous data.

A few disclaimers:
* I haven't tested this with any edge cases (engine service dies, etc.).
* Tested this with tens of VMs; not sure it'll work fine with hundreds.
* The PERIOD_TIME (polling interval) should be set so that it matches both the engine's and vdsm's polling intervals (see comments inside the script), otherwise data will be either lost or counted multiple times. From 3.4 onwards, the default configuration should be fine with 15 seconds.
* The precision of traffic measurement on a NIC is 0.1% of the interface's speed over each PERIOD_TIME interval. For example, on a 1 Gbps vNIC with PERIOD_TIME = 15s, data will only be measured in 15 Mb (~2 MB) quanta. Specifically, this means that in this example, any traffic smaller than 2 MB over a 15-second period would be negligible and wouldn't be recorded.

Knock yourselves out :)
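[Editorial sketch, not the attached script: a minimal version of the accumulate-and-dump loop the message describes. The function names and output file name are made up; the real script's internals may differ.]

```python
import json

OUTPUT_FILE = "usage.json"  # hypothetical output path

def load_totals(path=OUTPUT_FILE):
    """Resume from a previous dump, so a restarted script keeps
    accumulating on top of its old data."""
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return {}

def accumulate(totals, vm, nic, rx_bytes, tx_bytes):
    """Fold one polling period's byte counts into totals[vm][nic]."""
    entry = totals.setdefault(vm, {}).setdefault(nic, {"rx": 0, "tx": 0})
    entry["rx"] += rx_bytes
    entry["tx"] += tx_bytes

def dump_totals(totals, path=OUTPUT_FILE):
    with open(path, "w") as f:
        json.dump(totals, f, indent=2)

# One polling round; in the real script the numbers would come from
# the engine's REST API:
totals = load_totals()
accumulate(totals, "vm01", "nic1", 2000000, 500000)
dump_totals(totals)
```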

Hi Lior,

Thank you for this. Indeed I have seen multiple requests for this. I also have a bugzilla for it: https://bugzilla.redhat.com/show_bug.cgi?id=1108144. Some comments below.

On 11/11/2014 07:07 AM, Lior Vernia wrote:
> Hello,
>
> The need to monitor cumulative VM network usage has come up several
> times in the past; while this should be handled as part of
> (https://bugzilla.redhat.com/show_bug.cgi?id=1063343), in the mean time
> I've written a small Python script that monitors those statistics,
> attached here.
>
> The script polls the engine via RESTful API periodically and dumps the
> up-to-date total usage into a file. The output is a multi-level
> map/dictionary in JSON format, where:
> * The top level keys are VM names.
> * Under each VM, the next level keys are vNIC names.
> * Under each vNIC, there are keys for total 'rx' (received) and 'tx'
> (transmitted), where the values are in Bytes.
>
> The script is built to run forever. It may be stopped at any time, but
> while it's not running VM network usage data will "be lost". When it's
> re-run, it'll go back to accumulating data on top of its previous data.

This could be mitigated if, along with the rx and tx data, vdsm reported a timestamp reflecting the time when the data was collected. So, even with gaps, we should be able to calculate the cumulative information.

> A few disclaimers:
> * I haven't tested this with any edge cases (engine service dies, etc.).
> * Tested this with tens of VMs, not sure it'll work fine with hundreds.
> * The PERIOD_TIME (polling interval) should be set so that it matches
> both the engine's and vdsm's polling interval (see comments inside the
> script), otherwise data will be either lost or counted multiple times.
> From 3.4 onwards, default configuration should be fine with 15 seconds.
Here we have another issue. In 3.4, 15 seconds is fine... backend and vdsm are in line with 15 seconds. But up to 3.3, vdsm is polling the data every 5 seconds and the backend is collecting data every 15 seconds, so 2 in 3 vdsm polls are dropped. Since you're handling total bytes, this might not be a big issue.

> * The precision of traffic measurement on a NIC is 0.1% of the
> interface's speed over each PERIOD_TIME interval. For example, on a
> 1Gbps vNIC, when PERIOD_TIME = 15s, data will only be measured in 15Mb
> (~2MB) quanta. Specifically what this means is, that in this example,
> any traffic smaller than 2MB over a 15-second period would be negligible
> and wouldn't be recorded.

Looking at the code, if "overhead" is bigger than "PERIOD_TIME", cumulative data for a given period will never be accurate. In any case, the script will raise an exception when that happens (a negative value passed to time.sleep()). The mentioned timestamp reported by vdsm could remove the need for the "overhead" calculation.

> Knock yourselves out :)
>
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
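[Editorial sketch: the negative-sleep failure mode described above can be sidestepped by clamping the sleep; the function name is illustrative, not the script's.]

```python
import time

def sleep_remainder(period, started_at):
    """Sleep out whatever is left of the polling period. If this
    iteration's overhead already exceeded the period, skip sleeping
    instead of passing a negative value to time.sleep(), which
    raises ValueError."""
    remaining = period - (time.time() - started_at)
    if remaining > 0:
        time.sleep(remaining)
    return remaining  # a negative return signals an overrun

start = time.time()
# ... polling work would happen here ...
sleep_remainder(0.01, start)
```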

On 11/20/2014 09:20 AM, Amador Pahim wrote:
> Hi Lior,
>
> Thank you for this. Indeed I have seen multiple requests for this. I also have a bugzilla for it: https://bugzilla.redhat.com/show_bug.cgi?id=1108144. Some comments below.
>
> On 11/11/2014 07:07 AM, Lior Vernia wrote:
>> Hello,
>>
>> The need to monitor cumulative VM network usage has come up several times in the past; while this should be handled as part of (https://bugzilla.redhat.com/show_bug.cgi?id=1063343), in the mean time I've written a small Python script that monitors those statistics, attached here.
>>
>> The script polls the engine via RESTful API periodically and dumps the up-to-date total usage into a file. The output is a multi-level map/dictionary in JSON format, where:
>> * The top level keys are VM names.
>> * Under each VM, the next level keys are vNIC names.
>> * Under each vNIC, there are keys for total 'rx' (received) and 'tx' (transmitted), where the values are in Bytes.
>>
>> The script is built to run forever. It may be stopped at any time, but while it's not running VM network usage data will "be lost". When it's re-run, it'll go back to accumulating data on top of its previous data.
>
> This could be mitigated if along with rx and tx data, vdsm was reporting a timestamp reflecting the time when data was collected. So, even with gaps, we should be able to calculate the cumulative information.
Actually, vdsm is not reporting rx/tx bytes; it reports "tx/rx rate". So we're only able to see the average consumption over the interval between polls.
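[Editorial sketch of what that implies for a consumer: with a rate expressed as a percentage of interface speed, which is what the 0.1% precision remark in the original mail suggests, bytes per period have to be reconstructed by multiplying the rate by the elapsed time. The function below is illustrative, not vdsm's actual API.]

```python
def period_bytes(rate_percent, speed_mbps, seconds):
    """Approximate bytes moved during one period, given a rate
    expressed as a percentage of the interface speed (assumed
    reporting format, not an actual vdsm field)."""
    bits_per_second = speed_mbps * 1e6 * (rate_percent / 100.0)
    return bits_per_second * seconds / 8.0

# The measurement quantum from the original mail: 0.1% of a 1 Gbps
# vNIC over 15 s is ~1.9 MB (the "~2MB" figure).
print(period_bytes(0.1, 1000, 15))
```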
>> A few disclaimers:
>> * I haven't tested this with any edge cases (engine service dies, etc.).
>> * Tested this with tens of VMs, not sure it'll work fine with hundreds.
>> * The PERIOD_TIME (polling interval) should be set so that it matches both the engine's and vdsm's polling interval (see comments inside the script), otherwise data will be either lost or counted multiple times. From 3.4 onwards, default configuration should be fine with 15 seconds.
>
> Here we have another issue. In 3.4, 15 seconds is fine... backend and vdsm are in line with 15 seconds. But up to 3.3, vdsm is polling the data every 5 seconds and the backend is collecting data every 15 seconds, so 2 in 3 vdsm polls are dropped. Since you're handling total bytes, this might not be a big issue.
Forget the last sentence. It's a big issue, since the data is not cumulative but an average over the period between vdsm checks. bz#1066570 is the solution for precise calculations here.
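[Editorial illustration of the pitfall, with made-up numbers: if vdsm recomputes a 5-second average every 5 seconds but the engine samples only every 15 seconds, the engine's sample reflects just the last window, and scaling it to the full period misstates the traffic.]

```python
# Average rates (bytes/s) for three consecutive 5 s vdsm windows;
# the traffic pattern is invented for illustration:
windows = [10e6, 0.0, 2e6]

true_bytes = sum(rate * 5 for rate in windows)  # what was really sent
engine_bytes = windows[-1] * 15                 # what a 15 s sampler infers

print(true_bytes, engine_bytes)  # the estimate is off by a factor of 2 here
```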
>> * The precision of traffic measurement on a NIC is 0.1% of the interface's speed over each PERIOD_TIME interval. For example, on a 1Gbps vNIC, when PERIOD_TIME = 15s, data will only be measured in 15Mb (~2MB) quanta. Specifically what this means is, that in this example, any traffic smaller than 2MB over a 15-second period would be negligible and wouldn't be recorded.
>
> Looking at the code, if "overhead" is bigger than "PERIOD_TIME", cumulative data for a given period will never be accurate. In any case, the script will raise an exception when that happens (a negative value passed to time.sleep()). The mentioned timestamp reported by vdsm could remove the need for the "overhead" calculation.
>> Knock yourselves out :)
participants (2)
- Amador Pahim
- Lior Vernia