
VM Pools are a nice feature of oVirt. A VM pool lets you quickly create a pool of stateless VMs all based on the same template. A VM pool also seems to currently be the only way to create template-based thin QCOW2 VMs in oVirt (cloning from a template creates a thick copy, which is why it is relatively slow). With the autostart [1] feature, you can have the VMs auto-started when the pool is started; it also means VMs get started automatically a few minutes after they are shut down. What this comes down to is that if you run 'shutdown' in a VM from a pool, you will automatically get back a clean VM a few minutes later.

Unfortunately VM pools are not without their shortcomings; I've documented two of these in BZ#1298235 [2] and BZ#1298232 [3]. What this means in essence is that oVirt does not give you a way to predictably assign names or IPs to VMs in a pool.

So how do we solve this?

Since the ultimate goal for VMs in a pool is to become Jenkins slaves, one solution is to use the swarm plugin [4]. With the swarm plugin, the actual name and address of the slave VM become far less important. We could quite easily set up the cloud-init invoked for VMs in the pool to download the swarm plugin client and then run it to register to Jenkins while setting labels according to various system properties.

The question remains how to assign IP addresses and names to the pool VMs. We will probably need a range of IP addresses that is pre-assigned to a range of DNS records and that will be assigned to pool VMs as they boot up.

Currently our DHCP and DNS servers in PHX are managed by Foreman in a semi-random fashion. As we've seen in the past, this is subject to various failures, such as the MAC address of the Foreman record getting out of sync with the one of the VM (for example due to Facter reporting a bad address after a particularly nasty VDSM test run), or the DNS record going out of sync with the VM's host name and address in the DHCP. At this point I think we have enough evidence against Foreman's style of managing DNS and DHCP, so I suggest we:
1. Cease creating new VMs in PHX via Foreman for a while.
2. Shut down the PHX Foreman proxy to disconnect it from managing the DNS and DHCP.
3. Map out our currently active MAC->IP->HOSTNAME combinations and create static DNS and DHCP configuration files (I suggest we also migrate from BIND+ISC DHCPD to Dnsmasq, which is far easier to configure and provides very tight DNS, DHCP and TFTP integration); see the configuration-generation sketch after this message.
4. Add configuration for a dynamically assigned IP range as described above.

Another way to resolve the problem of coming up with a dynamically assignable range of IPs is to create a new VLAN in PHX for the new pools of VMs.

One more issue we need to consider is how to use Puppet on the pool VMs. We would probably still like Puppet to run in order to set up SSH access for us, as well as other things needed on the slave. Possibly we would also like the swarm plugin client to actually be installed and activated by Puppet, as that would grant us easy access to Facter facts for determining the labels the slave should have, while also ensuring the slave will not become available to Jenkins until it is actually ready for use. It is easy enough to get Puppet running via a cloud-init script, but the issue here is how to select classes for the new VMs. Since they are not created in Foreman, they will not get assigned to hostgroups, and therefore class assignment by way of hostgroup membership will not work. I see a few ways to resolve this:
1. Add a 'node' entry in 'site.pp' to detect pool VMs (with a name regex) and assign classes to them.
2. Use 'hiera_include' [5] in 'site.pp' to assign classes by facts via Hiera.
3. Use a combination of the two methods above to ensure 'hiera_include' gets applied to, and only to, pool VMs.

These are my thoughts about this so far. I am working on building a POC for this, but I would be happy to hear other thoughts and opinions at this point.

[1]: http://www.ovirt.org/Features/PrestartedVm
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1298235
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1298232
[4]: https://wiki.jenkins-ci.org/display/JENKINS/Swarm+Plugin
[5]: https://docs.puppetlabs.com/hiera/1/puppet.html#assigning-classes-to-nodes-with-hiera-hierainclude

--
Barak Korren
bkorren@redhat.com
RHEV-CI Team
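To make steps 3 and 4 of the plan above a bit more concrete, here is a minimal sketch of how a static Dnsmasq configuration could be generated from a mapped-out MAC->IP->HOSTNAME list, together with a dynamic range whose addresses are pre-assigned to a matching range of DNS records for the pool VMs. The MACs, addresses, host names and range below are made-up placeholders rather than our real PHX data, and the exact Dnsmasq options we end up with may well differ:

#!/usr/bin/env python
# Sketch: emit dnsmasq DHCP/DNS config from a MAC->IP->HOSTNAME map.
# All data below is placeholder, not the real PHX inventory.

# Static entries mapped out from the currently active hosts.
STATIC_HOSTS = [
    # (MAC, IP, hostname)
    ("52:54:00:aa:bb:01", "10.0.0.11", "jenkins-slave-01.example.com"),
    ("52:54:00:aa:bb:02", "10.0.0.12", "jenkins-slave-02.example.com"),
]

# Dynamic range handed out to pool VMs, pre-assigned to DNS records.
POOL_PREFIX = "10.0.1."
POOL_FIRST, POOL_LAST = 100, 149
POOL_NAME_FMT = "pool-vm-{0:03d}.example.com"


def main():
    lines = []
    # dhcp-host pins MAC -> IP -> short name for the statically mapped hosts,
    # host-record provides the matching DNS record.
    for mac, ip, name in STATIC_HOSTS:
        lines.append("dhcp-host={0},{1},{2}".format(mac, ip, name.split(".")[0]))
        lines.append("host-record={0},{1}".format(name, ip))
    # The dynamic range for pool VMs, with a DNS record pre-created for every
    # address in it, so whatever address a pool VM gets already has a name.
    lines.append("dhcp-range={0}{1},{0}{2},12h".format(POOL_PREFIX, POOL_FIRST, POOL_LAST))
    for i in range(POOL_FIRST, POOL_LAST + 1):
        lines.append("host-record={0},{1}{2}".format(POOL_NAME_FMT.format(i), POOL_PREFIX, i))
    print("\n".join(lines))


if __name__ == "__main__":
    main()

The output would simply be dropped into dnsmasq's configuration directory (e.g. /etc/dnsmasq.d) and regenerated whenever the static mapping changes.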

On 01/13 18:23, Barak Korren wrote:
VM Pools are a nice feature of oVirt. A VM pool lets you quickly create a pool of stateless VMs all based on the same template. A VM pool also seems to currently be the only way to create template-based thin QCOW2 VMs in oVirt (cloning from a template creates a thick copy, which is why it is relatively slow). With the autostart [1] feature, you can have the VMs auto-started when the pool is started; it also means VMs get started automatically a few minutes after they are shut down. What this comes down to is that if you run 'shutdown' in a VM from a pool, you will automatically get back a clean VM a few minutes later.
Is there an easy way to do so from a Jenkins job without failing the job with a slave connection error? Most projects I know that use ephemeral slaves have to work around it by having a job that starts/creates a slave tag and provisions the slave, and removes it at the end; if we can skip that extra job level, better for us.
Unfortunately VM pools are not without their shortcomings; I've documented two of these in BZ#1298235 [2] and BZ#1298232 [3]. What this means in essence is that oVirt does not give you a way to predictably assign names or IPs to VMs in a pool.

So how do we solve this?

Since the ultimate goal for VMs in a pool is to become Jenkins slaves, one solution is to use the swarm plugin [4]. With the swarm plugin, the actual name and address of the slave VM become far less important. We could quite easily set up the cloud-init invoked for VMs in the pool to download the swarm plugin client and then run it to register to Jenkins while setting labels according to various system properties.
iirc the puppet manifest for jenkins already has integration with the swarm plugin, we can use that instead.
The question remains how to assign IP addresses and names to the pool VMs. We will probably need a range of IP addresses that is pre-assigned to a range of DNS records and that will be assigned to pool VMs as they boot up.

Currently our DHCP and DNS servers in PHX are managed by Foreman in a semi-random fashion. As we've seen in the past, this is subject to various failures, such as the MAC address of the Foreman record getting out of sync with the one of the VM (for example due to Facter reporting a bad address after a particularly nasty VDSM test run), or the DNS record going out of sync with the VM's host name and address in the DHCP. At this point I think we have enough evidence against Foreman's style of managing DNS and DHCP, so I suggest we:
1. Cease creating new VMs in PHX via Foreman for a while.
2. Shut down the PHX Foreman proxy to disconnect it from managing the DNS and DHCP.
3. Map out our currently active MAC->IP->HOSTNAME combinations and create static DNS and DHCP configuration files (I suggest we also migrate from BIND+ISC DHCPD to Dnsmasq, which is far easier to configure and provides very tight DNS, DHCP and TFTP integration).
4. Add configuration for a dynamically assigned IP range as described above.
Can't we just use a reserved range for those machines instead? There's no need to remove anything from Foreman; it can work with machines it does not provision.
Another way to resolve the problem of coming up with a dynamically assignable range of IPs is to create a new VLAN in PHX for the new pools of VMs.
I'm in favor of using an internal network for the Jenkins slaves; if they are the ones connecting to the master there's no need for externally addressable IPs, so no need for public IPs. Though I recall that it was not so easy to set up; better to discuss it with the hosting provider.
One more issue we need to consider is how to use Puppet on the pool VMs. We would probably still like Puppet to run in order to set up SSH access for us, as well as other things needed on the slave. Possibly we would also like the swarm plugin client to actually be installed and activated by Puppet, as that would grant us easy access to Facter facts for determining the labels the slave should have, while also ensuring the slave will not become available to Jenkins until it is actually ready for use. It is easy enough to get Puppet running via a cloud-init script, but the issue here is how to select classes for the new VMs. Since they are not created in Foreman, they will not get assigned to hostgroups, and therefore class assignment by way of hostgroup membership will not work.
Can't you just auto-assign a hostgroup on creation in Foreman or something? A quick search throws up a plugin that might do the trick: https://github.com/GregSutcliffe/foreman_default_hostgroup
+1 on moving any data aside from the hostgroup assignment to Hiera though, so it can be versioned and peer-reviewed.
I see a few ways to resolve this:
1. Add a 'node' entry in 'site.pp' to detect pool VMs (with a name regex) and assign classes to them.
2. Use 'hiera_include' [5] in 'site.pp' to assign classes by facts via Hiera.
3. Use a combination of the two methods above to ensure 'hiera_include' gets applied to, and only to, pool VMs.

These are my thoughts about this so far. I am working on building a POC for this, but I would be happy to hear other thoughts and opinions at this point.

[1]: http://www.ovirt.org/Features/PrestartedVm
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1298235
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1298232
[4]: https://wiki.jenkins-ci.org/display/JENKINS/Swarm+Plugin
[5]: https://docs.puppetlabs.com/hiera/1/puppet.html#assigning-classes-to-nodes-with-hiera-hierainclude

--
Barak Korren
bkorren@redhat.com
RHEV-CI Team
--
David Caro
Red Hat S.L.
Continuous Integration Engineer - EMEA ENG Virtualization R&D
Tel.: +420 532 294 605
Email: dcaro@redhat.com
IRC: dcaro|dcaroest@{freenode|oftc|redhat}
Web: www.redhat.com
RHT Global #: 82-62605

Hello All.
What this comes down to is that if you run 'shutdown' in a VM from a
pool, you will automatically get back a clean VM a few minutes later.
Is there an easy way to do so from a Jenkins job without failing the job with a slave connection error? Most projects I know that use ephemeral
But why do we need it here? Do we really need to target ephemeral slaves, or is UI management of pool servers not good enough in oVirt?
1. Cease creating new VMs in PHX via Foreman for a while.
2. Shut down the PHX Foreman proxy to disconnect it from managing the DNS and DHCP.
3. Map out our currently active MAC->IP->HOSTNAME combinations and create static DNS and DHCP configuration files (I suggest we also migrate from BIND+ISC DHCPD to Dnsmasq, which is far easier to configure and provides very tight DNS, DHCP and TFTP integration).
4. Add configuration for a dynamically assigned IP range as described above.
Can't we just use a reserved range for those machines instead? There's no need to remove anything from Foreman; it can work with machines it does not provision.
As I understand it, the problem here is that in one VLAN we obviously can have only one DHCP server, and if it is managed by Foreman it may not be possible to have a range there that is not touchable by Foreman. But it depends on how Foreman touches the DHCP config.
Another way to resolve the problem of coming up with a dynamically assignable range of IPs is to create a new VLAN in PHX for the new pools of VMs.
I'm in favor of using an internal network for the Jenkins slaves; if they are the ones connecting to the master there's no need for externally addressable IPs, so no need for public IPs. Though I recall that it was not so easy to set up; better to discuss it with the hosting provider.
I think that if we want to scale, public IPv4 addresses might indeed be quite wasteful. I thought about using IPv6, since e.g. we can just have one prefix and there is no need for DHCP, so such VMs could live in the same VLAN as Foreman, if needed, with no problem. But as I understand it we need IPv4 addressing on the slaves for the tests, do I get that right?
Can't you just auto-assign a hostgroup on creation in Foreman or something? A quick search throws up a plugin that might do the trick: https://github.com/GregSutcliffe/foreman_default_hostgroup
+1 on moving any data aside from the hostgroup assignment to Hiera though, so it can be versioned and peer-reviewed.
Can we somehow utilize cloud-init for this?

Also, do we really want to use vanilla OS templates for this instead of building our own, based on the vanilla ones but with the configuration settings we need? I think it would also speed up slave creation, although since they are not ephemeral this will not give much.

--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat

On 01/13 18:02, Anton Marchukov wrote:
Hello All.
What this comes down to is that if you run 'shutdown' in a VM from a
pool, you will automatically get back a clean VM a few minutes later.
Is there an easy way to do so from a Jenkins job without failing the job with a slave connection error? Most projects I know that use ephemeral
But why do we need it here? Do we really need to target ephemeral slaves, or is UI management of pool servers not good enough in oVirt?
The issue is being able to recycle the slaves without breaking any Jenkins jobs, and if possible, automatically. IIUC the key idea of those slaves is that they are ephemeral, so we can create/destroy them on demand really easily.
1. Cease creating new VMs in PHX via Foreman for a while.
2. Shut down the PHX Foreman proxy to disconnect it from managing the DNS and DHCP.
3. Map out our currently active MAC->IP->HOSTNAME combinations and create static DNS and DHCP configuration files (I suggest we also migrate from BIND+ISC DHCPD to Dnsmasq, which is far easier to configure and provides very tight DNS, DHCP and TFTP integration).
4. Add configuration for a dynamically assigned IP range as described above.
Can't we just use a reserved range for those machines instead? There's no need to remove anything from Foreman; it can work with machines it does not provision.
As I understand it, the problem here is that in one VLAN we obviously can have only one DHCP server, and if it is managed by Foreman it may not be possible to have a range there that is not touchable by Foreman. But it depends on how Foreman touches the DHCP config.
We already have reserved IPs and ranges in the same DHCP that is managed by Foreman.
Another way to resolve the problem of coming up with a dynamically assignable range of IPs is to create a new VLAN in PHX for the new pools of VMs.
I'm in favor of using an internal network for the Jenkins slaves; if they are the ones connecting to the master there's no need for externally addressable IPs, so no need for public IPs. Though I recall that it was not so easy to set up; better to discuss it with the hosting provider.
I think that if we want to scale, public IPv4 addresses might indeed be quite wasteful. I thought about using IPv6, since e.g. we can just have one prefix and there is no need for DHCP, so such VMs could live in the same VLAN as Foreman, if needed, with no problem. But as I understand it we need IPv4 addressing on the slaves for the tests, do I get that right?
I'm not really sure, but if we are using Lago for the functional tests, maybe there's no need for them. I'm not really familiar with IPv6; maybe it's time to get to know it :)
Can't you just auto-assign a hostgroup on creation in Foreman or something? A quick search throws up a plugin that might do the trick: https://github.com/GregSutcliffe/foreman_default_hostgroup
+1 on moving any data aside from the hostgroup assignment to Hiera though, so it can be versioned and peer-reviewed.
Can we somehow utilize cloud-init for this?
I don't like the slaves explicitly registering themselves into Foreman; that makes the provisioning totally coupled with it from the slave's perspective.
Also, do we really want to use vanilla OS templates for this instead of building our own, based on the vanilla ones but with the configuration settings we need? I think it would also speed up slave creation, although since they are not ephemeral this will not give much.

--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat
--
David Caro
dcaro@redhat.com

Can't you just auto-assign a hostgroup on creation in Foreman or something?
Can we somehow utilize cloud-init for this?
Foreman was designed with a flow of learning about existing servers from the Puppet reports they generate. This flow is so baked into Foreman that it often breaks other flows where Foreman is the one creating the hosts (indeed, some of our issues with it are due to that). Given that, I wouldn't want to go and invent our own Foreman registration flow. Also, keep in mind that Foreman was designed as a tool for _manual_ host classification, because Puppet opened the window for creating such tools by supporting ENCs (indeed, the good old Puppet Dashboard and Puppet Enterprise essentially do the same thing, with hostgroups and all). For _automatic_ classification Puppet already has very good and reliable capabilities.
Also, do we really want to use vanilla OS templates for this instead of building our own, based on the vanilla ones but with the configuration settings we need? I think it would also speed up slave creation, although since they are not ephemeral this will not give much.
There is nothing about slave pools that prevents you from either using a vanilla template or a baked one. Having said that, I'm of the opinion that any configuration we do, except enabling Jenkins to connect and us to manage and monitor the slave, is a change that may mask out deployment issues that real users will experience. I think we need to narrow down the configuration to the point that its run time is trivial.

Also, slave creation time is important only if you have to make your tests wait for it. If we can make things so that slaves become available to Jenkins only after all needed configuration was done, then how long it takes becomes far less important. It seems that OpenStack has made people used to the style of thinking where you create VMs and bring them up on the fly, and then creation time is very important. But we are oVirt, we can think differently. Given a fixed pool of hardware resources, a fixed pool of VMs that are up and ready is a viable option. And in practice it will mean service time for the jobs will be faster.

--
Barak Korren
bkorren@redhat.com
RHEV-CI Team

Is there an easy way to do so from a Jenkins job without failing the job with a slave connection error? Most projects I know that use ephemeral slaves have to work around it by having a job that starts/creates a slave tag and provisions the slave, and removes it at the end; if we can skip that extra job level, better for us.
Maybe we could use [1] or [2] to trigger an external service. We can use [3] to prevent race conditions. It also opens up the possibility of a 'garbage collector' job that will shut down and remove offline slaves (which will cause pool VMs to come back up clean and re-join Jenkins with the swarm client).
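For what it's worth, the 'garbage collector' part would not have to be anything fancy; something along the lines of the sketch below, run periodically as a Jenkins job, could be enough. The Jenkins URL, the credentials and the 'pool-' slave-name prefix are placeholder assumptions, CSRF crumb handling is left out, and it assumes the pool VM shuts itself down separately, so all the collector does is remove the stale node entries:

#!/usr/bin/env python
# Sketch: remove offline pool slaves from Jenkins so their VMs can be recycled.
# URL, credentials and naming scheme are placeholders; crumb handling omitted.
import requests

JENKINS_URL = "https://jenkins.example.com"    # placeholder
AUTH = ("gc-bot", "api-token-goes-here")       # placeholder API token
POOL_NODE_PREFIX = "pool-"                     # assumed pool-slave naming scheme


def offline_pool_nodes():
    resp = requests.get(
        JENKINS_URL + "/computer/api/json",
        params={"tree": "computer[displayName,offline]"},
        auth=AUTH,
    )
    resp.raise_for_status()
    for node in resp.json()["computer"]:
        name = node["displayName"]
        if name.startswith(POOL_NODE_PREFIX) and node["offline"]:
            yield name


def main():
    for name in offline_pool_nodes():
        # Delete only the Jenkins-side node entry; the VM itself is expected
        # to power off on its own and be recycled by the oVirt pool.
        requests.post(
            JENKINS_URL + "/computer/{0}/doDelete".format(name),
            auth=AUTH,
        ).raise_for_status()
        print("removed offline slave: " + name)


if __name__ == "__main__":
    main()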
iirc the puppet manifest for jenkins already has integration with the swarm plugin, we can use that instead.
Great, I'll look into that.
Can't we just use a reserved range for those machines instead? There's no need to remove anything from Foreman; it can work with machines it does not provision.
Do we have such a range available? I was under the impression that I would have to wrestle it out of our existing range, in which Foreman had been poking holes at random...
I'm in favor of using an internal network for the Jenkins slaves; if they are the ones connecting to the master there's no need for externally addressable IPs, so no need for public IPs. Though I recall that it was not so easy to set up; better to discuss it with the hosting provider.
I think that even with swarm, eventually it's Jenkins itself that will open connections to the slaves (the Swarm plugin AFAIK is just used to notify Jenkins about slave existence; after that it is used just like a regular slave, with SSH from Jenkins), so you will need external addresses for the slaves as long as Jenkins is not running in PHX.
Can't you just auto-assign a hostgroup on creation in Foreman or something? A quick search throws up a plugin that might do the trick: https://github.com/GregSutcliffe/foreman_default_hostgroup
+1 on moving any data aside from the hostgroup assignment to Hiera though, so it can be versioned and peer-reviewed.
I kinda prefer to move Foreman out of the provisioning process here; I've been burned by our bad experience with it. And it seems to me we are agreed on this.

[1]: https://wiki.jenkins-ci.org/display/JENKINS/Notification+Plugin
[2]: http://git.openstack.org/cgit/openstack-infra/zmq-event-publisher/tree/README
[3]: https://wiki.jenkins-ci.org/display/JENKINS/Single+Use+Slave+Plugin

--
Barak Korren
bkorren@redhat.com
RHEV-CI Team

On 01/14 10:41, Barak Korren wrote:
Is there an easy way to do so from a Jenkins job without failing the job with a slave connection error? Most projects I know that use ephemeral slaves have to work around it by having a job that starts/creates a slave tag and provisions the slave, and removes it at the end; if we can skip that extra job level, better for us.
Maybe we could use [1] or [2] to trigger an external service. We can use [3] to prevent race conditions. It also opens up the possibility of a 'garbage collector' job that will shut down and remove offline slaves (which will cause pool VMs to come back up clean and re-join Jenkins with the swarm client).
So essentially no, this is what I said I wanted to avoid :/
iirc the puppet manifest for jenkins already has integration with the swarm plugin, we can use that instead.
Great, I'll look into that.
Can't we just use a reserved range for those machines instead? There's no need to remove anything from Foreman; it can work with machines it does not provision.
Do we have such a range available? I was under the impression that I would have to wrestle it out of our existing range, in which Foreman had been poking holes at random...
We have a small range right now for non-Jenkins VMs, but it's easy (maybe not fast, but easy) to get the slaves to free another range. But we would have to do so anyhow, unless we use internal IPs or request a new range.
I'm in favor of using an internal network for the Jenkins slaves; if they are the ones connecting to the master there's no need for externally addressable IPs, so no need for public IPs. Though I recall that it was not so easy to set up; better to discuss it with the hosting provider.
I think that even with swarm, eventually it's Jenkins itself that will open connections to the slaves (the Swarm plugin AFAIK is just used to notify Jenkins about slave existence; after that it is used just like a regular slave, with SSH from Jenkins), so you will need external addresses for the slaves as long as Jenkins is not running in PHX.
AFAIK the swarm plugin is an extension of the JNLP slave connection method, and does not allow changing the connection method to SSH; it uses its own (or so it seems from the docs, maybe that changed), which is to connect to the master from the slave.
Can't you just auto-assign a hostgroup on creation in Foreman or something? A quick search throws up a plugin that might do the trick: https://github.com/GregSutcliffe/foreman_default_hostgroup
+1 on moving any data aside from the hostgroup assignment to Hiera though, so it can be versioned and peer-reviewed.
I kinda prefer to move Foreman out of the provisioning process here; I've been burned by our bad experience with it. And it seems to me we are agreed on this.

[1]: https://wiki.jenkins-ci.org/display/JENKINS/Notification+Plugin
[2]: http://git.openstack.org/cgit/openstack-infra/zmq-event-publisher/tree/README
[3]: https://wiki.jenkins-ci.org/display/JENKINS/Single+Use+Slave+Plugin

--
Barak Korren
bkorren@redhat.com
RHEV-CI Team
--
David Caro
dcaro@redhat.com

So essentially no, this is what I said I wanted to avoid :/
I was under the impression you are thinking about a wrapper job you need to wrap around every job. This is a single, out of band, job. So it may not be that bad. You seem to imply that slaves managed by the Swarm plugin are not 'normal' ssh-based slaves, so there might be something there we can exploit (For example, perhaps the swarm client JAR can be made to exit once the slave is brought offline, so we can wrap it in a script that will shut the slave down when it does). I will look deeper into this in my POC.
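To illustrate the idea, here is a very rough sketch of what such a wrapper could look like on the slave side: start the swarm client, wait for it to exit, then power the VM off so the pool's autostart brings back a clean one. The swarm client option names, the jar path, the Jenkins URL and the premise that the client can be made not to retry forever once its node is taken offline or deleted are all assumptions/placeholders to verify in the POC, not a working recipe:

#!/usr/bin/env python
# Sketch: run the swarm client and power the slave off once it exits.
# Option names, paths and URL are assumptions; auth options omitted.
import socket
import subprocess

JENKINS_URL = "https://jenkins.example.com"     # placeholder
SWARM_JAR = "/usr/local/lib/swarm-client.jar"   # placeholder path


def slave_labels():
    # Labels derived from simple system properties; with Puppet in the
    # picture these could come from Facter facts instead.
    labels = ["pool-slave"]
    with open("/etc/redhat-release") as fd:
        if "release 7" in fd.read():
            labels.append("el7")
    return labels


def main():
    # Blocks until the swarm client terminates (assuming it can be configured
    # to give up once its node is deleted or taken offline).
    subprocess.call([
        "java", "-jar", SWARM_JAR,
        "-master", JENKINS_URL,
        "-name", socket.gethostname(),
        "-labels", " ".join(slave_labels()),
    ])
    # Power off (needs root); the oVirt pool autostart then brings back
    # a clean VM a few minutes later.
    subprocess.call(["systemctl", "poweroff"])


if __name__ == "__main__":
    main()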
We have a small range right now for non-Jenkins VMs, but it's easy (maybe not fast, but easy) to get the slaves to free another range. But we would have to do so anyhow, unless we use internal IPs or request a new range.
So essentially we need to do the mapping work I mentioned in my original mail. Or I'm not understanding what you mean.
AFAIK the swarm plugin is an extension of the JNLP slave connection method, and does not allow changing the connection method to SSH; it uses its own (or so it seems from the docs, maybe that changed), which is to connect to the master from the slave.
I got a different impression from the docs; we will just have to try it and see, I guess.

--
Barak Korren
bkorren@redhat.com
RHEV-CI Team

I was under the impression you are thinking about a wrapper job you need to wrap around every job. This is a single, out of band, job. So it may not be that bad. You seem to imply that slaves managed by the Swarm plugin are not 'normal' ssh-based slaves, so there might be something there we can exploit (For example, perhaps the swarm client JAR can be made to exit once the slave is brought offline, so we can wrap it in a script that will shut the slave down when it does). I will look deeper into this in my POC.
Isn't there any ability to hook into the shutdown process and delay it from the hook itself? There are vdsm hooks for that, but I am not sure how the pool scheduler interacts with them. Maybe we can ask on the users list. As I see it, the ideal is to catch the shutdown, then run some hook that puts the slave into maintenance, waits for the job to finish and then unblocks the shutdown.

I had the same problem when I was thinking about how to get migration back for local-disk slaves so auto-balancing could be used for them. The only troublesome part was interacting with user land to get an idea about whether it is safe. Sounds like a feature request?

--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat

On 14 January 2016 at 12:02, Anton Marchukov <amarchuk@redhat.com> wrote:
I was under the impression you are thinking about a wrapper job you need to wrap around every job. This is a single, out of band, job. So it may not be that bad. You seem to imply that slaves managed by the Swarm plugin are not 'normal' ssh-based slaves, so there might be something there we can exploit (For example, perhaps the swarm client JAR can be made to exit once the slave is brought offline, so we can wrap it in a script that will shut the slave down when it does). I will look deeper into this in my POC.
Isn't there any ability to hook into the shutdown process and delay it from the hook itself? There are vdsm hooks for that, but I am not sure how the pool scheduler interacts with them. Maybe we can ask on the users list. As I see it, the ideal is to catch the shutdown, then run some hook that puts the slave into maintenance, waits for the job to finish and then unblocks the shutdown.
But this is the reverse of what we need; the problem is how to make the slave shut down in the first place. You can't just do it from the job that used it, because it will make the job fail.

But maybe we can actually use the good old 'shutdown $TIME_DELAY' to make the slave shut down a few seconds after the job is done... I can't believe I forgot you can time-delay a shutdown... I was initially thinking of 'at' and then I remembered this...

--
Barak Korren
bkorren@redhat.com
RHEV-CI Team

On 01/14 12:41, Barak Korren wrote:
On 14 January 2016 at 12:02, Anton Marchukov <amarchuk@redhat.com> wrote:
I was under the impression you are thinking about a wrapper job you need to wrap around every job. This is a single, out of band, job. So it may not be that bad. You seem to imply that slaves managed by the Swarm plugin are not 'normal' ssh-based slaves, so there might be something there we can exploit (For example, perhaps the swarm client JAR can be made to exit once the slave is brought offline, so we can wrap it in a script that will shut the slave down when it does). I will look deeper into this in my POC.
Isn't there any ability to hook into the shutdown process and delay it from the hook itself? There are vdsm hooks for that, but I am not sure how the pool scheduler interacts with them. Maybe we can ask on the users list. As I see it, the ideal is to catch the shutdown, then run some hook that puts the slave into maintenance, waits for the job to finish and then unblocks the shutdown.
But this is the reverse of what we need; the problem is how to make the slave shut down in the first place. You can't just do it from the job that used it, because it will make the job fail.

But maybe we can actually use the good old 'shutdown $TIME_DELAY' to make the slave shut down a few seconds after the job is done... I can't believe I forgot you can time-delay a shutdown... I was initially thinking of 'at' and then I remembered this...
You end up with a race condition anyhow; if there's a post-build job that takes a bit too long, it will break it.
--
Barak Korren
bkorren@redhat.com
RHEV-CI Team
--
David Caro
dcaro@redhat.com

But this is the reverse of what we need; the problem is how to make the slave shut down in the first place. You can't just do it from the job that used it, because it will make the job fail.
Hm. I think some hybrid option is needed. Once the job is finished we should unlabel the slave and then use some garbage collection to kill the used slaves. I believe this can be done using a system Groovy script. And I think instead of removing labels we should just add a new one, e.g. we add "to_be_removed" and then schedule based on slaves not having that label. Something like how data are purged from a database with a delete flag.
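If a system Groovy step turns out to be awkward, roughly the same 'flag now, collect later' idea can also be sketched over the Jenkins REST API, using the node's temporarily-offline flag in place of a 'to_be_removed' label (changing labels over the API is less straightforward); this is only a variation on the idea, with the URL, credentials and message being placeholders and crumb handling again omitted. The job's last step would run something like:

#!/usr/bin/env python
# Sketch: mark the build's own node for recycling by taking it offline.
# Placeholders for URL/credentials; CSRF crumb handling omitted.
import os
import requests

JENKINS_URL = os.environ.get("JENKINS_URL", "https://jenkins.example.com")
NODE_NAME = os.environ["NODE_NAME"]       # set by Jenkins for the running build
AUTH = ("gc-bot", "api-token-goes-here")  # placeholder API token


def mark_for_recycling():
    # toggleOffline flips the node's temporarily-offline flag, so this should
    # only run while the node is still online; the collector job then picks up
    # offline nodes whose message matches and shuts them down / deletes them.
    requests.post(
        "{0}/computer/{1}/toggleOffline".format(JENKINS_URL, NODE_NAME),
        data={"offlineMessage": "to_be_removed"},
        auth=AUTH,
    ).raise_for_status()


if __name__ == "__main__":
    mark_for_recycling()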
But maybe we can actually use the good old 'shutdown $TIME_DELAY' to make the slave shut down a few seconds after the job is done... I can't believe I forgot you can time-delay a shutdown... I was initially thinking of 'at' and then I remembered this...
I do not like anything that relies on delays, as it will raise a race condition at some point. The probability is debatable, but first we should try to design the system without it if possible.

--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat

On 01/14 11:56, Anton Marchukov wrote:
But this is the reverse of what we need; the problem is how to make the slave shut down in the first place. You can't just do it from the job that used it, because it will make the job fail.
Hm. I think some hybrid option is needed. Once the job is finished we should unlabel the slave and then use some garbage collection to kill the used slaves. I believe this can be done using a system Groovy script. And I think instead of removing labels we should just add a new one, e.g. we add "to_be_removed" and then schedule based on slaves not having that label. Something like how data are purged from a database with a delete flag.
Well, the plugin that Barak passed before, the one that forces a slave to be used only once, is what the OpenStack guys use, in combination with Nodepool, a big Python service to provision/manage slaves. That's what I wanted to avoid: that extra service (as a Jenkins job or not) to exclusively handle slaves, when we already have the oVirt pool stuff.
But maybe we can actually use the good old 'shutdown $TIME_DELAY' to make the slave shut down a few seconds after the job is done... I can't believe I forgot you can time-delay a shutdown... I was initially thinking of 'at' and then I remembered this...
I do not like anything that relies on delays, as it will raise a race condition at some point. The probability is debatable, but first we should try to design the system without it if possible.

--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat
--
David Caro
dcaro@redhat.com

Well, the plugin that Barak passed before, the one that forces a slave to be used only once, is what the OpenStack guys use, in combination with Nodepool, a big Python service to provision/manage slaves. That's what I wanted to avoid: that extra service (as a Jenkins job or not) to exclusively handle slaves, when we already have the oVirt pool stuff.
That sounds like we need to review the plugins available; I believe there should be something. At least we cannot do that without any orchestrator, and since we do not want to introduce a new service, the Jenkins master is the ideal place for it. If there is nothing available we should write it; that would technically be the best solution.

--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat

On 01/14 12:03, Anton Marchukov wrote:
Well, the plugin that Barak passed before, the one that forces a slave to be used only once, is what the OpenStack guys use, in combination with Nodepool, a big Python service to provision/manage slaves. That's what I wanted to avoid: that extra service (as a Jenkins job or not) to exclusively handle slaves, when we already have the oVirt pool stuff.
That sounds like we need to review the plugins available; I believe there should be something. At least we cannot do that without any orchestrator, and since we do not want to introduce a new service, the Jenkins master is the ideal place for it. If there is nothing available we should write it; that would technically be the best solution.
The fact that all that's needed is to reboot, and that it's Jenkins that knows when a job has finished running and already orchestrates where a job runs, makes it IMO the ideal place to put that logic. So maybe yes, writing a small plugin that does it might be an option.
--
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat
--
David Caro
dcaro@redhat.com