I was under the impression you were thinking about a wrapper job that
needs to wrap every job. This is a single, out-of-band job, so
it may not be that bad.
You seem to imply that slaves managed by the Swarm plugin are not
'normal' ssh-based slaves, so there might be something there we can
exploit (for example, perhaps the swarm client JAR can be made to exit
once the slave is brought offline, so we can wrap it in a script that
will shut the slave down when it does).
I will look deeper into this in my POC.

Isn't there any ability to hook into the shutdown process and delay it from the hook itself? There are VDSM hooks for that, but I am not sure how the pool scheduler interacts with them. Maybe we can ask on the users list. As I see it, the ideal is to catch the shutdown, then run a hook that puts the slave into maintenance, waits for the job to finish, and then unblocks the shutdown.
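
To make the idea concrete, here is a rough Python sketch of what such a hook could do. It is only an illustration: take_offline and is_idle are hypothetical callables that would have to be implemented against Jenkins, and I am assuming (without having verified it) that the hook is allowed to block the shutdown until it returns.

```python
import time
from typing import Callable


def drain_before_shutdown(
    take_offline: Callable[[], None],
    is_idle: Callable[[], bool],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Put the slave into maintenance and wait for running jobs to finish.

    take_offline and is_idle are hypothetical helpers supplied by the
    caller; nothing here talks to Jenkins or VDSM directly.
    Returns True if the slave became idle, False if the timeout expired.
    """
    # Stop new jobs from being scheduled on this slave.
    take_offline()

    waited = 0.0
    while not is_idle():
        if waited >= timeout_seconds:
            # Give up so the shutdown is not blocked forever.
            return False
        sleep(poll_seconds)
        waited += poll_seconds

    # No job is running any more, so the shutdown can proceed.
    return True
```

The hook would let the shutdown continue once this returns True. What to do on a timeout (abort the shutdown or force it) is a policy decision I have left open.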

I had the same problem when I was thinking about how to get migration back for local-disk slaves so that auto-balancing can be used for them. The only troublesome part was interacting with userland to find out whether it is safe. Sounds like a feature request?

Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat