In this case we reused Condor, a well-known scheduler for Grid (and other) workloads. There is good precedent for putting cloud-like functionality into condor: direct EC2 support was added not too long ago.
We didn't want to use the existing condor EC2 infrastructure, however, since that ties you to a single cloud. Instead we decided to write new code that uses the Deltacloud API, which gives us nice cross-cloud functionality.
I'm pleased to say that as of January 7, the code to manage cloud instances via deltacloud is in upstream condor. That means that this code will be part of the condor 7.6 release. It is a bit complicated to use, which is why we hide the whole thing behind the conductor. However, it is possible to use the deltacloud support by hand, which I will describe below.
In order to take full advantage of condor's matching and scheduling capabilities, our current solution relies on having provider classads and instance classads. Each provider classad describes one combination of cloud provider, image, hardware profile and realm, along with the backend values needed to launch it. For instance, if you had two AMIs on EC2 that you wanted to be able to use, you would generate two separate classads with all of the data the same except for the AMI id. A provider classad looks something like:
    Name="provider_combination_1"
    MyType="Machine"
    Requirements=true

    # Stuff needed to match:
    hardwareprofile="large"
    image="fedora13"
    realm="1"

    # Backend info to complete this job:
    image_key="ami-ac3fc9c5"
    hardwareprofile_key="m1.large"
    realm_key="us-east-1a"
    provider_url="http://localhost:3002/api"
    username="ec2_username"
    password="ec2_secret_access_key"
    cloud_account_id="1"
    keypair="mykeypair"
(pruned for brevity and to protect the innocent). There are a few things to note here. First, the "stuff needed to match" category holds the generic things the job classad will match against. That is, the instance classad should not need to know any specifics of the cloud provider backend, just that it wants to run Fedora 13 on a large type instance. Once a provider that matches those requirements has been found, the values in the "backend info to complete this job" section are substituted into the actual instance classad, and then the job is submitted.
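To make the "one provider classad per AMI" idea concrete, here is a rough Python sketch that renders two provider ads from one set of shared values. This is only an illustration, not how the conductor does it: the helper names, the second ad's name, the "fedora14" label and the "ami-00000000" id are all made-up placeholders. I also assume the ad name and generic image label differ along with the AMI id.

```python
# Hypothetical sketch: render provider classads that share everything
# except the name, the generic image label, and the backend AMI id.

def format_value(value):
    """Render booleans bare (true/false) and everything else as a quoted string."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return '"{}"'.format(value)


def render_provider_ad(name, image, image_key, shared):
    """Return the text of one provider classad, one attribute per line."""
    attrs = {"Name": name, "MyType": "Machine", "Requirements": True}
    attrs.update(shared)
    attrs["image"] = image          # generic label the job matches against
    attrs["image_key"] = image_key  # backend-specific AMI id
    return "\n".join("{}={}".format(k, format_value(v)) for k, v in attrs.items())


# Values common to every ad (taken from the example provider ad).
SHARED = {
    "hardwareprofile": "large",
    "realm": "1",
    "hardwareprofile_key": "m1.large",
    "realm_key": "us-east-1a",
    "provider_url": "http://localhost:3002/api",
    "username": "ec2_username",
    "password": "ec2_secret_access_key",
    "cloud_account_id": "1",
    "keypair": "mykeypair",
}

IMAGES = [
    ("provider_combination_1", "fedora13", "ami-ac3fc9c5"),
    ("provider_combination_2", "fedora14", "ami-00000000"),  # placeholder values
]

ads = [render_provider_ad(name, image, key, SHARED) for name, image, key in IMAGES]
```

Each element of `ads` is the text of one classad; how those ads then get advertised to condor is outside the scope of this sketch.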
The jobs themselves then look something like:
    universe = grid
    executable = job_name
    notification = never
    requirements = hardwareprofile == "large" && image == "fedora13"
    grid_resource = deltacloud $$(provider_url)
    DeltacloudUsername = $$(username)
    DeltacloudPassword = $$(password)
    DeltacloudImageId = $$(image_key)
    DeltacloudHardwareProfile = $$(hardwareprofile_key)
    DeltacloudKeyname = $$(keypair)
    queue
There is a lot going on here, so let's break it down. The first three fields are necessary, but not that interesting; they tell condor to use the grid universe, give the job a name, and tell condor not to send notifications. The interesting things start happening on the "requirements" line. This is the line that specifies what has to match in order for this job to run. In this case, we are telling condor to look through the provider ads for one that has a hardwareprofile equal to "large" and an image equal to "fedora13". These values match what was put in our provider ad above, so condor selects that provider ad. Once it has selected that provider ad, it can fill in the rest of the $$ substitutions in the job ad and submit the job. For instance, our job ad has DeltacloudUsername = $$(username). Once our provider ad has been selected, that $$(username) gets replaced with "ec2_username". The same happens for the rest of the $$ substitutions.
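If it helps, the match-then-substitute flow can be modeled in a few lines of Python. This is a toy model only: condor's real matchmaking evaluates full classad expressions, and the function names, the regular expression and the second provider ad here are my own inventions for illustration.

```python
import re

# Two provider ads as plain dicts; the second is a made-up non-matching ad.
provider_ads = [
    {"Name": "provider_combination_2", "hardwareprofile": "small",
     "image": "fedora13", "image_key": "ami-00000000",
     "hardwareprofile_key": "m1.small",
     "provider_url": "http://localhost:3002/api",
     "username": "ec2_username", "password": "ec2_secret_access_key",
     "keypair": "mykeypair"},
    {"Name": "provider_combination_1", "hardwareprofile": "large",
     "image": "fedora13", "image_key": "ami-ac3fc9c5",
     "hardwareprofile_key": "m1.large",
     "provider_url": "http://localhost:3002/api",
     "username": "ec2_username", "password": "ec2_secret_access_key",
     "keypair": "mykeypair"},
]

# The $$ fields from the job, before substitution.
job = {
    "grid_resource": "deltacloud $$(provider_url)",
    "DeltacloudUsername": "$$(username)",
    "DeltacloudPassword": "$$(password)",
    "DeltacloudImageId": "$$(image_key)",
    "DeltacloudHardwareProfile": "$$(hardwareprofile_key)",
    "DeltacloudKeyname": "$$(keypair)",
}


def requirements(ad):
    """Stand-in for: hardwareprofile == "large" && image == "fedora13"."""
    return ad.get("hardwareprofile") == "large" and ad.get("image") == "fedora13"


def select_provider(ads, predicate):
    """Return the first provider ad satisfying the predicate, or None."""
    for ad in ads:
        if predicate(ad):
            return ad
    return None


# Matches $$(attribute_name) and captures the attribute name.
SUBST = re.compile(r"\$\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand(job_fields, ad):
    """Replace every $$(attr) in the job fields with the matched ad's value."""
    return {key: SUBST.sub(lambda m: str(ad[m.group(1)]), value)
            for key, value in job_fields.items()}


selected = select_provider(provider_ads, requirements)
expanded = expand(job, selected)
```

After this runs, `selected` is the "large" provider ad and `expanded["DeltacloudUsername"]` holds "ec2_username", mirroring the substitution described above.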
Once condor has made the match and filled in the values, it contacts the deltacloud core (at the URL from the grid_resource line) to launch the instance. If all has gone well to this point, then the instance will actually be launched in the cloud. The instance can also be controlled via condor; if you "condor_rm" the job, then the instance in the cloud will be killed off.