Initializing Amazon EC2 Instances with CloudInit
The last several blog posts have been about setting up and using Amazon EC2 instances. One thing you may have noticed if you’ve been following along is that newly-launched instances are pretty bare-bones; they have one user defined, and not much more software installed than the absolute minimum. It’s unlikely they’re going to have the software you need installed out of the box.
Sure, you can set everything up by hand, but that runs counter to one of the strong points of running cloud instances – being able to launch new instances rapidly, to scale up capacity as fast as possible. There’s also the problem of repeatability; the more complex the process, the more steps involved in setting up a new instance, the more likely you are to leave out a step.
The Linux images from both Amazon and Canonical include a package called CloudInit which helps with both of these problems.
Hey, you, get onto my cloud
CloudInit runs as a process at boot time. You pass data to it when launching the instance, via the -d user_data or -f user_data_file command-line options [1] to ec2-run-instances. CloudInit interprets the data, handles it appropriately, then exits.
For the sake of repeatability, passing data directly on the command line seems like a bad idea.
(Not to mention the difficulty of passing a lot of data on the command line.) In these examples,
I’m going to stick to files and the -f option.
CloudInit can handle user data in a number of different formats. Let’s look at some of them.
User data scripts
The simplest and most straightforward format is the user data script. User data scripts are identified
by a first line starting with
#! [2]. The user data is written to a file on the newly-created
instance, which is then executed as root by the Unix program loader “very late in the boot sequence”
(according to the documentation). In essence, a user data script can look like nearly any normal shell script.
One thing you need to remember is that anything you try to execute in the script must be available in the instance before you try to execute it. This usually means you either need to be stingy with what you try to use, or install packages before using them.
As a simple example, here’s a user data script named user-script.txt that will write to a file named user-script-output.txt [3] in the home directory of the default user.
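The original listing isn’t reproduced here, but a minimal sketch of what user-script.txt might look like is below. The exact contents, and the use of $HOME for the target directory, are assumptions on my part:

```shell
#!/bin/sh
# Minimal sketch of a user data script. The first line must start with #!
# so CloudInit recognizes it as a script. It runs as root, so $HOME here
# is an assumption; substitute the home directory of the user you care about.
echo "user data script ran at $(date)" > "$HOME/user-script-output.txt"
```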
And here it is in action.
At The Day Job, we have several different configurations of servers, but each is based on different combinations of the same packages. For instance, they all need Java; most but not all need Tomcat; some need Subversion; and so on. Rather than maintain a number of monolithic user data scripts, one for each configuration, it makes more sense to have a script for each component, then combine them as needed.
To the best of my knowledge, you can’t pass more than one instance of
-f on the command line
when starting an instance. Fortunately, the next best thing is available: you can pass an
include file, which contains references to other user data files.
Include files are identified by a first line starting with
#include. Subsequent lines contain
URLs which are read one at a time by CloudInit, loaded, and the data processed as user data. The
documentation is silent on whether the URLs are handled consecutively or concurrently, or whether
they’re guaranteed to be handled in order. In other words, while you probably can assume that
later included user data can depend on previously included user data, you should test and confirm
for yourself that this is the case.
Note that the include file contains URLs rather than filenames. This is a mixed blessing; you need to put the included files somewhere where the launching instance can load them (which is likely not your workstation or laptop), but then they’re in a central location, accessible to all of your EC2 instances as needed. As an added benefit, if you store them on S3, the bandwidth to read them isn’t metered and doesn’t incur additional charges.
For this example, we’ll create three user data scripts, upload them to the Web [4], then create an include file which references all three. We’ll do a simple test of whether the includes are processed consecutively or concurrently as well.
The first script is named
The second script is named, naturally,
Working out what the third script is named is left as an exercise for the reader. :-)
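The listings for the three scripts aren’t preserved above, but each one presumably does something like the following sketch (the message and the output path are assumptions), appending a timestamped line to a shared file so the order of execution can be read back afterwards:

```shell
#!/bin/sh
# Sketch of the first of the three scripts; the second and third would
# differ only in the message they append. Appending (>>) rather than
# overwriting lets us see afterwards which script ran when, in what order.
echo "first script ran at $(date '+%H:%M:%S')" >> "$HOME/include-output.txt"
```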
Next, we need an include file which references all three of these user data scripts.
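An include file along those lines would look something like this — the URLs here are placeholders, standing in for wherever the three scripts were actually uploaded:

```
#include
http://www.example.com/ec2/first-script.txt
http://www.example.com/ec2/second-script.txt
http://www.example.com/ec2/third-script.txt
```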
And here it is in action.
So here we have empirical evidence that the included user data scripts are run consecutively rather than concurrently.
More advanced usage
There’s a fairly advanced-looking data format called
cloud-config which, unfortunately, is
not well documented beyond the “here’s some examples” level. A good part of what is
documented seems to be specific to Ubuntu, so I’m not sure how useful it would be for the Amazon images.
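From the examples that do exist, a cloud-config payload is YAML introduced by a #cloud-config first line. Something like the following sketch would install packages at boot; the package names are illustrative (chosen to match the Java/Tomcat/Subversion mix mentioned earlier), not taken from a tested configuration:

```
#cloud-config
packages:
  - openjdk-6-jdk
  - tomcat6
  - subversion
```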
If you need code to execute earlier than “late in the boot process”, the boothook data format is
available. It appears to function similarly to a user data script, with the additional
restriction that it should be idempotent [5]. I was unable to find an example in the
documentation, so I don’t know anything more about this data format.
Similarly, you can define upstart [6] jobs directly with the upstart-job data format.
You can specify more than one data format in a single file by formatting it as a mime-multipart message. I can see this as preferable to the include file only if all of the data is already on your workstation or laptop, and you don’t want to put it in a location accessible via URL. For our purposes, this isn’t an advantage.
Finally, if CloudInit determines that the user data is gzip compressed, it will decompress it before processing.
CloudInit can only process 16K of data, whether it’s passed directly on the command line or in a user data file. Gzip compression can help with that, but there’s still a hard limit that you may run into. I’ll talk about a possible solution in an upcoming post.
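Compressing is just a matter of gzipping the user data file before launch and passing the compressed version with -f. A quick sketch (the file contents here are a trivial stand-in):

```shell
# Sketch: gzip a user data file to stretch the 16K limit. CloudInit detects
# the gzip header and decompresses the data before processing it.
printf '#!/bin/sh\necho hello from user data\n' > user-data.txt
gzip -c user-data.txt > user-data.txt.gz
# The compressed file is what you would pass to -f when launching.
ls -l user-data.txt user-data.txt.gz
```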
[1] Or perhaps both. I haven’t seen examples of using both, and as best I can tell there’s nothing in the documentation that says you can’t use both in a single invocation. That’s probably worth trying.
[3] Okay, so I could be more creative in naming my files.
[4] I’m not using S3 for this example just because I like keeping the example files for this blog with the rest of the files for the blog.
[6] A replacement for the traditional sysvinit boot daemon, used by Ubuntu.