Initializing Amazon EC2 Instances with CloudInit

The last several blog posts have been about setting up and using Amazon EC2 instances. One thing you may have noticed if you’ve been following along is that newly-launched instances are pretty bare-bones; they have one user defined, and not much more software installed than the absolute minimum. It’s unlikely they’re going to have the software you need installed out of the box.

Sure, you can set everything up by hand, but that runs counter to one of the strong points of running cloud instances – being able to launch new instances rapidly, to scale up capacity as fast as possible. There’s also the problem of repeatability; the more complex the process, the more steps involved in setting up a new instance, the more likely you are to leave out a step.

The Linux images from both Amazon and Canonical include a package called CloudInit which helps with both of these problems.

Hey, you, get onto my cloud

CloudInit runs as a process at boot time. You pass data to it when launching the instance, via either the -d user_data or -f user_data_file command-line options to ec2-run-instances ¹. CloudInit interprets the data, handles it appropriately, then exits.

For the sake of repeatability, passing data directly on the command line seems like a bad idea. (Not to mention the difficulty of passing a lot of data on the command line.) In these examples, I’m going to stick to files and the -f option.

CloudInit can handle user data in a number of different formats. Let’s look at some of them.

User data scripts

The simplest and most straightforward format is the user data script. User data scripts are identified by a first line starting with #! ². The user data is written to a file on the newly-created instance, which is then executed as root by the Unix program loader “very late in the boot sequence” (according to the documentation). In essence, a user data script can look like nearly any normal Unix script.

One thing you need to remember is that anything you try to execute in the script must be available in the instance before you try to execute it. This usually means you either need to be stingy with what you try to use, or install packages before using them.
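Here’s a minimal sketch of the “install before using” approach. The helper name ensure_installed and the INSTALLER variable are made up for illustration, and the default assumes a yum-based image; swap in apt-get or whatever your image uses.

```shell
#!/bin/sh
# Hypothetical helper: install a package only if the command it provides
# is not already on the PATH.
# Assumption: a yum-based image; set INSTALLER to override.
INSTALLER=${INSTALLER:-"yum -y install"}

ensure_installed() {
    cmd=$1
    pkg=${2:-$1}          # package name defaults to the command name
    if command -v "$cmd" >/dev/null 2>&1; then
        return 0          # already present, nothing to do
    fi
    # Left unquoted on purpose so "yum -y install" splits into words.
    $INSTALLER "$pkg"
}
```

In a user data script, you’d call something like ensure_installed git before the first line that runs git. User data scripts run as root, so no sudo is needed.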

As a simple example, here’s a user data script named user-script.txt that will write to a file named user-script-output.txt ³ in the home directory of the ec2-user user.

  #!/bin/sh
  echo 'Woot!' > /home/ec2-user/user-script-output.txt

And here it is in action.

  $ ec2-run-instances --group default --key $EC2_KEYPAIR_NAME -f user-script.txt ami-d59d6bbc
  $ ec2-describe-instances
  RESERVATION   r-92c5b7ff      331055354537    default
  INSTANCE      i-badd25d5      ami-d59d6bbc    ec2-50-17-64-120.compute-1.amazonaws.com  \
                domU-12-31-39-15-64-0E.compute-1.internal   running ec2-keypair     0     \
                m1.small  2011-03-20T04:21:29+0000  us-east-1b  aki-407d9529              \
                monitoring-disabled     50.17.64.120        10.207.103.252                \
                instance-store          paravirtual xen
  $ ssh -i $EC2_KEYPAIR ec2-user@50.17.64.120

  [ec2-user@domU-12-31-39-15-64-0E ~]$ pwd
  /home/ec2-user
  [ec2-user@domU-12-31-39-15-64-0E ~]$ ls -l
  total 4
  -rw-r--r-- 1 root root 6 Mar 20 04:22 user-script-output.txt
  [ec2-user@domU-12-31-39-15-64-0E ~]$ cat user-script-output.txt
  Woot!

Include files

At The Day Job, we have several different configurations of servers, but each is based on different combinations of the same packages. For instance, they all need Java; most but not all need Tomcat; some need Subversion; and so on. Rather than maintain a number of monolithic user data scripts, one for each configuration, it makes more sense to have a script for each component, then combine them as needed.

To the best of my knowledge, you can’t pass more than one instance of -f on the command line when starting an instance. Fortunately, the next best thing is available: you can pass an include file, which contains references to other user data files.

Include files are identified by a first line starting with #include. Each subsequent line contains a URL; CloudInit fetches each URL and processes what it gets back as user data. The documentation is silent on whether the URLs are handled consecutively or concurrently, or whether they’re guaranteed to be handled in order. In other words, while you can probably assume that later included user data can depend on previously included user data, you should test and confirm for yourself that this is the case.

Note that the include file contains URLs rather than filenames. This is a mixed blessing; you need to put the included files somewhere the launching instance can load them from (which is likely not your workstation or laptop), but then they’re in a central location, accessible to all of your EC2 instances as needed. As an added benefit, if you store them on S3 in the same region as your instances, the data transfer to read them shouldn’t incur additional charges.

For this example, we’ll create three user data scripts, upload them to the Web ⁴, then create an include file which references all three. We’ll do a simple test of whether the includes are processed consecutively or concurrently as well.

The first script is named script-1.sh.

  #!/bin/sh
  HOME=/home/ec2-user
  sleep 20
  echo 'running' > $HOME/script-1-output.txt
  if [ -f $HOME/script-2-output.txt ]; then
    echo 'script-2 has run' >> $HOME/script-1-output.txt
  else
    echo 'script-2 has not run' >> $HOME/script-1-output.txt
  fi
  if [ -f $HOME/script-3-output.txt ]; then
    echo 'script-3 has run' >> $HOME/script-1-output.txt
  else
    echo 'script-3 has not run' >> $HOME/script-1-output.txt
  fi

The second script is named, naturally, script-2.sh.

  #!/bin/sh
  HOME=/home/ec2-user
  sleep 10
  echo 'running' > $HOME/script-2-output.txt
  if [ -f $HOME/script-1-output.txt ]; then
    echo 'script-1 has run' >> $HOME/script-2-output.txt
  else
    echo 'script-1 has not run' >> $HOME/script-2-output.txt
  fi
  if [ -f $HOME/script-3-output.txt ]; then
    echo 'script-3 has run' >> $HOME/script-2-output.txt
  else
    echo 'script-3 has not run' >> $HOME/script-2-output.txt
  fi

Working out what the third script is named is left as an exercise for the reader. :-)

  #!/bin/sh
  HOME=/home/ec2-user
  echo 'running' > $HOME/script-3-output.txt
  if [ -f $HOME/script-1-output.txt ]; then
    echo 'script-1 has run' >> $HOME/script-3-output.txt
  else
    echo 'script-1 has not run' >> $HOME/script-3-output.txt
  fi
  if [ -f $HOME/script-2-output.txt ]; then
    echo 'script-2 has run' >> $HOME/script-3-output.txt
  else
    echo 'script-2 has not run' >> $HOME/script-3-output.txt
  fi

Next, we need an include file which references all three of these user data scripts, called include.txt.

  #include
  http://craigcottingham.github.com/code/cloud-init/script-1.sh
  http://craigcottingham.github.com/code/cloud-init/script-2.sh
  http://craigcottingham.github.com/code/cloud-init/script-3.sh

And here it is in action.

  $ ec2-run-instances --group default --key $EC2_KEYPAIR_NAME -f include.txt ami-d59d6bbc
  $ ec2-describe-instances
  RESERVATION   r-b40d76d9      331055354537    default
  INSTANCE      i-5cb97933      ami-d59d6bbc    ec2-184-73-11-176.compute-1.amazonaws.com \
                ip-10-114-46-188.compute-1.internal         running ec2-keypair     0     \
                m1.small  2011-03-22T03:39:13+0000  us-east-1b  aki-407d9529              \
                monitoring-disabled     184.73.11.176       10.114.46.188                 \
                instance-store          paravirtual xen
  $ ssh -i $EC2_KEYPAIR ec2-user@184.73.11.176

  [ec2-user@ip-10-114-46-188 ~]$ ls -l
  total 12
  -rw-r--r-- 1 root root 50 Mar 22 03:41 script-1-output.txt
  -rw-r--r-- 1 root root 46 Mar 22 03:41 script-2-output.txt
  -rw-r--r-- 1 root root 42 Mar 22 03:41 script-3-output.txt
  [ec2-user@ip-10-114-46-188 ~]$ cat script-1-output.txt
  running
  script-2 has not run
  script-3 has not run
  [ec2-user@ip-10-114-46-188 ~]$ cat script-2-output.txt
  running
  script-1 has run
  script-3 has not run
  [ec2-user@ip-10-114-46-188 ~]$ cat script-3-output.txt
  running
  script-1 has run
  script-2 has run

So here we have empirical evidence that the included user data scripts are run consecutively rather than concurrently.
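Coming back to the one-script-per-component idea, the include file for a given server configuration could be generated rather than written by hand. This is only a sketch: make_include is a hypothetical helper, BASE_URL is a placeholder, and it assumes each component script is published as NAME.sh under that URL.

```shell
#!/bin/sh
# Hypothetical helper: print an include file that lists one URL per
# component name given on the command line.
# Assumption: each component lives at $BASE_URL/<name>.sh.
BASE_URL=${BASE_URL:-http://example.com/cloud-init}   # placeholder URL

make_include() {
    echo '#include'
    for name in "$@"; do
        echo "$BASE_URL/$name.sh"
    done
}
```

Something like make_include java tomcat > include-tomcat.txt would then produce a file you could pass with -f. List the components in the order you want them to run, since the test above suggests they’re processed in order.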

More advanced usage

There’s a fairly advanced-looking data format called cloud-config which, unfortunately, is not well documented beyond the “here’s some examples” level. A good part of what is documented seems to be specific to Ubuntu, so I’m not sure how useful it would be for the Amazon AMIs.

If you need code to execute earlier than “late in the boot process”, the cloud-boothook is available. It appears to function similarly to a user data script, with the additional restriction that it should be idempotent ⁵. I was unable to find an example in the documentation, so I don’t know anything more about this data format.
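I can’t show a real boothook, but the idempotent part is easy to illustrate in plain shell. In this sketch, add_line_once is a hypothetical helper that appends a line to a file only if that exact line isn’t already there, so running it twice leaves the file the same as running it once.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file unless the file already
# contains that exact line. Safe to run any number of times.
add_line_once() {
    line=$1
    file=$2
    # -x matches whole lines only, -F treats the line as a fixed string.
    if [ -f "$file" ] && grep -qxF -e "$line" "$file"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$file"
}
```

Compare that with a bare echo ... >> file, which adds another copy of the line every time the script runs.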

Similarly, you can define upstart ⁶ jobs directly with the upstart-job data format.

You can specify more than one data format in a single file by formatting it as a MIME multipart message. I can see this as preferable to the include file only if all of the data is already on your workstation or laptop, and you don’t want to put it in a location accessible via URL. For our purposes, this isn’t an advantage.
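For the curious, here’s a rough sketch of what assembling such a message by hand might look like. I haven’t run the result through CloudInit; make_multipart and the boundary string are made up, and it assumes every input file is a user data script, labeled with the text/x-shellscript content type.

```shell
#!/bin/sh
# Hypothetical helper: wrap each file named on the command line in its own
# MIME part and print the combined message on stdout.
# Assumption: every file is a user data script (text/x-shellscript).
BOUNDARY='==cloudinit-example-boundary=='   # arbitrary; must not appear in the files

make_multipart() {
    printf 'Content-Type: multipart/mixed; boundary="%s"\n' "$BOUNDARY"
    printf 'MIME-Version: 1.0\n\n'
    for f in "$@"; do
        printf '%s\n' "--$BOUNDARY"
        printf 'Content-Type: text/x-shellscript; charset="us-ascii"\n'
        printf 'Content-Disposition: attachment; filename="%s"\n\n' "$(basename "$f")"
        cat "$f"
        printf '\n'
    done
    printf '%s\n' "--$BOUNDARY--"           # closing boundary
}
```

Redirecting the output to a file, as in make_multipart script-1.sh script-2.sh > multipart.txt, would give you something to try with -f.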

Finally, if CloudInit determines that the user data is gzip compressed, it will decompress it first before processing.

Caveats

User data is limited to 16 KB, whether it’s passed directly on the command line or in a user data file. Gzip compression can help with that, but there’s still a hard limit that you may run into. I’ll talk about a possible solution in an upcoming post.
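In the meantime, a quick check before launching can save a failed start. This sketch gzips a user data file and complains if the result is still over the limit; pack_user_data is a hypothetical helper, and I’m assuming the limit applies to the compressed size.

```shell
#!/bin/sh
# Hypothetical helper: gzip a user data file next to the original and
# print the name of the compressed file, or fail if it is still too big.
LIMIT=16384   # 16 KB, assumed to apply to the compressed data

pack_user_data() {
    src=$1
    out=$src.gz
    gzip -9 -c "$src" > "$out" || return 1
    # Arithmetic expansion strips any padding that wc adds to the count.
    size=$(($(wc -c < "$out")))
    if [ "$size" -gt "$LIMIT" ]; then
        echo "$out is $size bytes, over the $LIMIT byte limit" >&2
        return 1
    fi
    echo "$out"
}
```

The file name it prints is what you’d hand to -f in place of the uncompressed original.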

¹ Or perhaps both. I haven’t seen examples of using both, and best as I can tell there’s nothing in the documentation that says you can’t use both in a single invocation. That’s probably worth trying.
² The shebang.
³ Okay, so I could be more creative in naming my files.
⁴ I’m not using S3 for this example just because I like keeping the example files for this blog with the rest of the files for the blog.
⁵ “repeated applications have the same effect as one”.
⁶ A replacement for sysvinit.