CloudTask

Parallel batch execution with auto-launched cloud servers

Author: Liraz Siri <liraz@turnkeylinux.org>
Date: 2012-12-19
Manual section: 8
Manual group: misc

SYNOPSIS

cat jobs | cloudtask [ -opts ] [ command ]

cloudtask [ -opts ] --resume=SESSION_ID

cloudtask [ -opts ] --retry=SESSION_ID

DESCRIPTION

Remotely executes a batch of commands in parallel over SSH, either automatically allocating cloud servers as required or using a pre-configured list of servers.

Cloudtask reads job inputs from stdin, one job per line, each line containing one or more command line arguments. For each job, the arguments are appended to the configured command and the resulting command line is executed via SSH on a worker.
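
As a minimal illustration (a sketch, not cloudtask's own code), this is how each input line maps to the command line that ends up running on a worker:

command = "echo executed:"

# Sketch only: each non-empty line of the jobs file supplies arguments
# that are appended to the configured command.
with open("jobs") as fh:
    for line in fh:
        job_args = line.strip()
        if job_args:
            # cloudtask runs the resulting command line on a worker over SSH
            print("%s %s" % (command, job_args))

# -> echo executed: hello
# -> echo executed: world
# -> echo executed: hello world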

BACKGROUND

Many batch tasks can easily be broken down into units of work that can be executed in parallel. On cloud services such as Amazon EC2, running a single server for 100 hours costs the same as running 100 servers for 1 hour. For problems that can be parallelized, this makes it advantageous to distribute the batch across as many cloud servers as required to finish execution in just under an hour. Taking full advantage of these economics requires an easy-to-use, automatic system for reliably launching and destroying server instances on demand and distributing work amongst them.
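
A quick back-of-envelope calculation (hypothetical hourly rate, not part of cloudtask) makes the trade-off concrete:

# Assumed numbers: 100 jobs of 1 hour each at $0.06 per instance-hour.
hourly_rate = 0.06
jobs, hours_per_job = 100, 1.0

for split in (1, 10, 100):
    wall_clock = jobs * hours_per_job / split
    cost = split * wall_clock * hourly_rate
    print("split=%-3d  wall-clock=%5.1fh  cost=$%.2f" % (split, wall_clock, cost))

# The cost stays at $6.00 whether the batch takes 100 hours on 1 server
# or 1 hour on 100 servers.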

CloudTask solves this problem by automating the execution of a batch job on remote worker servers over SSH. The user can split up the batch so that its parts run in parallel amongst an arbitrary number of servers to speed up execution time. CloudTask can automatically allocate (and later destroy) EC2 cloud servers as required, or the user may provide a list of suitably configured "persistent" servers.

USAGE BASICS

$ cat > jobs << 'EOF'
hello
world
hello world
EOF

$ cat jobs | cloudtask echo executed:
About to launch 1 cloud server to execute 3 jobs (hello .. hello world):

  Parameter       Value
  ---------       -----

  split           1
  command         echo executed:
  hub-apikey      CQHKTKHN7N2ZR6A
  ec2-region      us-east-1
  ec2-size        m1.small
  ec2-type        s3
  user            liraz
  timeout         3600

Is this really what you want? [yes/no] yes
2012-12-19 23:44:20 :: session 1 (pid 11438)

# booting instance i-2ef94c5b ...
127.124.183.115 (11438): launched worker i-2ef94c5b
127.124.183.115 (11438): echo executed: hello
executed: hello
Connection to 127.124.183.115 closed.
127.124.183.115 (11438): exit 0 # echo executed: hello
127.124.183.115 (11438): echo executed: world
executed: world
Connection to 127.124.183.115 closed.
127.124.183.115 (11438): exit 0 # echo executed: world
127.124.183.115 (11438): echo executed: hello world
executed: hello world
Connection to 127.124.183.115 closed.
127.124.183.115 (11438): exit 0 # echo executed: hello world
127.124.183.115 (11438): destroyed worker i-2ef94c5b

2012-12-19 23:44:24 :: session 1 (11 seconds): 0/3 !OK - 0 pending, 0 timeouts, 0 errors, 3 OK

OPTIONS

--resume=SESSION_ID
 Instead of launching a new session, resume the pending jobs of a previous session
--retry=SESSION_ID
 Instead of launching a new session, retry the failed jobs of a previous session
--force
 Don't ask for confirmation
--ssh-identity=PATH
 SSH identity keyfile to use (defaults to ~/.ssh/identity)
--hub-apikey=APIKEY
 Hub API KEY (required if launching workers)
--backup-id=BACKUP-ID
 TurnKey Backup ID to restore on launch
--snapshot-id=SNAPSHOT-ID
 Launch instance from a snapshot ID
--ami-id=AMI-ID
 Force launch a specific AMI ID (default is the latest core)
--ec2-region=REGION
 Region for instance launch (default: us-east-1)

 Regions:

   us-east-1 (Virginia, USA)
   us-west-1 (California, USA)
   us-west-2 (Oregon, USA)
   eu-west-1 (Ireland, Europe)
   ap-southeast-1 (Singapore, South-East Asia)
   ap-northeast-1 (Tokyo, North-East Asia)
   sa-east-1 (Sao Paulo, South America)
--ec2-size=SIZE
 Instance launch size (default: m1.small)

 Sizes:

   t1.micro (1 CPU core, 613M RAM, no tmp storage)
   m1.small (1 CPU core, 1.7G RAM, 160G tmp storage)
   c1.medium (2 CPU cores, 1.7G RAM, 350G tmp storage)
--ec2-type=TYPE
 Instance launch type <s3|ebs> (default: s3)
--sessions=PATH
 Path where sessions are stored (default: $HOME/.cloudtask)
--timeout=SECONDS
 How many seconds to wait before giving up (default: 3600)
--retries=NUM
 How many times to retry a failed job (default: 0)
--strikes=NUM
 How many consecutive failures before the worker is retired
--user=USERNAME
 Username to execute commands as (default: root)
--pre=COMMAND
 Worker setup command
--post=COMMAND
 Worker cleanup command
--overlay=PATH
 Path to worker filesystem overlay
--split=NUM
 Number of workers to execute jobs in parallel
--workers=ADDRESSES
 List of pre-allocated workers to use

   path/to/file | host-1 ... host-N
--report=HOOK
 Task reporting hook, examples:

   sh: command || py: file || py: code

   mail: from@foo.com to@bar.com

EXAMPLE USAGE SCENARIO

Alon wants to refresh all TurnKey Linux appliances with the latest security updates.

He writes a script which accepts the name of an appliance as an argument, downloads the latest version from Sourceforge, extracts the root filesystem, installs the security updates, repackages the root filesystem into an appliance ISO and uploads a new version of the appliance back to Sourceforge.

After testing the script on his local Ubuntu workstation, he asks the Hub to launch a new TurnKey Core instance (88.1.2.3), transfers his script and installs whatever dependencies are required. Once everything works, he creates a new TKLBAM backup that captures the state of his master worker server.

Alon runs his first cloudtask test:

echo core | cloudtask --workers=88.1.2.3 refresh-iso-security-updates

Once he confirms that this single test job worked correctly, he's ready for the big batch job that will run on 10 servers in parallel.

Since this is a routine task Alon expects to repeat regularly, he creates a pre-configured cloudtask template for it in $HOME/cloudtasks:

$ mkdir $HOME/cloudtasks
$ cd $HOME/cloudtasks

$ cat > refresh-iso << 'EOF'
#!/usr/bin/env python
from cloudtask import Task

class RefreshISO(Task):
    DESCRIPTION = "This task refreshes security updates on an ISO"
    BACKUP_ID = 123
    COMMAND = 'refresh-iso-security-updates'
    SPLIT = 10
    REPORT = 'mail: cloudtask@example.com alon@example.com liraz@example.com'

    HUB_APIKEY = 'BRDUKK3WDXY3CFQ'

RefreshISO.main()

EOF

$ chmod +x ./refresh-iso

$ cat $PATH_LIST_APPLIANCES | ./refresh-iso
About to launch 10 cloud servers to execute 101 jobs (appengine-go .. zurmo):

  Parameter       Value
  ---------       -----

  split           10
  command         refresh-iso-security-updates
  hub-apikey      BRDUKK3WDXY3CFQ
  ec2-region      us-east-1
  ec2-size        m1.small
  ec2-type        s3
  user            liraz
  backup-id       123
  timeout         3600
  report          mail: cloudtask@example.com alon@example.com liraz@example.com

Is this really what you want? [yes/no] yes
2012-12-19 23:57:25 :: session 3 (pid 13845)

# booting instance i-0c7acff6 ...
# booting instance i-9e8bec5e ...
127.150.56.219 (13859): launched worker i-0c7acff6
127.49.232.160 (13860): launched worker i-9e8bec5e
# booting instance i-49528c78 ...

...

45 minutes later, Alon receives an e-mail from cloudtask reporting that the job has finished. The body contains the session log, detailing whether errors were detected on any job (e.g., a non-zero exit code), how long the session took to run, and so on.

Had he wanted to, Alon could have followed the execution of the task jobs in real-time by tailing the worker log files:

tail -f ~/.cloudtask/11/workers/29721

GETTING STARTED

Since launching and destroying cloud servers can take a few minutes, the easiest way to get started and explore cloudtask is to experiment with a local ssh server:

# you need root privileges to install SSH
apt-get install openssh-server
/etc/init.d/ssh start

Add your user's SSH key to root's authorized keys:

ssh-copy-id root@localhost

Then run test tasks with the --workers=localhost option, like this:

seq 10 | cloudtask --workers=localhost echo

TASK CONFIGURATION

Any cloudtask configuration option that can be configured from the command line may also be configured through a template default, or by defining an environment variable.

Resolution order for options:

  1. command line (highest precedence)
  2. task-level default
  3. CLOUDTASK_{PARAM_NAME} environment variable (lowest precedence)
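
The precedence rule can be sketched in a few lines of Python (resolve_option and its arguments are hypothetical names used for illustration, not part of cloudtask's API):

import os

def resolve_option(name, cli_value=None, task_default=None):
    if cli_value is not None:            # 1. the command line wins
        return cli_value
    if task_default is not None:         # 2. then the task-level default
        return task_default
    env_var = "CLOUDTASK_" + name.upper().replace("-", "_")
    return os.environ.get(env_var)       # 3. then the environment variable

# e.g. resolve_option("ec2-region") falls back to CLOUDTASK_EC2_REGION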

For example, if you want to configure the ec2 region worker instances are launched in, you can configure it as:

  1. The --ec2-region command line option:

    $ cloudtask --ec2-region ap-southeast-1
    
  2. By defining EC2_REGION in a task template:

    $ cat > foo.py << 'EOF'
    #!/usr/bin/env python
    from cloudtask import Task

    class Foo(Task):
        EC2_REGION = 'ap-southeast-1'

    Foo.main()
    EOF
    
    $ chmod +x ./foo.py
    $ ./foo.py
    
  3. By setting the CLOUDTASK_EC2_REGION environment variable:

    export CLOUDTASK_EC2_REGION=ap-southeast-1
    

BEST PRACTICES FOR PRODUCTION USE

For production use, it is recommended to create pre-configured task templates for routine jobs in a Git repository. Task templates may inherit shared definitions such as the Hub APIKEY or the reporting hook from a common module:

$ cat > common.py << 'EOF'
from cloudtask import Task
class BaseTask(Task):
    HUB_APIKEY = 'BRDUKK3WDXY3CFQ'
    REPORT = 'mail: cloudtask@example.com alon@example.com liraz@example.com'

    # save sessions in the local directory rather than
    # $HOME/.cloudtask. That way we can easily track the session
    # logs in Git too.
    SESSIONS = 'sessions/'
EOF

$ cat > helloworld << 'EOF'
#!/usr/bin/python
from common import BaseTask
class HelloWorld(BaseTask):
    COMMAND = 'echo hello world'

HelloWorld.main()
EOF
$ chmod +x helloworld

SSH AUTHENTICATION

Cloudtask uses SSH to log into remote workers, transfer files and execute commands. This only works if authentication has been set up properly: the local SSH client must be able to authenticate to the remote worker with its SSH identity, which means your SSH identity must be added to the remote worker's authorized keys list. Password authentication is not supported.

The easiest and most reliable way to do this is to:

  1. Generate an SSH keypair:

    $ ssh-keygen -f cloudtask-keypair -N ''
    Generating public/private rsa key pair.
    Your identification has been saved in cloudtask-keypair.
    Your public key has been saved in cloudtask-keypair.pub.
    The key fingerprint is:
    c5:88:16:8e:78:a9:b9:b9:c1:c3:5d:87:e5:03:8d:3c liraz@backstage
    The key's randomart image is:
    +--[ RSA 2048]----+
    |      .          |
    |   . = = o       |
    |  . + E + o      |
    |   + . * .       |
    |  o   o S        |
    | o + . . .       |
    |  B .            |
    |   +             |
    |  .              |
    +-----------------+
    
  2. Log into your Hub account, go to the User Profile page and cut and paste the contents of cloudtask-keypair.pub to the Authorized Keys textbox. This ensures that cloudtask-keypair will be added to the list of authorized keys on newly launched instances.

If you are running Cloudtask on a remote machine and don't want to leave your authorized key on it, there is a somewhat safer, though less reliable alternative. You can keep your authorized key on your local machine and forward your SSH agent to the remote machine running Cloudtask:

ssh -A my-cloudtask-controller

Then when you run Cloudtask, as soon as it launches a new instance it will use your forwarded identity to add a temporary key to that instance's list of authorized keys. This allows Cloudtask to continue accessing the worker even after you log out and cut off access to the forwarded SSH agent.

You'll need to make sure you stay logged on with the forwarded SSH agent until the last worker launches and authorizes the temporary identity.

HOW IT WORKS

When the user executes a task, the following steps are performed:

  1. A temporary SSH session key is created.

    The initial authentication to workers assumes you have set up an SSH agent or equivalent (cloudtask does not support password authentication).

    The temporary session key will be added to the worker's authorized keys for the duration of the task run, and then removed. We need to authorize a temporary session key to ensure access to the workers without relying on the SSH agent.
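
    The mechanics can be sketched roughly like this (a simplified illustration assuming agent-backed root SSH access to the worker; not cloudtask's actual code):

      import subprocess

      def make_session_key(path):
          # generate an unencrypted throwaway keypair for this session
          subprocess.check_call(["ssh-keygen", "-q", "-f", path, "-N", ""])
          return open(path + ".pub").read().strip()

      def authorize_session_key(worker, pubkey):
          # append the session public key to the worker's authorized keys,
          # authenticating this first connection via the SSH agent
          subprocess.check_call(
              ["ssh", "root@" + worker,
               "mkdir -p ~/.ssh && echo '%s' >> ~/.ssh/authorized_keys" % pubkey])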

  2. Workers are allocated.

    Worker cloud servers are launched automatically by cloudtask to satisfy the requested split unless enough pre-allocated workers are provided via the --workers option.

    A TKLBAM backup id may be provided to install the required job execution dependencies (e.g., scripts, packages, etc.) on top of TurnKey Core.

  3. Worker setup.

    After workers are allocated they are set up. The temporary session key is added to the authorized keys, the overlay is applied to the root filesystem (if the user has configured an overlay) and the pre command is executed (if the user has configured a pre command).
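
    For instance, the overlay and pre command parts of the setup could look roughly like this (a sketch assuming rsync and root SSH access; not cloudtask's actual implementation):

      import subprocess

      def setup_worker(worker, overlay=None, pre=None):
          if overlay:
              # copy the overlay tree on top of the worker's root filesystem
              subprocess.check_call(["rsync", "-a",
                                     overlay.rstrip("/") + "/",
                                     "root@%s:/" % worker])
          if pre:
              # run the configured worker setup command (--pre)
              subprocess.check_call(["ssh", "root@" + worker, pre])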

  4. Job execution.

    CloudTask feeds the list of all jobs that make up the task into a job queue. Every remote worker has a local supervisor process which reads a job command from the queue and executes it over SSH on the worker.

    The job may time out before it has completed if a --timeout has been configured.

    While the job is executing, if it produces no console output the supervising process checks every 30 seconds that the worker is still alive. If a worker is no longer reachable, it is destroyed and the aborted job is put back into the job queue for execution by another worker.
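
    A stripped-down sketch of such a supervisor loop (illustrative only; run_job and worker_alive are hypothetical helpers and the real implementation differs):

      import subprocess, queue

      def run_job(worker, command, job_args, timeout):
          # execute one job on the worker over SSH, honoring --timeout
          return subprocess.call(["ssh", "root@" + worker,
                                  "%s %s" % (command, job_args)],
                                 timeout=timeout)

      def worker_alive(worker):
          # crude liveness probe: can we still run a trivial remote command?
          return subprocess.call(["ssh", "-o", "ConnectTimeout=10",
                                  "root@" + worker, "true"]) == 0

      def supervise(worker, jobs, command, timeout):
          while True:
              try:
                  job_args = jobs.get_nowait()
              except queue.Empty:
                  break                    # no jobs left: move on to cleanup
              try:
                  run_job(worker, command, job_args, timeout)
              except subprocess.TimeoutExpired:
                  continue                 # job timed out: move on
              if not worker_alive(worker):
                  jobs.put(job_args)       # worker died: requeue the job
                  break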

  5. Worker cleanup

    When no job commands are left in the input queue for a worker, the worker is cleaned up: the post command is run and the temporary session key is removed from its authorized keys.

    If cloudtask launched the worker, it will also destroy it at this point to halt incremental usage fees.

  6. Session reporting

    A reporting hook may be configured to perform an action once the session has finished executing. Three types of reporting hooks are supported:

    1. mail: uses /usr/sbin/sendmail to send a simple unencrypted e-mail containing the session log in the body.

    2. sh: executes a shell command, with the task configuration embedded in the environment and the current working directory set to the session path. You can test the execution context like this:

      --report='sh: env && pwd'
      
    3. py: executes a Python code snippet with the session values set as local variables. You can test the execution context like this:

      --report='py: import pprint; pprint.pprint(locals())'
      

SEE ALSO

cloudtask-faq (7), cloudtask-launch-workers (8), cloudtask-destroy-workers (8)