Batch Jobs#

Run your jobs on the cloud in parallel

Coiled Batch jobs are a lightweight API that makes it easy to run your code on any cloud hardware and scale out parallel workflows. This is useful when you want to run:

  1. In a region close to your cloud data

  2. On specific hardware (e.g. GPU, bigger machine)

  3. On many machines in parallel

  4. Any code (doesn’t need to be Python)

  5. With a lightweight, easy-to-use interface for the cloud

To run a batch job, add special #COILED comments to your script to specify the cloud resources you want:

# my_script.sh

#COILED memory 32GB
#COILED container ubuntu:latest
#COILED ntasks 10

echo Hello from $COILED_BATCH_TASK_ID

and then launch your script with coiled batch run:

$ coiled batch run my_script.sh

Under the hood Coiled will:

  • Inspect your script

  • Spin up appropriate machines as defined by #COILED comments

  • Download your software onto them

  • Run your script

  • Shut down the machines

Your script can do anything and you can run it on any cloud hardware. Coiled batch jobs are simple, but also powerful. Let’s explore a few more examples.

Examples#

Here is the example from above again, for completeness:

#COILED memory 32GB
#COILED container ubuntu:latest
#COILED ntasks 10

echo Hello from $COILED_BATCH_TASK_ID

Here, ten cloud VMs with 32 GB of memory are spun up and each runs its own echo command. COILED_BATCH_TASK_ID on each machine is a unique identifier, which runs from “0”, “1”, “2”, …, “9” in this case.

coiled batch run also works directly with Python scripts:

#COILED n-tasks     10
#COILED memory      8 GiB
#COILED region      us-east-2

import os

print(f"Hello from {os.environ['COILED_BATCH_TASK_ID']}")

Here we also drop the container directive, and rely on Coiled’s environment synchronization to copy all of our Python libraries to the remote machines automatically.

A common pattern is to list many files, get the i-th file, and run some command on that file.

#COILED n-tasks     1000
#COILED memory      8 GiB
#COILED region      us-east-2

import os
import s3fs

s3 = s3fs.S3FileSystem()

# Get the i'th file to process
task_id = int(os.environ["COILED_BATCH_TASK_ID"])
filenames = s3.ls("my-bucket")
filename = filenames[task_id]

# Process that file
def process(filename):
    result = ...
    return result

result = process(filename)

# Store result
s3.put(...)  # put result back in S3 somewhere
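
For concreteness, here is one way the elided pieces above might be filled in (using the same #COILED directives). This is only a minimal sketch, assuming a hypothetical bucket of text files and a simple line-counting workload; the bucket name and output prefix are illustrative placeholders, not part of the Coiled docs.

import os
import s3fs

s3 = s3fs.S3FileSystem()

# Get the i'th file to process (task IDs run from 0 to 999 here)
task_id = int(os.environ["COILED_BATCH_TASK_ID"])
filenames = s3.ls("my-bucket")                  # hypothetical input bucket
filename = filenames[task_id]

# Process that file: count its lines, as a stand-in for real work
def process(filename):
    with s3.open(filename, "rb") as f:
        return sum(1 for _ in f)

result = process(filename)

# Store the result back in S3, one small output object per task
with s3.open(f"my-bucket/results/{task_id}.txt", "w") as f:
    f.write(str(result))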

We can ask for any hardware on any cloud, including a collection of GPU machines.

#COILED n-tasks     100
#COILED vm-type     g5.xlarge
#COILED region      us-east-2

import torch

# Load model
model = ...
model = model.to("cuda")

# Train model
for epoch in range(50):
    model.train()
    ...
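
To make that concrete, here is a minimal, self-contained sketch of what the elided model and training loop might look like. The tiny model, random data, and hyperparameters are illustrative assumptions rather than anything from the Coiled docs; any PyTorch training script would run the same way.

import os
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
task_id = int(os.environ.get("COILED_BATCH_TASK_ID", 0))

# Hypothetical: each task trains on its own random data (e.g. one
# hyperparameter setting or data shard per task)
torch.manual_seed(task_id)
X = torch.randn(1024, 32, device=device)
y = torch.randn(1024, 1, device=device)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Train model
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(f"Task {task_id} final loss: {loss.item():.4f}")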

Configuration#

Coiled Batch options for specifying cloud hardware, software, number of tasks, etc., can either be added as #COILED comments in your script:

# my_script.py

#COILED memory 64GB
#COILED region us-west-2

...

or passed as command line arguments directly to coiled batch run:

$ coiled batch run --memory 64GB --region us-west-2 my_script.py

See the API docs for the full list of available options.

Monitoring and Logs#

Batch tasks run remotely on cloud VMs in the background. To monitor the status of a batch job use the coiled batch status command, which shows information about task progress, start / stop times, cloud costs, and more.

$ coiled batch status
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ID           State   Tasks Done            Submitted             Finished  Approx Cloud Cost  Command             ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 659567       done      10 /  10  2024-11-18 17:25:08  2024-11-18 17:26:07              $0.01  process.py          │
│ 658315       done      10 /  10  2024-11-17 16:24:27  2024-11-17 16:25:20              $0.01  train.sh            │
│ 658292       done      20 /  20  2024-11-17 15:28:44  2024-11-17 15:29:39              $0.02  train.sh            │
│ 658288      pending     0 /  20  2024-11-17 15:24:32                                   $0.02  train.sh            │
│ 657615      pending     0 /  10  2024-11-16 20:32:46                                   $0.01  process.py          │
│ 656571       done      10 /  10  2024-11-15 14:32:35  2024-11-15 14:33:35              $0.01  process.py          │
│ 655848       done      20 /  20  2024-11-14 23:56:06  2024-11-14 23:58:01              $0.01  echo 'Hello, world' │
└────────────┴─────────┴────────────┴─────────────────────┴─────────────────────┴───────────────────┴─────────────────────┘
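
If you prefer machine-readable output (for scripting, say), coiled batch status also accepts --format json (listed in the API section below), which prints job information as JSON instead of a table:

$ coiled batch status --format json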

And while you can always look at jobs in the Coiled UI, we also provide a convenient coiled logs command for retrieving the logs for a given job ID.

$ coiled logs 661309 | grep Hello
2024-11-19 15:59:46.131000 Hello from 0
2024-11-19 15:59:46.419000 Hello from 1
2024-11-19 15:59:46.743000 Hello from 2
2024-11-19 15:59:47.019000 Hello from 3
2024-11-19 15:59:47.324000 Hello from 4
2024-11-19 15:59:47.587000 Hello from 5
2024-11-19 15:59:47.862000 Hello from 6
2024-11-19 15:59:48.256000 Hello from 7
2024-11-19 15:59:48.315000 Hello from 8
2024-11-19 15:59:48.508000 Hello from 9

Coordination#

Each task within a batch job has the following environment variables automatically set:

  • COILED_BATCH_TASK_ID: ID for the currently running task. For example, “0”, “1”, “2”, etc.

  • COILED_BATCH_LOCAL_ADDRESS: IP address for the currently running task.

  • COILED_BATCH_SCHEDULER_ADDRESS: IP address for the head node.

  • COILED_BATCH_PROCESS_TYPE: Either “scheduler” if running on the head node or “worker” otherwise.

  • COILED_BATCH_READY_WORKERS: Comma delimited list of IP addresses for VMs that are ready for work.

These are often used for handling coordination among tasks. Depending on your specific workload, you may or may not need to use these.
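
As a minimal sketch (not from the Coiled docs) of how a script might branch on these variables, here the head node acts as a coordinator and every other task points at it:

import os

task_id = os.environ["COILED_BATCH_TASK_ID"]
role = os.environ["COILED_BATCH_PROCESS_TYPE"]        # "scheduler" or "worker"
head = os.environ["COILED_BATCH_SCHEDULER_ADDRESS"]   # IP address of the head node

if role == "scheduler":
    # Hypothetical: the head node could host a queue, database, or rendezvous
    # service here for the other tasks to connect to
    print(f"Task {task_id}: acting as head node at {head}")
else:
    # Hypothetical: workers reach the head node over the network, e.g. to
    # register themselves or fetch work
    print(f"Task {task_id}: coordinating through the head node at {head}")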

API#

coiled batch run#

Submit a batch job to run on Coiled.

Batch Jobs are currently an experimental feature.

coiled batch run [OPTIONS] [COMMAND]...

Options

--name <name>#

Name to use for Coiled cluster.

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--software <software>#

Existing Coiled software environment (Coiled will sync local Python software environment if neither software nor container is specified).

--container <container>#

Docker container in which to run the batch job tasks; this does not need to have Dask (or even Python), only what your task needs in order to run.

-e, --env <env>#

Environment variables transmitted to the run command environment. Format is KEY=val; multiple variables can be set with a separate --env for each.

--secret-env <secret_env>#

Environment variables transmitted to the run command environment. Format is KEY=val; multiple variables can be set with a separate --secret-env for each. Unlike environment variables specified with --env, these are only stored in our database temporarily.

-t, --tag <tag>#

Tags. Format is KEY=val; multiple tags can be set with a separate --tag for each.

--vm-type <vm_type>#

VM type to use. Specify multiple times to provide multiple options.

--arm#

Use ARM VM type.

--cpu <cpu>#

Number of cores per VM.

--memory <memory>#

Memory per VM.

--gpu#

Have a GPU available.

--region <region>#

The cloud provider region in which to run the job.

--disk-size <disk_size>#

Use larger-than-default disk on VM, specified in GiB.

--ntasks, --n-tasks <ntasks>#

Number of tasks to run. Tasks will have IDs from 0 to n-1; the COILED_ARRAY_TASK_ID environment variable for each task is set to the ID of the task.

--task-on-scheduler, --no-task-on-scheduler#

Run the task with the lowest ID on the scheduler node.

--array <array>#

Specify an array of tasks to run with specific IDs (instead of using --ntasks to run IDs from 0 to n-1). You can specify a list of IDs, a range, or a list with both IDs and ranges. For example, --array 2,4-6,8-10.

--scheduler-task-array <scheduler_task_array>#

Which tasks in the array to run on the scheduler node. In most cases you’ll probably want to use --task-on-scheduler instead, which runs the task with the lowest ID on the scheduler node.

-N, --max-workers <max_workers>#

Maximum number of worker nodes (by default, there will be as many worker nodes as tasks).

--wait-for-ready-cluster#

Only assign tasks once the full cluster is ready.

--forward-aws-credentials#

Forward STS token from local AWS credentials.

--package-sync-strict#

Require exact package version matches when using package sync.

--package-sync-conda-extras <package_sync_conda_extras>#

A list of conda package names (available on conda-forge) to include in the environment that are not in your local environment.

--host-setup-script <host_setup_script>#

Path to local script which will be run on each VM prior to running any tasks.

Arguments

COMMAND#

Optional argument(s)

coiled batch status#

coiled batch status [OPTIONS] [CLUSTER]

Options

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--format <format>#
Options:

json | table

--limit <limit>#

Arguments

CLUSTER#

Optional argument

coiled logs#

coiled logs [OPTIONS] CLUSTER

Options

--account, --workspace <account>#

Coiled workspace (uses default workspace if not specified). Note: --account is deprecated, please use --workspace instead.

--scheduler#

Get scheduler logs

--workers <workers>#

Get worker logs (‘any’, ‘all’, or comma-delimited list of names, states, or internal IP addresses)

--follow#

Passed directly to aws logs tail, see aws cli docs for details.

--filter <filter>#

Passed directly to aws logs tail, see aws cli docs for details.

--since <since>#

With --follow, uses the aws logs tail default (10m); otherwise defaults to the start time of the cluster.

--format <format>#

Passed directly to aws logs tail, see aws cli docs for details.

--profile <profile>#

Passed directly to aws logs tail, see aws cli docs for details.

Arguments

CLUSTER#

Required argument