Batch Jobs#

Run your jobs on the cloud in parallel

Coiled Batch jobs are a lightweight API that makes it easy to run your code on any cloud hardware and to scale out parallel workflows. This is useful when you want to run:

  1. In a region close to your cloud data

  2. On specific hardware (e.g. GPU, bigger machine)

  3. On many machines in parallel

  4. Any code (doesn’t need to be Python)

  5. With a lightweight, easy-to-use interface for the cloud

To run a batch job, add special #COILED comments to your script to specify the cloud resources you want:

# my_script.sh

#COILED memory 32GB
#COILED container ubuntu:latest
#COILED ntasks 10

echo Hello from $COILED_BATCH_TASK_ID

and then launch your script with coiled batch run:

$ coiled batch run my_script.sh

Under the hood Coiled will:

  • Inspect your script

  • Spin up appropriate machines as defined by #COILED comments

  • Download your software onto them

  • Run your script

  • Shut down the machines

Your script can do anything, and you can run it on any cloud hardware. Coiled Batch jobs are simple but powerful. Let’s explore a few more examples.

Examples#

Here is the example from above again, for completeness:

#COILED memory 32GB
#COILED container ubuntu:latest
#COILED ntasks 10

echo Hello from $COILED_BATCH_TASK_ID

Here, ten cloud VMs with 32 GB of memory each are spun up, and each one runs its own echo command. COILED_BATCH_TASK_ID is a unique identifier for each task, which runs from “0”, “1”, “2”, …, “9” in this case.

coiled batch run also works directly with Python scripts:

#COILED n-tasks     10
#COILED memory      8 GiB
#COILED region      us-east-2

import os

print(f"Hello from {os.environ['COILED_BATCH_TASK_ID']}")

Here we also drop the container directive, and rely on Coiled’s environment synchronization to copy all of our Python libraries to the remote machines automatically.

A common pattern is to list many files, get the i-th file, and run some command on that file.

#COILED n-tasks     1000
#COILED memory      8 GiB
#COILED region      us-east-2

import os
import s3fs

s3 = s3fs.S3FileSystem()

# Get the i'th file to process
task_id = int(os.environ["COILED_BATCH_TASK_ID"])
filenames = s3.ls("my-bucket")
filename = filenames[task_id]

# Process that file
def process(filename):
    result = ...
    return result

result = process(filename)

# Store result
s3.put(...)  # put result back in S3 somewhere

We can ask for any hardware on any cloud, including a collection of GPU machines.

#COILED n-tasks     100
#COILED vm-type     g5.xlarge
#COILED region      us-east-2

import torch

# Load model
model = ...
model = model.to("cuda")

# Train model
for epoch in range(50):
    model.train()
    ...

Configuration#

Coiled Batch options for specifying cloud hardware, software, number of tasks, etc. can either be added as #COILED comments in your script:

# my_script.py

#COILED memory 64GB
#COILED region us-west-2

...

or passed as command line arguments directly to coiled batch run:

$ coiled batch run --memory 64GB --region us-west-2 my_script.py

See the API docs for the full list of available options.

Monitoring and Logs#

Batch tasks run remotely on cloud VMs in the background. To monitor the status of a batch job, use the coiled batch status command, which shows information about task progress, start/stop times, cloud costs, and more.

$ coiled batch status
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ID           State   Tasks Done            Submitted             Finished  Approx Cloud Cost  Command             ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 659567       done      10 /  10  2024-11-18 17:25:08  2024-11-18 17:26:07              $0.01  process.py          │
│ 658315       done      10 /  10  2024-11-17 16:24:27  2024-11-17 16:25:20              $0.01  train.sh            │
│ 658292       done      20 /  20  2024-11-17 15:28:44  2024-11-17 15:29:39              $0.02  train.sh            │
│ 658288      pending     0 /  20  2024-11-17 15:24:32                                   $0.02  train.sh            │
│ 657615      pending     0 /  10  2024-11-16 20:32:46                                   $0.01  process.py          │
│ 656571       done      10 /  10  2024-11-15 14:32:35  2024-11-15 14:33:35              $0.01  process.py          │
│ 655848       done      20 /  20  2024-11-14 23:56:06  2024-11-14 23:58:01              $0.01  echo 'Hello, world' │
└────────────┴─────────┴────────────┴─────────────────────┴─────────────────────┴───────────────────┴─────────────────────┘

While you can always look at jobs in the Coiled UI, we also provide a convenient coiled logs command for retrieving logs given a job ID.

$ coiled logs 661309 | grep Hello
2024-11-19 15:59:46.131000 Hello from 0
2024-11-19 15:59:46.419000 Hello from 1
2024-11-19 15:59:46.743000 Hello from 2
2024-11-19 15:59:47.019000 Hello from 3
2024-11-19 15:59:47.324000 Hello from 4
2024-11-19 15:59:47.587000 Hello from 5
2024-11-19 15:59:47.862000 Hello from 6
2024-11-19 15:59:48.256000 Hello from 7
2024-11-19 15:59:48.315000 Hello from 8
2024-11-19 15:59:48.508000 Hello from 9

Coordination#

Each task within a batch job has the following environment variables automatically set:

  • COILED_BATCH_TASK_ID: ID for the current running task. For example, “0”, “1”, “2”, etc.

  • COILED_BATCH_LOCAL_ADDRESS: IP address for the current running task.

  • COILED_BATCH_SCHEDULER_ADDRESS: IP address for the head node.

  • COILED_BATCH_PROCESS_TYPE: Either “scheduler” if running on the head node or “worker” otherwise.

  • COILED_BATCH_READY_WORKERS: Comma delimited list of IP addresses for VMs that are ready for work.

These are often used for handling coordination among tasks. Depending on your specific workload, you may or may not need to use these.
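
For example, here is a minimal sketch of one possible coordination pattern, where the task running on the head node behaves differently from the other tasks. The branching logic below is illustrative, not something Coiled requires:

import os

task_id = os.environ["COILED_BATCH_TASK_ID"]
process_type = os.environ["COILED_BATCH_PROCESS_TYPE"]

if process_type == "scheduler":
    # Hypothetical head-node role: check which workers are ready
    workers = os.environ["COILED_BATCH_READY_WORKERS"].split(",")
    print(f"Head node: {len(workers)} workers ready")
else:
    # Hypothetical per-task work keyed off the task ID
    local_ip = os.environ["COILED_BATCH_LOCAL_ADDRESS"]
    print(f"Task {task_id} running on {local_ip}")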

Data Access#

For short jobs, you can forward an STS token from your local AWS credentials using --forward-aws-credentials. Because Batch Jobs are a fire-and-forget API, Coiled does not refresh the STS token we forward, so it may expire while your job is still running. (The expiration depends on how your AWS account and your local AWS credentials are configured.)
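
For example, to forward your local credentials when submitting a short job (the script name here is just illustrative):

$ coiled batch run --forward-aws-credentials process.py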

For longer jobs that require AWS credentials, we suggest configuring the Instance Profile with the permissions your code needs.

API#

Coiled Batch jobs can be submitted and monitored using either the coiled batch CLI or coiled.batch Python API.

CLI#

coiled batch run#

Submit a batch job to run on Coiled.

Batch Jobs is currently an experimental feature.

coiled batch run [OPTIONS] [COMMAND]...

Options

--name <name>#

Name to use for Coiled cluster.

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--software <software>#

Existing Coiled software environment (Coiled will sync local Python software environment if neither software nor container is specified).

--container <container>#

Docker container in which to run the batch job tasks; this does not need to have Dask (or even Python), only what your task needs in order to run.

-e, --env <env>#

Environment variables transmitted to run command environment. Format is KEY=val, multiple vars can be set with separate --env for each.

--secret-env <secret_env>#

Environment variables transmitted to run command environment. Format is KEY=val, multiple vars can be set with separate --secret-env for each. Unlike environment variables specified with --env, these are only stored in our database temporarily.

-t, --tag <tag>#

Tags. Format is KEY=val, multiple vars can be set with separate --tag for each.

--vm-type <vm_type>#

VM type to use. Specify multiple times to provide multiple options.

--arm#

Use ARM VM type.

--cpu <cpu>#

Number of cores per VM.

--memory <memory>#

Memory per VM.

--gpu#

Have a GPU available.

--region <region>#

The cloud provider region in which to run the job.

--spot-policy <spot_policy>#

Default is on-demand; allows using spot VMs, or spot VMs as available with on-demand as a fallback.

Options:

on-demand | spot | spot_with_fallback

--allow-cross-zone, --no-cross-zone#

Allow workers to be placed in different availability zones.

--disk-size <disk_size>#

Use larger-than-default disk on VM, specified in GiB.

--allow-ssh-from <allow_ssh_from>#

IP address or CIDR from which connections to port 22 (SSH) are open; can also be specified as ‘everyone’ (0.0.0.0/0) or ‘me’ (automatically determines public IP detected for your local client).

--ntasks, --n-tasks <ntasks>#

Number of tasks to run. Tasks will have ID from 0 to n-1, the COILED_ARRAY_TASK_ID environment variable for each task is set to the ID of the task.

--task-on-scheduler, --no-task-on-scheduler#

Run task with lowest job ID on scheduler node.

--array <array>#

Specify array of tasks to run with specific IDs (instead of using --ntasks to array from 0 to n-1). You can specify list of IDs, a range, or a list with IDs and ranges. For example, --array 2,4-6,8-10.

--scheduler-task-array <scheduler_task_array>#

Which tasks in array to run on the scheduler node. In most cases you’ll probably want to use --task-on-scheduler instead to run task with lowest ID on the scheduler node.

-N, --max-workers <max_workers>#

Maximum number of worker nodes (by default, there will be as many worker nodes as tasks).

--wait-for-ready-cluster#

Only assign tasks once full cluster is ready.

--forward-aws-credentials#

Forward STS token from local AWS credentials.

--package-sync-strict#

Require exact package version matches when using package sync.

--package-sync-conda-extras <package_sync_conda_extras>#

A list of conda package names (available on conda-forge) to include in the environment that are not in your local environment.

--host-setup-script <host_setup_script>#

Path to local script which will be run on each VM prior to running any tasks.

Arguments

COMMAND#

Optional argument(s)
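
For example, a hypothetical invocation combining several of the options above (the script name is illustrative):

$ coiled batch run --ntasks 50 --memory 16GB --region us-east-2 --spot-policy spot_with_fallback my_script.py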

coiled batch status#

Check the status of a Coiled Batch job.

coiled batch status [OPTIONS] [CLUSTER]

Options

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--format <format>#
Options:

json | table

--sort <sort>#

Arguments

CLUSTER#

Optional argument

coiled batch list#

List Coiled Batch jobs in a workspace.

coiled batch list [OPTIONS]

Options

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--format <format>#
Options:

json | table

--limit <limit>#

coiled batch wait#

Monitor the progress of a Coiled Batch job.

coiled batch wait [OPTIONS] [CLUSTER]

Options

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

Arguments

CLUSTER#

Optional argument

coiled logs#

coiled logs [OPTIONS] CLUSTER

Options

--account, --workspace <account>#

Coiled workspace (uses default workspace if not specified). Note: --account is deprecated, please use --workspace instead.

--scheduler#

Get scheduler logs

--workers <workers>#

Get worker logs (‘any’, ‘all’, or comma-delimited list of names, states, or internal IP addresses)

--follow#

Passed directly to aws logs tail, see aws cli docs for details.

--filter <filter>#

Passed directly to aws logs tail, see aws cli docs for details.

--since <since>#

For follow, uses aws logs tail default (10m), otherwise defaults to start time of cluster.

--format <format>#

Passed directly to aws logs tail, see aws cli docs for details.

--profile <profile>#

Passed directly to aws logs tail, see aws cli docs for details.

Arguments

CLUSTER#

Required argument

Python#

coiled.batch.run(command, *, name=None, workspace=None, software=None, container=None, env=None, secret_env=None, tag=None, vm_type=None, arm=False, cpu=None, memory=None, gpu=False, region=None, spot_policy=None, allow_cross_zone=None, disk_size=None, allow_ssh_from=None, ntasks=None, task_on_scheduler=None, array=None, scheduler_task_array=None, max_workers=None, wait_for_ready_cluster=None, forward_aws_credentials=None, package_sync_strict=False, package_sync_conda_extras=None, host_setup_script=None, logger=None)[source]

Submit a batch job to run on Coiled.

See coiled batch run --help for documentation.

Return type:

dict

coiled.batch.status(cluster='', workspace=None)[source]

Check the status of a Coiled Batch job.

See coiled batch status --help for documentation.

Return type:

list[dict]

coiled.batch.list_jobs(workspace=None, limit=10)[source]

List Coiled Batch jobs in a workspace.

See coiled batch list --help for documentation.

Return type:

list[dict]
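
For example, a minimal sketch of submitting and inspecting a job from Python. The keyword arguments mirror the CLI options above; passing the command as a single string, and the contents of the returned dictionaries, are assumptions here:

import coiled.batch

# Submit a batch job; keyword arguments mirror the CLI options above.
job = coiled.batch.run(
    "python my_script.py",
    ntasks=10,
    memory="8 GiB",
    region="us-east-2",
)

# List recent batch jobs in the default workspace.
for info in coiled.batch.list_jobs(limit=5):
    print(info)

# Check the status of batch jobs.
print(coiled.batch.status())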