Batch Jobs#

Run your jobs on the cloud in parallel

Coiled Batch is a lightweight API that makes it easy to run your code on any cloud hardware and scale out parallel workflows. This is useful when you want to run:

  1. In a region close to your cloud data

  2. On specific hardware (e.g. GPU, bigger machine)

  3. On many machines in parallel

  4. Any code (doesn’t need to be Python)

  5. With a lightweight, easy-to-use interface for the cloud

Quickstart#

To run a batch job, add special # COILED comments to your script to specify the cloud resources you want:

Spin up ten cloud VMs with 32 GB of memory to run their own echo command.

my_script.sh#
#!/bin/bash

# COILED ntasks 10
# COILED memory 32GB
# COILED container ubuntu:latest

echo Hello from $COILED_BATCH_TASK_ID

Then launch your script with coiled batch run:

$ coiled batch run my_script.sh

COILED_BATCH_TASK_ID is an identifier unique to each task; in this case it takes the values “0”, “1”, “2”, …, “9”.

Use # COILED comments directly in Python scripts. Drop the container directive to rely on Coiled’s environment synchronization to copy all of your Python libraries to the remote machines automatically.

my_script.py#
# COILED n-tasks     10
# COILED memory      8 GiB
# COILED region      us-east-2

import os

print(f"Hello from {os.environ['COILED_BATCH_TASK_ID']}")

Then launch your script with coiled batch run:

$ coiled batch run my_script.py

A common pattern is to list many files, get the i-th file, and run some command on that file.

For instance, you might have a file with a list of S3 paths:

inputs.txt#
s3://my-bucket/file1.csv
s3://my-bucket/file2.csv
s3://my-bucket/subdir/file3.csv
s3://my-bucket/subdir/file4.csv
...

And a script to process each file:

process.py#
# COILED memory      8 GiB
# COILED region      us-east-2

import sys
import s3fs

s3 = s3fs.S3FileSystem()

# Process that file
def process(filename):
    with s3.open(filename) as f:
        result = ...
    return result

filename = sys.argv[1]  # or use os.environ["COILED_BATCH_TASK_INPUT"]
result = process(filename)

# Store result
s3.put(...)  # put result back in S3 somewhere

You can then use coiled batch run to process each file in parallel:

$ coiled batch run \
    --map-over-file inputs.txt \
    process.py \$COILED_BATCH_TASK_INPUT

Ask for GPU machines if you need them.

my_gpu_script.py#
# COILED n-tasks     100
# COILED vm-type     g5.xlarge
# COILED region      us-east-2

import torch

# Load model
model = ...
model = model.to("cuda")

# Train model
for epoch in range(50):
    model.train()
    ...

Then launch your script with coiled batch run:

$ coiled batch run my_gpu_script.py

Under the hood Coiled will:

  • Inspect your script

  • Spin up appropriate machines as defined by # COILED comments

  • Download your software onto them or use the container you specified

  • Run your script

  • Shut down the machines
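The first step, inspecting the script, can be pictured with a short sketch. The helper below is hypothetical (the name `parse_coiled_comments` and its exact rules are assumptions, not Coiled's implementation); it only illustrates how `# COILED <option> <value>` lines could be collected into options.

```python
import re

# Hypothetical helper, not part of the coiled package.
# Matches lines of the form "# COILED <option> <value>".
COILED_LINE = re.compile(r"^#\s*COILED\s+(\S+)\s*(.*)$")


def parse_coiled_comments(script_text):
    """Collect '# COILED' directives from a script into a dict."""
    options = {}
    for line in script_text.splitlines():
        match = COILED_LINE.match(line.strip())
        if match:
            key, value = match.groups()
            # Assumption: an option without a value is treated as a boolean flag.
            options[key] = value.strip() or True
    return options


script = """#!/bin/bash

# COILED ntasks 10
# COILED memory 32GB
# COILED container ubuntu:latest

echo Hello from $COILED_BATCH_TASK_ID
"""

options = parse_coiled_comments(script)
print(options)
```

Lines that are not `# COILED` comments, including the shebang, are ignored, so the script itself stays a normal shell or Python file.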

Configuration#

Specify cloud hardware, software, number of tasks, etc. directly in your script using # COILED comments:

my_script.sh#
# COILED memory 64GB
# COILED region us-west-2

...

or pass them as command line arguments to coiled batch run:

$ coiled batch run --memory 64GB --region us-west-2 my_script.py

See the API docs for the full list of available options.
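If you launch jobs from another Python program, one possible approach is to assemble the same command line programmatically. The sketch below assumes a hypothetical helper, `batch_run_argv`, that maps keyword names to the `--option` flags shown above; it builds the argument list and does not launch anything.

```python
import shlex


def batch_run_argv(command, **options):
    """Hypothetical helper: build the argument list for `coiled batch run`."""
    argv = ["coiled", "batch", "run"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean flags such as --gpu take no value
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    argv.extend(command)
    return argv


argv = batch_run_argv(["my_script.py"], memory="64GB", region="us-west-2")
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)`. Because no shell is involved, nothing in the command is interpolated locally.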

Monitoring and Logs#

Batch tasks run remotely on cloud VMs in the background. To monitor the status of a batch job use the coiled batch status command, which shows information about task progress, start / stop times, cloud costs, and more.

$ coiled batch status
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ID           State   Tasks Done            Submitted             Finished  Approx Cloud Cost  Command             ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 659567       done      10 /  10  2024-11-18 17:25:08  2024-11-18 17:26:07              $0.01  process.py          │
│ 658315       done      10 /  10  2024-11-17 16:24:27  2024-11-17 16:25:20              $0.01  train.sh            │
│ 658292       done      20 /  20  2024-11-17 15:28:44  2024-11-17 15:29:39              $0.02  train.sh            │
│ 658288      pending     0 /  20  2024-11-17 15:24:32                                   $0.02  train.sh            │
│ 657615      pending     0 /  10  2024-11-16 20:32:46                                   $0.01  process.py          │
│ 656571       done      10 /  10  2024-11-15 14:32:35  2024-11-15 14:33:35              $0.01  process.py          │
│ 655848       done      20 /  20  2024-11-14 23:56:06  2024-11-14 23:58:01              $0.01  echo 'Hello, world' │
└────────────┴─────────┴────────────┴─────────────────────┴─────────────────────┴───────────────────┴─────────────────────┘

And while you can always look at jobs on the Coiled web UI, we also provide a convenient coiled batch logs command to get logs given a cluster ID.

$ coiled batch logs 661309
(Task 0)        2025-07-09 16:50:37.844000 Hello from 0
(Task 1)        2025-07-09 16:50:38.387000 Hello from 1
(Task 2)        2025-07-09 16:50:38.951000 Hello from 2
(Task 3)        2025-07-09 16:50:38.968000 Hello from 3
(Task 4)        2025-07-09 16:50:39.056000 Hello from 4
(Task 5)        2025-07-09 16:50:39.500000 Hello from 5
(Task 6)        2025-07-09 16:50:39.500000 Hello from 6
(Task 7)        2025-07-09 16:50:39.589000 Hello from 7
(Task 8)        2025-07-09 16:50:39.592000 Hello from 8
(Task 9)        2025-07-09 16:50:40.043000 Hello from 9
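If you capture this output for further processing, a small parser can group lines by task. The sketch below assumes the line layout shown in the example above (task label, timestamp, message); the layout is inferred from that sample rather than from a documented format, and `group_logs_by_task` is a hypothetical helper.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: "(Task <id>)   <YYYY-MM-DD HH:MM:SS.ffffff> <message>"
LOG_LINE = re.compile(
    r"^\(Task (\d+)\)\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (.*)$"
)


def group_logs_by_task(text):
    """Return {task_id: [(timestamp, message), ...]} for lines that match."""
    by_task = defaultdict(list)
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a task log line
        task_id, stamp, message = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        by_task[int(task_id)].append((when, message))
    return dict(by_task)


sample = """(Task 0)        2025-07-09 16:50:37.844000 Hello from 0
(Task 1)        2025-07-09 16:50:38.387000 Hello from 1
"""

logs = group_logs_by_task(sample)
print(logs[1][0][1])
```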

Parallelism#

Batch makes it just as easy to run a command once on a single VM as it is to run that command 10,000 times in parallel using 1,000 VMs.

To run something once:

$ coiled batch run my_script.sh

To run the same command in parallel, 10,000 times:

$ coiled batch run --ntasks 10000 my_script.sh

By default Coiled will use 1 VM per task, with up to 1,000 VMs. Each VM runs a single task at a time, and if there are more tasks than VMs, the tasks will be queued and start running once earlier tasks finish.

To limit the number of VMs, specify --max-workers:

$ coiled batch run --ntasks 10000 --max-workers 100 my_script.sh
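As a rough mental model of the sizing rules above, the sketch below computes how many worker VMs a job would use and a lower bound on how many tasks the busiest VM runs in sequence. The function names are hypothetical, and the model ignores the scheduler VM and any provisioning details.

```python
import math

DEFAULT_VM_LIMIT = 1000  # default cap described above


def planned_vms(ntasks, max_workers=None):
    """One VM per task, capped by the limit; -1 means no limit."""
    if max_workers is None:
        limit = DEFAULT_VM_LIMIT
    elif max_workers == -1:
        limit = ntasks
    else:
        limit = max_workers
    return min(ntasks, limit)


def min_rounds(ntasks, vms):
    """Lower bound on tasks run back to back by the busiest VM."""
    return math.ceil(ntasks / vms)


vms = planned_vms(10000, max_workers=100)
print(vms, min_rounds(10000, vms))
```

With 10,000 tasks and `--max-workers 100`, each VM works through at least 100 tasks, one at a time.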

Usually you don’t want to run the exact same thing 10,000 times. More often, you want to do the same operations with different inputs or parameters. There are a few ways to do this:

Each task has a unique ID, exposed as the COILED_BATCH_TASK_ID environment variable.

For example, you might run coiled batch run --ntasks 10 my_script.py with this as your script:

my_script.py#
import os

# Each task reads its own ID from the environment
task_id = os.environ.get("COILED_BATCH_TASK_ID")
print(f"Hello from task {task_id}")

Or more realistically, you could use this task ID to control what happens inside your script.
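One way to let the task ID control the work is to have each task pick its own slice of a shared list. The sketch below uses COILED_BATCH_TASK_ID together with COILED_BATCH_TASK_COUNT; the `my_share` helper and the placeholder file names are assumptions for illustration, and the round-robin split assumes IDs run from 0 to n-1 (as with --ntasks).

```python
import os


def my_share(items, task_id, task_count):
    """Return the items this task is responsible for (round-robin split)."""
    return items[task_id::task_count]


# Defaults let the script also run locally as a single task.
task_id = int(os.environ.get("COILED_BATCH_TASK_ID", "0"))
task_count = int(os.environ.get("COILED_BATCH_TASK_COUNT", "1"))

files = [f"file{i}.csv" for i in range(25)]  # placeholder work items

for name in my_share(files, task_id, task_count):
    print(f"task {task_id} handles {name}")
```

Every item is handled by exactly one task, and adding more tasks shrinks each task's share without changing the script.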

When you specify --ntasks 10, you’ll get tasks with IDs from 0 to 9. If you want to have more control over those task IDs, you can use --array to specify a list, a range, or a list of ranges. For example, you could use --array 2,4-6,8-10 to run tasks 2, 4, 5, 6, 8, 9, and 10; or use --array 0-10:2 to run tasks 0, 2, 4, 6, 8, and 10.
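To make the --array notation concrete, the sketch below expands a spec into the task IDs it describes. It is an illustration of the semantics stated above (ranges inclusive on both ends, optional `:step`), not Coiled's own parser, and `expand_array` is a hypothetical name.

```python
def expand_array(spec):
    """Expand a spec such as "2,4-6,8-10" or "0-10:2" into a list of task IDs."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        rng, _, step = part.partition(":")    # optional ":step" suffix
        start, dash, stop = rng.partition("-")
        if dash:
            # Ranges are treated as inclusive of the end value.
            ids.extend(range(int(start), int(stop) + 1, int(step or 1)))
        else:
            ids.append(int(start))
    return ids


print(expand_array("2,4-6,8-10"))
print(expand_array("0-10:2"))
```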

If you have a list of input values, you can easily run one task per input like so:

$ coiled batch run --map-over-values "first,second,third" my_script.py

The value is exposed as the COILED_BATCH_TASK_INPUT environment variable, or you can specify a different environment variable name using --map-over-input-var <ENV_VAR_NAME>.
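The mapping from values to tasks can be sketched as follows. This is an illustration only: `task_environments` is a hypothetical function that shows which environment variable each task would see, using the default comma delimiter for --map-over-values.

```python
def task_environments(values, delimiter=",", input_var="COILED_BATCH_TASK_INPUT"):
    """Return one environment dict per value, in the order the values appear."""
    return [{input_var: value} for value in values.split(delimiter)]


for env in task_environments("first,second,third"):
    print(env)

# With a custom variable name, as with --map-over-input-var NAME
print(task_environments("a,b", input_var="NAME"))
```

Inside the task, the script reads the value with `os.environ["COILED_BATCH_TASK_INPUT"]` (or the custom variable name).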

If you have a file with a list of inputs, you can easily run one task per line of that file.

For instance, you might have a file with a list of S3 paths:

inputs.txt#
s3://my-bucket/file1.csv
s3://my-bucket/file2.csv
s3://my-bucket/subdir/file3.csv
s3://my-bucket/subdir/file4.csv
...

And a script to process each file:

process.py#
# COILED memory      8 GiB
# COILED region      us-east-2

import sys
import s3fs

s3 = s3fs.S3FileSystem()

# Process that file
def process(filename):
    with s3.open(filename) as f:
        result = ...
    return result

filename = sys.argv[1]  # or use os.environ["COILED_BATCH_TASK_INPUT"]
result = process(filename)

# Store result
s3.put(...)  # put result back in S3 somewhere

You can then use coiled batch run to process each file in parallel:

$ coiled batch run \
    --map-over-file inputs.txt \
    process.py \$COILED_BATCH_TASK_INPUT

Note

To use task environment variables directly in the command you submit to coiled batch run, you need to escape the $ as \$. Otherwise $COILED_BATCH_TASK_INPUT will be interpolated by your local shell (where it probably isn’t set) before the job is submitted. For example, each task would run:

process.py <blank value>

instead of the intended:

process.py $COILED_BATCH_TASK_INPUT

Coordination#

Each task within a batch job has the following environment variables automatically set:

  • COILED_BATCH_TASK_ID: ID for the current running task. For example, “0”, “1”, “2”, etc.

  • COILED_BATCH_TASK_COUNT: Total number of tasks for the job.

  • COILED_BATCH_LOCAL_ADDRESS: IP address for the current running task.

  • COILED_BATCH_SCHEDULER_ADDRESS: IP address for the head node.

  • COILED_BATCH_PROCESS_TYPE: Either “scheduler” if running on the head node or “worker” otherwise.

  • COILED_BATCH_READY_WORKERS: Comma delimited list of IP addresses for VMs that are ready for work.

  • COILED_CLUSTER_ID: ID for the cluster where the batch job is running. A useful value to log to other systems, such as MLflow, to trace which cluster ran a particular job.

  • COILED_CLUSTER_NAME: Name of the cluster where the batch job is running.

  • COILED_CLUSTER_HOSTNAME: Hostname (e.g., cluster-xyz.dask.host) for the cluster scheduler.

These are often used for handling coordination among tasks. Depending on your specific workload, you may or may not need to use these. Coiled also injects common environment variables into the tasks.
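A script that coordinates across tasks might gather these variables once at startup. The sketch below is one possible shape: the `BatchContext` class, the `load_context` function, and the local fallback values are assumptions, while the variable names come from the list above.

```python
import os
from dataclasses import dataclass, field


@dataclass
class BatchContext:
    task_id: str
    task_count: int
    local_address: str
    scheduler_address: str
    process_type: str
    ready_workers: list = field(default_factory=list)

    @property
    def on_scheduler(self):
        """True when this task runs on the head node."""
        return self.process_type == "scheduler"


def load_context(env=os.environ):
    """Read the batch variables, with fallbacks so the script also runs locally."""
    ready = env.get("COILED_BATCH_READY_WORKERS", "")
    return BatchContext(
        task_id=env.get("COILED_BATCH_TASK_ID", "0"),
        task_count=int(env.get("COILED_BATCH_TASK_COUNT", "1")),
        local_address=env.get("COILED_BATCH_LOCAL_ADDRESS", "127.0.0.1"),
        scheduler_address=env.get("COILED_BATCH_SCHEDULER_ADDRESS", "127.0.0.1"),
        process_type=env.get("COILED_BATCH_PROCESS_TYPE", "worker"),
        # The variable is a comma-delimited list of IP addresses.
        ready_workers=[ip for ip in ready.split(",") if ip],
    )


ctx = load_context()
print(ctx.task_id, ctx.task_count, ctx.on_scheduler)
```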

Software#

Like other Coiled APIs, Coiled Batch can use

  • package sync to automatically sync your local Python packages to the cloud VMs,

  • manually defined Coiled software environments,

  • or Docker container images.

Unlike other Coiled APIs that depend on dask, if you use a Docker container with Coiled Batch, there’s no need for that container to have dask or distributed.

For example, see this example using a pre-built public GDAL container to run GDAL, or this example using a pre-built public uv container to install Python dependencies on the fly.

Data Access#

For short jobs, you can forward an STS token from your local AWS credentials using --forward-aws-credentials. Because Batch Jobs are a fire-and-forget API, Coiled does not refresh the forwarded STS token, so it may expire while your job is still running. (The expiration depends on details about how your AWS account and your local AWS credentials are configured.)

For longer jobs that require AWS credentials, we suggest configuring the Instance Profile with the permissions your code needs.

For Google Cloud Compute, refer to the data access guide for instructions to use a service account.

Timeouts#

For timeouts on individual tasks, you can use the timeout utility (included in most Linux containers).

Instead of coiled batch run python my_script.py, you’d run coiled batch run timeout 600 python my_script.py so that any task that runs for more than 600 seconds is stopped and exits with a non-zero exit code (124 with GNU timeout).
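If you would rather enforce a per-task limit from inside a Python entry point, the standard library offers a comparable mechanism. This is a swapped-in alternative to the timeout utility, not something Coiled provides; `run_with_timeout` is a hypothetical helper, and returning 124 simply mirrors the convention of GNU timeout.

```python
import subprocess
import sys


def run_with_timeout(argv, seconds):
    """Run argv; return its exit code, or 124 if it exceeded the time limit."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        return subprocess.run(argv, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return 124


fast = [sys.executable, "-c", "print('done')"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]

print(run_with_timeout(fast, 30))
print(run_with_timeout(slow, 0.5))
```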

For timeouts on the entire batch job, you can specify --job-timeout like so:

$ coiled batch run --job-timeout "1 hour" python my_script.py

Examples#

Here are examples of how you can use Batch Jobs:

API#

Coiled Batch jobs can be submitted and monitored using either the coiled batch CLI or coiled.batch Python API.

CLI#

coiled batch run#

Submit a batch job to run on Coiled.

Batch Jobs is currently an experimental feature.

coiled batch run [OPTIONS] COMMAND...

Options

--name <name>#

Name to use for Coiled cluster.

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--software <software>#

Existing Coiled software environment (Coiled will sync local Python software environment if neither software nor container is specified).

--container <container>#

Docker container in which to run the batch job tasks; this does not need to have Dask (or even Python), only what your task needs in order to run.

--ignore-container-entrypoint <ignore_container_entrypoint>#

Ignore entrypoint for specified Docker container (like docker run --entrypoint); default is to use the entrypoint (if any) set on the image.

--run-on-host <run_on_host>#

Run code directly on host, not inside docker container.

-e, --env <env>#

Environment variables transmitted to run command environment. Format is KEY=val, multiple vars can be set with separate --env for each.

--secret-env <secret_env>#

Environment variables transmitted to run command environment. Format is KEY=val, multiple vars can be set with separate --secret-env for each. Unlike environment variables specified with --env, these are only stored in our database temporarily.

--env-file <env_file>#

Path to .env file; all variables set in the file will be transmitted to run command environment.

--secret-env-file <secret_env_file>#

Path to .env file; all variables set in the file will be transmitted to run command environment. These environment variables will only be stored in our database temporarily.

-t, --tag <tag>#

Tags. Format is KEY=val, multiple vars can be set with separate --tag for each.

--vm-type <vm_type>#

VM type to use. Specify multiple times to provide multiple options.

--scheduler-vm-type <scheduler_vm_type>#

VM type to use specifically for scheduler. Default is to use small VM if scheduler is not running tasks, or use same VM type(s) for all nodes if scheduler node is running tasks.

--arm#

Use ARM VM type.

--cpu <cpu>#

Number of cores per VM.

--memory <memory>#

Memory per VM.

--gpu#

Have a GPU available.

--region <region>#

The cloud provider region in which to run the job.

--spot-policy <spot_policy>#

Default is on-demand; allows using spot VMs, or spot VMs as available with on-demand as a fallback. Only applies to workers (scheduler VM is always on-demand).

Options:

on-demand | spot | spot_with_fallback

--allow-cross-zone, --no-cross-zone#

Allow workers to be placed in different availability zones.

--disk-size <disk_size>#

Use larger-than-default disk on VM, specified in GiB.

--allow-ssh-from <allow_ssh_from>#

IP address or CIDR from which connections to port 22 (SSH) are open; can also be specified as ‘everyone’ (0.0.0.0/0) or ‘me’ (automatically uses the public IP detected for your local client).

--map-over-values <map_over_values>#

A list of values such that for each value, a task will be run with that value as the input. If you specify --map-over-values 'first,second,third', then batch will run three tasks with inputs ‘first’, ‘second’, and ‘third’. By default the input is passed to the task in the COILED_BATCH_TASK_INPUT environment variable, so one task will get COILED_BATCH_TASK_INPUT=first and so on.

--map-over-file <map_over_file>#

Like --map-over-values, but instead of specifying the string of values directly, you specify the path to a file with the values. Note that by default, each line in the file is treated as an individual value; this can be controlled with the --map-over-delimiter option.

--map-over-input-var <map_over_input_var>#

The value from --map-over-values or --map-over-file is exposed to the task as an environment variable. By default, the environment variable is COILED_BATCH_TASK_INPUT, but you can specify a different name for the environment variable using this option.

--map-over-delimiter <map_over_delimiter>#

Delimiter for splitting the string from --map-over-values or the file contents from --map-over-file into individual values. By default this is ‘,’ for --map-over-values and newline for --map-over-file.

--wait#

--upload <local_upload_path>#

File or directory to upload to cloud storage and download onto the VM(s). By default files will be copied into the working directory on VM where your batch script runs.

--download <local_download_path>#

When used with --wait, output files from job will be downloaded into this local directory when job is complete. When used without --wait, files won’t be automatically downloaded, but job will be configured to store result files in cloud storage for later download.

--sync <local_sync_path>#

Equivalent to specifying both --upload and --download with the same local directory.

--pipe-to-files#

Write stdout and stderr from each task to files which can be downloaded when the job is complete. This is in addition to sending stdout and stderr to logs, and is more convenient than logs when you want to use outputs from tasks as inputs to further processing.

--input-filestore <input_filestore>#

Name of input filestore

--output-filestore <output_filestore>#

Name of output filestore

--scheduler-sidecar-spec <scheduler_sidecar_spec>#

Filename for scheduler sidecar spec (yaml or json)

--ntasks, --n-tasks <ntasks>#

Number of tasks to run. Tasks will have IDs from 0 to n-1; the COILED_ARRAY_TASK_ID environment variable for each task is set to the ID of the task.

--task-on-scheduler, --no-task-on-scheduler#

Run task with lowest job ID on scheduler node.

--array <array>#

Specify an array of tasks to run with specific IDs (instead of using --ntasks to create an array from 0 to n-1). You can specify a list of IDs, a range, or a list with IDs and ranges. For example, --array 2,4-6,8-10.

--scheduler-task-array <scheduler_task_array>#

Which tasks in array to run on the scheduler node. In most cases you’ll probably want to use --task-on-scheduler instead to run task with lowest ID on the scheduler node.

-N, --max-workers <max_workers>#

Maximum number of worker nodes. By default, there will be as many worker nodes as tasks, up to 1000; use -1 to explicitly request no limit.

--wait-for-ready-cluster#

Only assign tasks once full cluster is ready.

--forward-aws-credentials#

Forward STS token from local AWS credentials.

--package-sync-strict#

Require exact package version matches when using package sync.

--package-sync-conda-extras <package_sync_conda_extras>#

A list of conda package names (available on conda-forge) to include in the environment that are not in your local environment.

--package-sync-ignore <package_sync_ignore>#

A list of package names to exclude from the environment. Note that their dependencies may still be installed, or they may be installed by another package that depends on them!

--host-setup-script <host_setup_script>#

Path to local script which will be run on each VM prior to running any tasks.

--job-timeout <job_timeout>#

Timeout for batch job; timer starts when the job starts running (after VMs have been provisioned). For example, you can specify ‘30 minutes’ or ‘1 hour’. Default is no timeout.

--dask-container <dask_container>#

Arguments

COMMAND#

Required argument(s)

coiled batch status#

Check the status of a Coiled Batch job.

coiled batch status [OPTIONS] [CLUSTER]

Options

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--format <format>#

Options:

json | table

--sort <sort>#

Arguments

CLUSTER#

Optional argument

coiled batch list#

List Coiled Batch jobs in a workspace.

coiled batch list [OPTIONS]

Options

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--format <format>#

Options:

json | table

--limit <limit>#

coiled batch wait#

Monitor the progress of a Coiled Batch job.

coiled batch wait [OPTIONS] [CLUSTER]

Options

--workspace <workspace>#

Coiled workspace (uses default workspace if not specified).

--download <download>#

Arguments

CLUSTER#

Optional argument

coiled logs#

coiled logs [OPTIONS] CLUSTER

Options

--account, --workspace <account>#

Coiled workspace (uses default workspace if not specified). Note: --account is deprecated, please use --workspace instead.

--scheduler#

Get scheduler logs

--workers <workers>#

Get worker logs (‘any’, ‘all’, or comma-delimited list of names, states, or internal IP addresses)

--follow#

Passed directly to aws logs tail, see aws cli docs for details.

--filter <filter>#

Passed directly to aws logs tail, see aws cli docs for details.

--since <since>#

For follow, uses aws logs tail default (10m), otherwise defaults to start time of cluster.

--format <format>#

Passed directly to aws logs tail, see aws cli docs for details.

--profile <profile>#

Passed directly to aws logs tail, see aws cli docs for details.

Arguments

CLUSTER#

Required argument

Python#

coiled.batch.run(command, *, name=None, workspace=None, software=None, container=None, run_on_host=None, cluster_kwargs=None, env=None, secret_env=None, tag=None, vm_type=None, scheduler_vm_type=None, arm=False, cpu=None, memory=None, gpu=False, region=None, spot_policy=None, allow_cross_zone=None, disk_size=None, allow_ssh_from=None, ntasks=None, task_on_scheduler=None, array=None, scheduler_task_array=None, map_over_values=None, map_over_file=None, map_over_input_var=None, map_over_task_var_dicts=None, map_over_delimiter=None, max_workers=None, wait_for_ready_cluster=None, forward_aws_credentials=None, package_sync_strict=False, package_sync_conda_extras=None, package_sync_ignore=None, local_upload_path=None, buffers_to_upload=None, host_setup_script=None, host_setup_script_content=None, command_as_script=None, ignore_container_entrypoint=None, job_timeout=None, logger=None)

Submit a batch job to run on Coiled.

See coiled batch run --help for documentation.

Return type:

dict

Additional Parameters#

map_over_task_var_dicts

Takes a list of dictionaries, so you can specify multiple environment variables for each task. For example, [{"FOO": 1, "BAR": 2}, {"FOO": 3, "BAR": 4}] will pass FOO=1 BAR=2 to one task and FOO=3 BAR=4 to another.

buffers_to_upload

Takes a list of dictionaries. Each dictionary should have the path where the file should be written on the VM(s), relative to the working directory, and an io.BytesIO which provides the content of the file. For example, [{"relative_path": "hello.txt", "buffer": io.BytesIO(b"hello")}].
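Both of these parameters take plain Python structures, so they are easy to generate. The sketch below builds them from a parameter grid and from in-memory text; `grid_to_task_vars` and `texts_to_buffers` are hypothetical helpers, and nothing here calls Coiled.

```python
import io
import itertools


def grid_to_task_vars(grid):
    """One dict per combination of values, in the shape of map_over_task_var_dicts."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in itertools.product(*grid.values())]


def texts_to_buffers(files):
    """Wrap in-memory text files in the shape of buffers_to_upload."""
    return [
        {"relative_path": path, "buffer": io.BytesIO(text.encode("utf-8"))}
        for path, text in files.items()
    ]


task_vars = grid_to_task_vars({"FOO": [1, 3], "BAR": [2, 4]})
buffers = texts_to_buffers({"hello.txt": "hello"})

print(len(task_vars), task_vars[0])
print(buffers[0]["relative_path"])
```

A grid with two values for each of two variables yields four tasks, one per combination. The resulting lists match the shapes described above for map_over_task_var_dicts and buffers_to_upload.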

coiled.batch.status(cluster='', workspace=None)

Check the status of a Coiled Batch job.

See coiled batch status --help for documentation.

Return type:

list[dict]

coiled.batch.list_jobs(workspace=None, limit=10)

List Coiled Batch jobs in a workspace.

See coiled batch list --help for documentation.

Return type:

list[dict]