Batch Jobs#
Run your jobs on the cloud in parallel
Coiled Batch jobs are a lightweight API that makes it easy to run your code on any cloud hardware and scale out parallel workflows. This is useful when you want to run:
In a region close to your cloud data
On specific hardware (e.g. GPU, bigger machine)
On many machines in parallel
Any code (doesn’t need to be Python)
With a lightweight, easy-to-use interface for the cloud
Quickstart#
To run a batch job, add special #COILED
comments to your script to specify the cloud resources you want:
Spin up ten cloud VMs, each with 32 GB of memory, and run an echo command on each.
my_script.sh
#!/bin/bash
#COILED ntasks 10
#COILED memory 32GB
#COILED container ubuntu:latest
echo Hello from $COILED_BATCH_TASK_ID
Then launch your script with coiled batch run:
$ coiled batch run my_script.sh
COILED_BATCH_TASK_ID
is an identifier unique to each task; in this case it runs from “0”, “1”, “2”, …, “9”.
You can also use #COILED comments directly in Python scripts. Drop the container directive to rely on Coiled’s environment synchronization, which copies all of your Python libraries to the remote machines automatically.
my_script.py
#COILED n-tasks 10
#COILED memory 8 GiB
#COILED region us-east-2
import os
print(f"Hello from {os.environ['COILED_BATCH_TASK_ID']}")
and then launch your script with coiled batch run:
$ coiled batch run my_script.py
A common pattern is to list many files, get the i-th file, and run some command on that file.
For instance, you might have a file with a list of S3 paths:
inputs.txt
s3://my-bucket/file1.csv
s3://my-bucket/file2.csv
s3://my-bucket/subdir/file3.csv
s3://my-bucket/subdir/file4.csv
...
And a script to process each file:
process.py
#COILED memory 8 GiB
#COILED region us-east-2
import sys
import s3fs
s3 = s3fs.S3FileSystem()
# Process that file
def process(filename):
    with s3.open(filename) as f:
        result = ...
    return result

filename = sys.argv[1]  # or use os.environ["COILED_BATCH_TASK_INPUT"]
result = process(filename)
# Store result
s3.put(...) # put result back in S3 somewhere
You can then use coiled batch run to process each file in parallel:
$ coiled batch run \
--map-over-file inputs.txt \
process.py \$COILED_BATCH_TASK_INPUT
Ask for GPU machines if you need them.
my_gpu_script.py
#COILED n-tasks 100
#COILED vm-type g5.xlarge
#COILED region us-east-2
import torch
# Load model
model = ...
model = model.to("cuda")
# Train model
for epoch in range(50):
    model.train()
    ...
and then launch your script with coiled batch run:
$ coiled batch run my_gpu_script.py
Under the hood Coiled will:
Inspect your script
Spin up appropriate machines as defined by #COILED comments
Download your software onto them or use the container you specified
Run your script
Shut down the machines
Configuration#
Specify cloud hardware, software, number of tasks, etc. directly in your script using #COILED
comments:
my_script.sh
#COILED memory 64GB
#COILED region us-west-2
...
or pass them as command line arguments to coiled batch run:
$ coiled batch run --memory 64GB --region us-west-2 my_script.py
See the API docs for the full list of available options.
Monitoring and Logs#
Batch tasks run remotely on cloud VMs in the background. To monitor
the status of a batch job use the coiled batch status
command, which
shows information about task progress, start / stop times, cloud costs, and more.
$ coiled batch status
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ID ┃ State ┃ Tasks Done ┃ Submitted ┃ Finished ┃ Approx Cloud Cost ┃ Command ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 659567 │ done │ 10 / 10 │ 2024-11-18 17:25:08 │ 2024-11-18 17:26:07 │ $0.01 │ process.py │
│ 658315 │ done │ 10 / 10 │ 2024-11-17 16:24:27 │ 2024-11-17 16:25:20 │ $0.01 │ train.sh │
│ 658292 │ done │ 20 / 20 │ 2024-11-17 15:28:44 │ 2024-11-17 15:29:39 │ $0.02 │ train.sh │
│ 658288 │ pending │ 0 / 20 │ 2024-11-17 15:24:32 │ │ $0.02 │ train.sh │
│ 657615 │ pending │ 0 / 10 │ 2024-11-16 20:32:46 │ │ $0.01 │ process.py │
│ 656571 │ done │ 10 / 10 │ 2024-11-15 14:32:35 │ 2024-11-15 14:33:35 │ $0.01 │ process.py │
│ 655848 │ done │ 20 / 20 │ 2024-11-14 23:56:06 │ 2024-11-14 23:58:01 │ $0.01 │ echo 'Hello, world' │
└────────────┴─────────┴────────────┴─────────────────────┴─────────────────────┴───────────────────┴─────────────────────┘
And while you can always look at jobs on the Coiled web UI, we also provide a
convenient coiled batch logs
command to get logs given a cluster ID.
$ coiled batch logs 661309
(Task 0) 2025-07-09 16:50:37.844000 Hello from 0
(Task 1) 2025-07-09 16:50:38.387000 Hello from 1
(Task 2) 2025-07-09 16:50:38.951000 Hello from 2
(Task 3) 2025-07-09 16:50:38.968000 Hello from 3
(Task 4) 2025-07-09 16:50:39.056000 Hello from 4
(Task 5) 2025-07-09 16:50:39.500000 Hello from 5
(Task 6) 2025-07-09 16:50:39.500000 Hello from 6
(Task 7) 2025-07-09 16:50:39.589000 Hello from 7
(Task 8) 2025-07-09 16:50:39.592000 Hello from 8
(Task 9) 2025-07-09 16:50:40.043000 Hello from 9
Parallelism#
Batch makes it just as easy to run a command once on a single VM as it is to run that command 10,000 times in parallel using 1,000 VMs.
To run something once:
$ coiled batch run my_script.sh
To run the same command in parallel, 10,000 times:
$ coiled batch run --ntasks 10000 my_script.sh
By default Coiled will use 1 VM per task, with up to 1,000 VMs. Each VM runs a single task at a time, and if there are more tasks than VMs, the tasks will be queued and start running once earlier tasks finish.
To limit the number of VMs, specify --max-workers:
$ coiled batch run --ntasks 10000 --max-workers 100 my_script.sh
Usually you don’t want to run the exact same thing 10,000 times. More often, you want to do the same operations with different inputs or parameters. There are a few ways to do this:
Each task has a unique ID, exposed as the COILED_BATCH_TASK_ID
environment variable.
For example, you might run coiled batch run --ntasks 10 my_script.py
with this as your script:
my_script.py
import os

task_id = os.environ.get("COILED_BATCH_TASK_ID")
print(f"Hello from task {task_id}")
Or more realistically, you could use this task ID to control what happens inside your script.
When you specify --ntasks 10, you’ll get tasks with IDs from 0 to 9.
If you want more control over those task IDs, you can use --array to specify a list, a range, or a list of ranges.
For example, you could use --array 2,4-6,8-10 to run tasks 2, 4, 5, 6, 8, 9, and 10;
or use --array 0-10:2 to run tasks 0, 2, 4, 6, 8, and 10.
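The same task arrays can also be requested from the Python API (documented below). The following is a minimal sketch; it assumes the command can be passed as a single string and that the array argument accepts the same specification format as the CLI flag:
import coiled.batch

# Run my_script.py as tasks 0, 2, 4, 6, 8, and 10 (mirrors --array 0-10:2)
coiled.batch.run(
    "python my_script.py",  # the command each task runs (string form is an assumption)
    array="0-10:2",
)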
If you have a list of input values, you can easily run one task per input like so:
$ coiled batch run --map-over-values "first,second,third" my_script.py
The value is exposed as the COILED_BATCH_TASK_INPUT environment variable, or you can specify a different environment variable name using --map-over-input-var <ENV_VAR_NAME>.
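If you prefer the Python API, the same mapping looks roughly like the sketch below, which assumes map_over_values accepts a list of values and map_over_input_var renames the environment variable, mirroring the CLI options above:
import coiled.batch

coiled.batch.run(
    "python my_script.py",
    map_over_values=["first", "second", "third"],  # one task per value (list form is an assumption)
    map_over_input_var="MY_INPUT",  # hypothetical name; defaults to COILED_BATCH_TASK_INPUT
)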
If you have a file with a list of inputs, you can easily run one task per line of that file.
For instance, you might have a file with a list of S3 paths:
inputs.txt
s3://my-bucket/file1.csv
s3://my-bucket/file2.csv
s3://my-bucket/subdir/file3.csv
s3://my-bucket/subdir/file4.csv
...
And a script to process each file:
process.py
#COILED memory 8 GiB
#COILED region us-east-2
import sys
import s3fs
s3 = s3fs.S3FileSystem()
# Process that file
def process(filename):
    with s3.open(filename) as f:
        result = ...
    return result

filename = sys.argv[1]  # or use os.environ["COILED_BATCH_TASK_INPUT"]
result = process(filename)
# Store result
s3.put(...) # put result back in S3 somewhere
You can then use coiled batch run
to process each file in parallel:
$ coiled batch run \
--map-over-file inputs.txt \
process.py \$COILED_BATCH_TASK_INPUT
Note that to use task environment variables directly in the command you submit to coiled batch run, you need to escape the $ as \$. Otherwise $COILED_BATCH_TASK_INPUT would be interpolated by your local shell (where it probably isn’t set), so we’d run process.py with an empty argument rather than process.py $COILED_BATCH_TASK_INPUT as intended.
Coordination#
Each task within a batch job has the following environment variables automatically set:
COILED_BATCH_TASK_ID: ID for the current running task. For example, “0”, “1”, “2”, etc.
COILED_BATCH_LOCAL_ADDRESS: IP address for the current running task.
COILED_BATCH_SCHEDULER_ADDRESS: IP address for the head node.
COILED_BATCH_PROCESS_TYPE: Either “scheduler” if running on the head node or “worker” otherwise.
COILED_BATCH_READY_WORKERS: Comma-delimited list of IP addresses for VMs that are ready for work.
These are often used for handling coordination among tasks. Depending on your specific workload, you may or may not need to use these.
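For example, a task could use these variables to decide whether it is the head node and where to find its peers. The sketch below only prints what it finds; the actual coordination mechanism (sockets, MPI, a shared queue, etc.) is up to you.
import os

task_id = os.environ["COILED_BATCH_TASK_ID"]
role = os.environ.get("COILED_BATCH_PROCESS_TYPE", "worker")
head = os.environ.get("COILED_BATCH_SCHEDULER_ADDRESS")
ready = [ip for ip in os.environ.get("COILED_BATCH_READY_WORKERS", "").split(",") if ip]

if role == "scheduler":
    # Head node: hand out work, aggregate results, etc.
    print(f"Task {task_id} is the head node; {len(ready)} workers are ready")
else:
    # Worker node: connect back to the head node at `head`
    print(f"Task {task_id} at {os.environ['COILED_BATCH_LOCAL_ADDRESS']} reporting to {head}")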
Software#
Like other Coiled APIs, Coiled Batch can use
package sync to automatically sync your local Python packages to the cloud VMs,
manually defined Coiled software environments,
or Docker container images.
Unlike other Coiled APIs that depend on dask, if you use a Docker container with Coiled Batch, there’s no need for that container to have dask or distributed.
For instance, see this example using a pre-built public GDAL container to run GDAL,
or this example using a pre-built public uv container to install Python dependencies on the fly.
Data Access#
For short jobs, you can forward an STS token from your local AWS credentials using --forward-aws-credentials.
Because Batch Jobs are a fire-and-forget API, Coiled does not refresh the STS token we forward, so it may expire while your job is still running.
(The expiration depends on details about how your AWS account and your local AWS credentials are configured.)
For longer jobs that require AWS credentials, we suggest configuring the Instance Profile with the permissions your code needs.
For Google Cloud Compute, refer to the data access guide for instructions to use a service account.
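From the Python API, the corresponding argument to coiled.batch.run is forward_aws_credentials; a brief sketch (command string form is an assumption), with the same caveat that the forwarded token is not refreshed:
import coiled.batch

coiled.batch.run(
    "python process.py",
    forward_aws_credentials=True,  # forwards a short-lived STS token; keep the job short
    region="us-east-2",
)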
Timeouts#
For timeouts on individual tasks, you can use the timeout utility (included in most Linux containers).
Instead of coiled batch run python my_script.py, you’d run coiled batch run timeout 600 python my_script.py so that any task running for more than 600 seconds times out and exits with a non-zero exit code.
For timeouts on the entire batch job, you can specify --job-timeout
like so:
$ coiled batch run --job-timeout "1 hour" python my_script.py
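The Python API exposes the same setting as the job_timeout argument to coiled.batch.run; this sketch assumes it accepts the same human-readable strings as the CLI flag:
import coiled.batch

coiled.batch.run("python my_script.py", job_timeout="1 hour")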
Examples#
Here are examples of how you can use Batch Jobs:
API#
Coiled Batch jobs can be submitted and monitored using either the coiled batch
CLI
or coiled.batch
Python API.
CLI#
coiled batch run#
Submit a batch job to run on Coiled.
Batch Jobs is currently an experimental feature.
coiled batch run [OPTIONS] [COMMAND]...
Options
- --name <name>#
Name to use for Coiled cluster.
- --workspace <workspace>#
Coiled workspace (uses default workspace if not specified).
- --software <software>#
Existing Coiled software environment (Coiled will sync local Python software environment if neither software nor container is specified).
- --container <container>#
Docker container in which to run the batch job tasks; this does not need to have Dask (or even Python), only what your task needs in order to run.
- --ignore-container-entrypoint <ignore_container_entrypoint>#
Ignore entrypoint for specified Docker container (like
docker run --entrypoint
); default is to use the entrypoint (if any) set on the image.
- -e, --env <env>#
Environment variables transmitted to run command environment. Format is
KEY=val
, multiple vars can be set with separate--env
for each.
- --secret-env <secret_env>#
Environment variables transmitted to run command environment. Format is
KEY=val
, multiple vars can be set with separate--secret-env
for each. Unlike environment variables specified with--env
, these are only stored in our database temporarily.
- --env-file <env_file>#
Path to .env file; all variables set in the file will be transmitted to run command environment.
- --secret-env-file <secret_env_file>#
Path to .env file; all variables set in the file will be transmitted to run command environment. These environment variables will only be stored in our database temporarily.
- -t, --tag <tag>#
Tags. Format is
KEY=val
, multiple vars can be set with separate--tag
for each.
- --vm-type <vm_type>#
VM type to use. Specify multiple times to provide multiple options.
- --scheduler-vm-type <scheduler_vm_type>#
VM type to use specifically for scheduler. Default is to use small VM if scheduler is not running tasks, or use same VM type(s) for all nodes if scheduler node is running tasks.
- --arm#
Use ARM VM type.
- --cpu <cpu>#
Number of cores per VM.
- --memory <memory>#
Memory per VM.
- --gpu#
Have a GPU available.
- --region <region>#
The cloud provider region in which to run the job.
- --spot-policy <spot_policy>#
Default is on-demand; allows using spot VMs, or spot VMs as available with on-demand as a fallback.
- Options:
on-demand | spot | spot_with_fallback
- --allow-cross-zone, --no-cross-zone#
Allow workers to be placed in different availability zones.
- --disk-size <disk_size>#
Use larger-than-default disk on VM, specified in GiB.
- --allow-ssh-from <allow_ssh_from>#
IP address or CIDR from which connections to port 22 (SSH) are open; can also be specified as ‘everyone’ (0.0.0.0/0) or ‘me’ (automatically determines public IP detected for your local client).
- --map-over-values <map_over_values>#
A list of values such that for each value, a task will be run with that value as the input. If you specify
--map-over-values 'first,second,third'
, then batch will run three tasks with inputs ‘first’, ‘second’, and ‘third’. By default the input is passed to the task in theCOILED_BATCH_TASK_INPUT
environment variable, so one task will getCOILED_BATCH_TASK_INPUT=first
and so on.
- --map-over-file <map_over_file>#
Like
--map-over-values
, but instead of specifying the string of values directly, you specify the path to a file with the values. Note that by default, each line in the file is treated as an individual value; this can be controlled with the--map-over-delimiter
option.
- --map-over-input-var <map_over_input_var>#
The value from --map-over-values or --map-over-file is exposed to the task as an environment variable. By default, the environment variable is
COILED_BATCH_TASK_INPUT
, but you can specify a different name for the environment variable using this option.
- --map-over-delimiter <map_over_delimiter>#
Delimiter for splitting the string from
--map-over-values
or the file contents from--map-over-file
into individual values. By default this is ‘,’ for--map-over-values
and newline for--map-over-file
.
- --ntasks, --n-tasks <ntasks>#
Number of tasks to run. Tasks will have IDs from 0 to n-1; the
COILED_ARRAY_TASK_ID
environment variable for each task is set to the ID of the task.
- --task-on-scheduler, --no-task-on-scheduler#
Run task with lowest job ID on scheduler node.
- --array <array>#
Specify array of tasks to run with specific IDs (instead of using
--ntasks
to array from 0 to n-1). You can specify list of IDs, a range, or a list with IDs and ranges. For example,--array 2,4-6,8-10
.
- --scheduler-task-array <scheduler_task_array>#
Which tasks in array to run on the scheduler node. In most cases you’ll probably want to use
--task-on-scheduler
instead to run the task with the lowest ID on the scheduler node.
- -N, --max-workers <max_workers>#
Maximum number of worker nodes. By default, there will be as many worker nodes as tasks, up to 1000; use -1 to explicitly request no limit.
- --wait-for-ready-cluster#
Only assign tasks once full cluster is ready.
- --forward-aws-credentials#
Forward STS token from local AWS credentials.
- --package-sync-strict#
Require exact package version matches when using package sync.
- --package-sync-conda-extras <package_sync_conda_extras>#
A list of conda package names (available on conda-forge) to include in the environment that are not in your local environment.
- --host-setup-script <host_setup_script>#
Path to local script which will be run on each VM prior to running any tasks.
- --job-timeout <job_timeout>#
Timeout for batch job; timer starts when the job starts running (after VMs have been provisioned). For example, you can specify ‘30 minutes’ or ‘1 hour’. Default is no timeout.
Arguments
- COMMAND#
Optional argument(s)
coiled batch status#
Check the status of a Coiled Batch job.
coiled batch status [OPTIONS] [CLUSTER]
Options
- --workspace <workspace>#
Coiled workspace (uses default workspace if not specified).
- --format <format>#
- Options:
json | table
- --sort <sort>#
Arguments
- CLUSTER#
Optional argument
coiled batch list#
List Coiled Batch jobs in a workspace.
coiled batch list [OPTIONS]
Options
- --workspace <workspace>#
Coiled workspace (uses default workspace if not specified).
- --format <format>#
- Options:
json | table
- --limit <limit>#
coiled batch wait#
Monitor the progress of a Coiled Batch job.
coiled batch wait [OPTIONS] [CLUSTER]
Options
- --workspace <workspace>#
Coiled workspace (uses default workspace if not specified).
Arguments
- CLUSTER#
Optional argument
coiled logs#
coiled logs [OPTIONS] CLUSTER
Options
- --account, --workspace <account>#
Coiled workspace (uses default workspace if not specified). Note: --account is deprecated, please use --workspace instead.
- --scheduler#
Get scheduler logs
- --workers <workers>#
Get worker logs (‘any’, ‘all’, or comma-delimited list of names, states, or internal IP addresses)
- --follow#
Passed directly to aws logs tail, see aws cli docs for details.
- --filter <filter>#
Passed directly to aws logs tail, see aws cli docs for details.
- --since <since>#
For follow, uses aws logs tail default (10m), otherwise defaults to start time of cluster.
- --format <format>#
Passed directly to aws logs tail, see aws cli docs for details.
- --profile <profile>#
Passed directly to aws logs tail, see aws cli docs for details.
Arguments
- CLUSTER#
Required argument
Python#
- coiled.batch.run(command, *, name=None, workspace=None, software=None, container=None, env=None, secret_env=None, tag=None, vm_type=None, scheduler_vm_type=None, arm=False, cpu=None, memory=None, gpu=False, region=None, spot_policy=None, allow_cross_zone=None, disk_size=None, allow_ssh_from=None, ntasks=None, task_on_scheduler=None, array=None, scheduler_task_array=None, map_over_values=None, map_over_file=None, map_over_input_var=None, map_over_task_var_dicts=None, map_over_delimiter=None, max_workers=None, wait_for_ready_cluster=None, forward_aws_credentials=None, package_sync_strict=False, package_sync_conda_extras=None, host_setup_script=None, ignore_container_entrypoint=None, job_timeout=None, logger=None)[source]
Submit a batch job to run on Coiled.
See coiled batch run --help for documentation.
- Return type: dict
Additional Parameters#
- map_over_task_var_dicts takes a list of dictionaries, so you can specify multiple environment variables for each task. For example, [{"FOO": 1, "BAR": 2}, {"FOO": 3, "BAR": 4}] will pass FOO=1 BAR=2 to one task and FOO=3 BAR=4 to another.
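As a short usage sketch (passing the command as a single string is an assumption), run() submits the job and returns a dict describing it:
import coiled.batch

job = coiled.batch.run(
    "python my_script.py",
    map_over_task_var_dicts=[{"FOO": 1, "BAR": 2}, {"FOO": 3, "BAR": 4}],  # two tasks
    memory="8 GiB",
)
print(job)  # dict with details about the submitted batch job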
- coiled.batch.status(cluster='', workspace=None)[source]
Check the status of a Coiled Batch job.
See coiled batch status --help for documentation.
- Return type: list[dict]
- coiled.batch.list_jobs(workspace=None, limit=10)[source]
List Coiled Batch jobs in a workspace.
See coiled batch list --help for documentation.
- Return type: list[dict]
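As a rough sketch of monitoring from Python, you can list recent jobs and check a specific one; the structure of the returned dicts may vary by Coiled version:
import coiled.batch

# Most recent batch jobs in the workspace (each entry is a dict)
for job in coiled.batch.list_jobs(limit=5):
    print(job)

# Status of a specific job, by cluster ID or name (example ID)
tasks = coiled.batch.status(cluster="659567")
print(tasks)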