Batch Jobs#
Run your jobs on the cloud in parallel
Coiled Batch jobs are a lightweight API that makes it easy to run your code on any cloud hardware and scale out parallel workflows. This is useful when you want to run:
- In a region close to your cloud data
- On specific hardware (e.g. a GPU or a bigger machine)
- On many machines in parallel
- Any code (it doesn't need to be Python)
- With a lightweight, easy-to-use interface for the cloud
To run a batch job, add special #COILED comments to your script to specify the cloud resources you want:
# my_script.sh
#COILED memory 32GB
#COILED container ubuntu:latest
#COILED ntasks 10
echo Hello from $COILED_BATCH_TASK_ID
and then launch your script with coiled batch run:
$ coiled batch run my_script.sh
Under the hood Coiled will:
- Inspect your script
- Spin up appropriate machines as defined by the #COILED comments
- Download your software onto them
- Run your script
- Shut down the machines
Your script can do anything and you can run it on any cloud hardware. Coiled batch jobs are simple, but also powerful. Let’s explore a few more examples.
Examples#
Here is the example from above, repeated for completeness:
#COILED memory 32GB
#COILED container ubuntu:latest
#COILED ntasks 10
echo Hello from $COILED_BATCH_TASK_ID
Here, ten cloud VMs with 32 GB of memory are spun up and each runs its own echo command. COILED_BATCH_TASK_ID is a unique identifier for each machine, taking the values “0”, “1”, “2”, …, “9” in this case.
coiled batch run works directly with Python scripts:
#COILED n-tasks 10
#COILED memory 8 GiB
#COILED region us-east-2
import os
print(f"Hello from {os.environ['COILED_BATCH_TASK_ID']}")
Here we also drop the container directive, and rely on Coiled’s environment synchronization to copy all of our Python libraries to the remote machines automatically.
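Launching works the same way as before; assuming the script above is saved as my_script.py (the filename is just an assumption for illustration), you would run:
$ coiled batch run my_script.py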
A common pattern is to list many files, get the i-th file, and run some command on that file.
#COILED n-tasks 1000
#COILED memory 8 GiB
#COILED region us-east-2
import os
import s3fs
s3 = s3fs.S3FileSystem()
# Get the i'th file to process
task_id = int(os.environ["COILED_BATCH_TASK_ID"])
filenames = s3.ls("my-bucket")
filename = filenames[task_id]
# Process that file
def process(filename):
    result = ...
    return result
result = process(filename)
# Store result
s3.put(...) # put result back in S3 somewhere
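To make this pattern concrete, here is one hypothetical way the elided pieces above might be filled in; the bucket names, the CSV summary step, and the output prefix are illustrative assumptions rather than part of the original example.
import os
import pandas as pd
import s3fs

s3 = s3fs.S3FileSystem()

# Pick this task's file, as in the example above
task_id = int(os.environ["COILED_BATCH_TASK_ID"])
filename = s3.ls("my-bucket")[task_id]  # "my-bucket" is a placeholder bucket name

def process(filename):
    # Hypothetical work: read a CSV from S3 and compute a small summary
    df = pd.read_csv(s3.open(filename))
    out = f"summary-{task_id}.csv"
    df.describe().to_csv(out)
    return out

result = process(filename)

# Upload the result to a placeholder output location
s3.put(result, f"my-output-bucket/summaries/summary-{task_id}.csv")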
We can ask for any hardware on any cloud, including a collection of GPU machines.
#COILED n-tasks 100
#COILED vm-type g5.xlarge
#COILED region us-east-2
import torch
# Load model
model = ...
model = model.to("cuda")
# Train model
for epoch in range(50):
    model.train()
    ...
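As with the other examples, you would launch this with coiled batch run (the filename train.py is assumed for illustration):
$ coiled batch run train.py
If you don't want to pin a specific instance type, the --gpu flag described in the API section below asks for any GPU-equipped VM instead.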
Configuration#
Coiled Batch options for specifying cloud hardware, software, number of tasks, etc., can either be added as #COILED comments in your script:
# my_script.py
#COILED memory 64GB
#COILED region us-west-2
...
or passed as command line arguments directly to coiled batch run:
$ coiled batch run --memory 64GB --region us-west-2 my_script.py
See the API docs for the full list of available options.
Monitoring and Logs#
Batch tasks run remotely on cloud VMs in the background. To monitor the status of a batch job, use the coiled batch status command, which shows information about task progress, start / stop times, cloud costs, and more.
$ coiled batch status
┏━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ ID ┃ State ┃ Tasks Done ┃ Submitted ┃ Finished ┃ Approx Cloud Cost ┃ Command ┃
┡━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 659567 │ done │ 10 / 10 │ 2024-11-18 17:25:08 │ 2024-11-18 17:26:07 │ $0.01 │ process.py │
│ 658315 │ done │ 10 / 10 │ 2024-11-17 16:24:27 │ 2024-11-17 16:25:20 │ $0.01 │ train.sh │
│ 658292 │ done │ 20 / 20 │ 2024-11-17 15:28:44 │ 2024-11-17 15:29:39 │ $0.02 │ train.sh │
│ 658288 │ pending │ 0 / 20 │ 2024-11-17 15:24:32 │ │ $0.02 │ train.sh │
│ 657615 │ pending │ 0 / 10 │ 2024-11-16 20:32:46 │ │ $0.01 │ process.py │
│ 656571 │ done │ 10 / 10 │ 2024-11-15 14:32:35 │ 2024-11-15 14:33:35 │ $0.01 │ process.py │
│ 655848 │ done │ 20 / 20 │ 2024-11-14 23:56:06 │ 2024-11-14 23:58:01 │ $0.01 │ echo 'Hello, world' │
└────────────┴─────────┴────────────┴─────────────────────┴─────────────────────┴───────────────────┴─────────────────────┘
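If you'd rather consume this from a script than read the table, the --format json option described in the API section below emits machine-readable output instead, for example:
$ coiled batch status --format json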
And while you can always look at jobs on the Coiled UI, we also provide a convenient coiled logs command to get the logs for a given job ID.
$ coiled logs 661309 | grep Hello
2024-11-19 15:59:46.131000 Hello from 0
2024-11-19 15:59:46.419000 Hello from 1
2024-11-19 15:59:46.743000 Hello from 2
2024-11-19 15:59:47.019000 Hello from 3
2024-11-19 15:59:47.324000 Hello from 4
2024-11-19 15:59:47.587000 Hello from 5
2024-11-19 15:59:47.862000 Hello from 6
2024-11-19 15:59:48.256000 Hello from 7
2024-11-19 15:59:48.315000 Hello from 8
2024-11-19 15:59:48.508000 Hello from 9
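If your jobs run on AWS, the --filter option (passed through to aws logs tail; see the options below) can do similar filtering on the server side, something like:
$ coiled logs --filter Hello 661309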
Coordination#
Each task within a batch job has the following environment variables automatically set:
- COILED_BATCH_TASK_ID: ID for the current running task. For example, “0”, “1”, “2”, etc.
- COILED_BATCH_LOCAL_ADDRESS: IP address for the current running task.
- COILED_BATCH_SCHEDULER_ADDRESS: IP address for the head node.
- COILED_BATCH_PROCESS_TYPE: Either “scheduler” if running on the head node or “worker” otherwise.
- COILED_BATCH_READY_WORKERS: Comma-delimited list of IP addresses for VMs that are ready for work.
These are often used for handling coordination among tasks. Depending on your specific workload, you may or may not need to use these.
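For example, here is a minimal sketch of a task reading these variables and branching on its role; the print statements are purely illustrative, and whether a task actually runs on the head node depends on options such as --task-on-scheduler described below.
import os

# Coordination variables Coiled sets for each task
task_id = os.environ["COILED_BATCH_TASK_ID"]
my_ip = os.environ["COILED_BATCH_LOCAL_ADDRESS"]
head_ip = os.environ["COILED_BATCH_SCHEDULER_ADDRESS"]
role = os.environ["COILED_BATCH_PROCESS_TYPE"]  # "scheduler" or "worker"

if role == "scheduler":
    # On the head node you might start a shared service (a queue, a results
    # collector, ...) that the other tasks connect to.
    print(f"Task {task_id} on head node {my_ip}")
else:
    # On worker nodes you might connect to whatever the head node is running.
    print(f"Task {task_id} on worker {my_ip}, head node at {head_ip}")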
API#
coiled batch run#
Submit a batch job to run on Coiled.
Batch Jobs is currently an experimental feature.
coiled batch run [OPTIONS] [COMMAND]...
Options
- --workspace <workspace>#
Coiled workspace (uses default workspace if not specified).
- --software <software>#
Existing Coiled software environment (Coiled will sync local Python software environment if neither software nor container is specified).
- --container <container>#
Docker container in which to run the batch job tasks; this does not need to have Dask (or even Python), only what your task needs in order to run.
- -e, --env <env>#
Environment variables transmitted to the run command's environment. Format is KEY=val; multiple vars can be set with a separate --env for each.
- --secret-env <secret_env>#
Environment variables transmitted to the run command's environment. Format is KEY=val; multiple vars can be set with a separate --secret-env for each. Unlike environment variables specified with --env, these are only stored in our database temporarily.
- -t, --tag <tag>#
Tags. Format is KEY=val; multiple tags can be set with a separate --tag for each.
- --vm-type <vm_type>#
VM type to use. Specify multiple times to provide multiple options.
- --arm#
Use ARM VM type.
- --cpu <cpu>#
Number of cores per VM.
- --memory <memory>#
Memory per VM.
- --gpu#
Have a GPU available.
- --region <region>#
The cloud provider region in which to run the job.
- --disk-size <disk_size>#
Use larger-than-default disk on VM, specified in GiB.
- --ntasks, --n-tasks <ntasks>#
Number of tasks to run. Tasks will have IDs from 0 to n-1; the COILED_ARRAY_TASK_ID environment variable for each task is set to the ID of the task.
- --task-on-scheduler#
Run task with lowest job ID on scheduler node.
- --array <array>#
Specify an array of tasks to run with specific IDs (instead of using --ntasks to run IDs 0 to n-1). You can specify a list of IDs, a range, or a list with IDs and ranges. For example, --array 2,4-6,8-10.
- --scheduler-task-array <scheduler_task_array>#
Which tasks in the array to run on the scheduler node. In most cases you'll probably want to use --task-on-scheduler instead to run the task with the lowest ID on the scheduler node.
- -N, --max-workers <max_workers>#
Maximum number of worker nodes (by default, there will be as many worker nodes as tasks).
- --wait-for-ready-cluster#
Only assign tasks once full cluster is ready.
- --forward-aws-credentials#
Forward STS token from local AWS credentials.
- --package-sync-strict#
Require exact package version matches when using package sync.
- --package-sync-conda-extras <package_sync_conda_extras>#
A list of conda package names (available on conda-forge) to include in the environment that are not in your local environment.
Arguments
- COMMAND#
Optional argument(s)
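For example, the GPU training job from earlier could be expressed entirely with command line flags instead of #COILED comments (train.py is an assumed filename):
$ coiled batch run --ntasks 100 --vm-type g5.xlarge --region us-east-2 train.py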
coiled batch status#
coiled batch status [OPTIONS] [CLUSTER]
Options
- --workspace <workspace>#
Coiled workspace (uses default workspace if not specified).
- --format <format>#
Options: json | table
- --limit <limit>#
Arguments
- CLUSTER#
Optional argument
coiled logs#
coiled logs [OPTIONS] CLUSTER
Options
- --account, --workspace <account>#
Coiled workspace (uses default workspace if not specified). Note: --account is deprecated, please use --workspace instead.
- --scheduler#
Get scheduler logs
- --workers <workers>#
Get worker logs (‘any’, ‘all’, or comma-delimited list of names, states, or internal IP addresses)
- --follow#
Passed directly to aws logs tail, see aws cli docs for details.
- --filter <filter>#
Passed directly to aws logs tail, see aws cli docs for details.
- --since <since>#
For follow, uses aws logs tail default (10m), otherwise defaults to start time of cluster.
- --format <format>#
Passed directly to aws logs tail, see aws cli docs for details.
- --profile <profile>#
Passed directly to aws logs tail, see aws cli docs for details.
Arguments
- CLUSTER#
Required argument