Non-Python Jobs on the Cloud#

Running jobs on the cloud is common when you're processing cloud-hosted data, need hardware that isn't available locally, or want to run in parallel across many machines. However, it's surprisingly difficult to run even basic jobs using the APIs offered by cloud providers, like AWS Batch / GCP Cloud Run / Azure Batch.

This example uses Coiled Batch and GDAL, a popular C++ library for transforming geospatial data, to reproject thousands of Sentinel-2 satellite images hosted on AWS.

While this example focuses on a geospatial application, the same approach can be applied to any job using any CLI or script (for example, fine-tuning LLMs with HuggingFace’s accelerate, running simulations with Fortran codes like wrf, or heck, even mining Bitcoin).

We hope this example shows how to run your jobs on the cloud in a straightforward way with just a few lines of code.

Process script#

The processing script for our job looks like this:

#!/usr/bin/env bash

#COILED n-tasks 3111
#COILED max-workers 100
#COILED region us-west-2
#COILED memory 8 GiB
#COILED container ghcr.io/osgeo/gdal
#COILED secret-env AWS_ACCESS_KEY_ID
#COILED secret-env AWS_SECRET_ACCESS_KEY

# Install aws CLI
if [ ! "$(which aws)" ]; then
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip -qq awscliv2.zip
    ./aws/install
fi

# Download file to be processed
filename=$(aws s3 ls --no-sign-request --recursive  s3://sentinel-cogs/sentinel-s2-l2a-cogs/54/E/XR/ | \
           grep ".tif" | \
           awk '{print $4}' | \
           awk "NR==$(($COILED_ARRAY_TASK_ID + 1))")
aws s3 cp --no-sign-request s3://sentinel-cogs/$filename in.tif

# Reproject GeoTIFF
gdalwarp -t_srs EPSG:4326 in.tif out.tif

# Move result to processed bucket
aws s3 mv out.tif s3://oss-scratch-space/sentinel-reprojected/$filename

This script:

  1. Installs the aws CLI

  2. Downloads the GeoTIFF data file to be processed

  3. Uses the gdalwarp command to reproject the GeoTIFF data (a common operation in geospatial workflows)

  4. Moves the reprojected GeoTIFF to an output bucket
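
Before scaling this out, you can smoke-test a single task locally by setting the task ID yourself. This assumes the script is saved as reproject.sh and that the aws CLI and GDAL are installed locally; the final upload step also needs AWS credentials with write access to the destination bucket:

# Run just the first task (task IDs are 0-based): download one GeoTIFF,
# reproject it, and upload the result
COILED_ARRAY_TASK_ID=0 bash reproject.sh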

Because the data is hosted on AWS, we'll want to run this script in the same cloud region where it lives (us-west-2 in this case). There are also many files to process (3,111), and running the job serially on a laptop takes over 2 days, so we'll run on many cloud machines in parallel to finish processing in a reasonable amount of time.

Run on the cloud#

To run this script on the cloud in parallel using Coiled Batch, we've added special #COILED comments that specify the cloud hardware, software environment, number of times to run the script, and so on.

#COILED n-tasks 3111
#COILED max-workers 100
#COILED region us-west-2
#COILED memory 8 GiB
#COILED container ghcr.io/osgeo/gdal
#COILED secret-env AWS_ACCESS_KEY_ID
#COILED secret-env AWS_SECRET_ACCESS_KEY

In particular:

  • #COILED n-tasks 3111: This job will run 3,111 different tasks, one for each file being processed.

  • #COILED max-workers 100: Tasks are run in parallel across at most 100 cloud VMs. By default, Coiled launches as many VMs as there are tasks.

  • #COILED region us-west-2: Run VMs in the same cloud region as where the data is hosted to avoid data transfer costs.

  • #COILED memory 8 GiB: Run on relatively small VMs as the data processing here doesn’t require many computational resources.

  • #COILED container ghcr.io/osgeo/gdal: Run our script inside this Docker container. Using the official osgeo/gdal image ensures that GDAL is installed and set up properly (see the quick local check after this list).

  • #COILED secret-env AWS_ACCESS_KEY_ID / #COILED secret-env AWS_SECRET_ACCESS_KEY: Authenticate with AWS so we can move processed data to a private results bucket.
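
If you want to confirm the container image works before launching the job, a quick local check (assuming Docker is installed) is:

# Verify gdalwarp is available inside the same image the job will use
docker run --rm ghcr.io/osgeo/gdal gdalwarp --version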

Additionally, each task will have a COILED_ARRAY_TASK_ID environment variable set to “0”, “1”, “2”, … , “3110” that we can use to select the i-th file in the dataset to process.
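
For example, the listing-plus-awk pipeline from the script maps each task ID to one object key; a hypothetical run with task ID 2 prints the third .tif key (awk's NR is 1-based):

# With COILED_ARRAY_TASK_ID=2, print the key that task would process
COILED_ARRAY_TASK_ID=2
aws s3 ls --no-sign-request --recursive s3://sentinel-cogs/sentinel-s2-l2a-cogs/54/E/XR/ | \
    grep ".tif" | \
    awk '{print $4}' | \
    awk "NR==$(($COILED_ARRAY_TASK_ID + 1))"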

We can then launch our job using the coiled batch run command:

coiled batch run reproject.sh

This job takes ~5 minutes to run and costs ~$0.70 in cloud hardware.

Status of job VMs over time.
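
Once the job finishes, a quick count of the objects in the results bucket (the same bucket and prefix the script uploads to; listing it requires AWS credentials) confirms that all 3,111 files were processed:

# Count reprojected GeoTIFFs in the output bucket;
# the count should match the number of tasks
aws s3 ls --recursive s3://oss-scratch-space/sentinel-reprojected/ | grep -c ".tif"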

And that’s it.

Summary#

Running jobs on the cloud is both common and surprisingly hard with the APIs offered by cloud providers, like AWS Batch / GCP Cloud Run / Azure Batch.

This example showed how to use Coiled Batch to run a script on the cloud, in parallel across many machines. This was:

  • Easy: Minimal code changes and no cloud devops needed to run on the cloud

  • Flexible: Can run anything, not just Python code

  • Fast: ~600x speedup by running in-region, in parallel

  • Cheap: Cloud hardware costs ~$0.70 for the entire job

We hope this example shows that with just a few lines of code you can run your jobs on the cloud in a straightforward way. We also hope it provides a template you can copy and adapt for your own use case.
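
As a starting point for adapting this to your own workload, a minimal skeleton might look like the following (the task count, region, memory, container image, and script body are all placeholders to replace):

#!/usr/bin/env bash

#COILED n-tasks 10
#COILED max-workers 10
#COILED region us-east-1
#COILED memory 4 GiB
#COILED container ubuntu:24.04

# Each task gets its own COILED_ARRAY_TASK_ID (0, 1, ..., n-tasks - 1);
# use it to pick which input this task should process.
echo "Processing input number $COILED_ARRAY_TASK_ID"

Save it as, say, my_job.sh and launch it the same way with coiled batch run my_job.sh.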