Geospatial Workloads with Coiled

Coiled makes it easy to process large volumes of data in the cloud scalably and efficiently with tools like Xarray, Dask, Zarr, GeoTIFF, and others. For example, in the video below we process 250 TiB of simulation data from NOAA for $25.

 

There are many ways to use Coiled for Geospatial analysis. This page collects a couple common patterns and points you to useful documentation and examples. This page may be helpful for you if

You want to…

  • Scale Xarray to TiB datasets

  • Churn through thousands of files from observational data

  • Rechunk and regrid massive datasets

You’re pained by…

  • Running out of memory

  • Confusing and uncontrollable costs

  • Managing cloud VMs or Kubernetes clusters is annoying

People typically use Coiled with geospatial workloads in three ways:

import xarray as xr
import coiled

cluster = coiled.Cluster(
    n_workers=500
)

client = cluster.get_client()

ds = xr.open_zarr("s3://mybucket/all-my-data.zarr")

import coiled

files = s3.ls("s3://my-bucket/*.nc")  # List all files

@coiled.function(region="us-west-2")  # Create function to process each file.
def process(file):
    ...
    return result

results = process.map(files)          # Process all files
plot(results)                         # Analyze results locally

Application: Xarray at scale

If you use Xarray today then maybe you already know Dask, it’s what lets Xarray operate on larger datasets than fit in memory. Maybe you also know that Dask can also run on a distributed cluster to give you more computational power. Maybe you’ve even tried to set this up yourself. If so, you know that it’s hard to do well.

Often we find that Xarray users run on single large VMs in the cloud.

Unfortunately, this approach is:

  • Expensive: Big machines are costly, and people leave them on for a long time.

  • Doesn’t scale: You can get a lot done with a big machine, but geospatial datasets today just keep getting bigger.

Fortunately, using Xarray with Dask clusters on Coiled is really easy and gives you as much computing power and memory as you might want. Once you’ve set up Coiled you can ask for a cluster from within Python:

import coiled

cluster = coiled.Cluster(
    n_workers=100,                     # Ask for as many workers as you want
    worker_memory="16 GiB",            # Ask for any kind of worker
    region="us-west-2",                # Run where your data is
    spot_policy="spot_with_fallback",  # Cost saving measure
    arm=True,                          # Cost saving measure
)
client = cluster.get_client()

Then all of your Xarray code works as before, assuming that you’re targeting some cloud data store like S3 or GCS.

import xarray as xr

ds = xr.open_zarr("s3://mybucket/all-data.zarr")

For more fully worked examples, see Example: National Water Model or Example: 250 TiB for $25

Application: Churn through cloud files

Satellites and simulations pump out a lot of data today. Fortunately cloud storage and VMs have the scale to easily handle this load. Also fortunately, for simple workloads where you just want to do your processing file-by-file, it’s really easy for anyone to do.

After you’ve connected Coiled to your cloud account (takes about five minutes) you can define a function to process one S3 file of data, and get a list of all your files. If you were to run this locally it would be slow and costly (cloud egress costs are expensive), but it’s easy to scale this out with Coiled serverless functions.

import s3fs



def process(filename):
    # download data
    # process data locally
    # return or upload result

s3 = s3fs.S3FileSystem()
filenames = s3.ls("s3://my-bucket/all-my-files.*.nc")

results = []
for filename in filenames:
    result = process(filename)
    results.append(result)
import s3fs
import coiled

@coiled.function(region=...)
def process(filename):
    # download data
    # process data locally
    # return or upload result

s3 = s3fs.S3FileSystem()
filenames = s3.ls("s3://my-bucket/all-my-files.*.nc")

results = list(process.map(filenames))



You can annotate your Python functions to run on cloud machines. Coiled will handle provisioning those machines, installing all of the software you need on them, scaling out to many machines if you have many files, and then finally shutting them down when you’re done.

For more fully worked examples, see Processing Terabyte-Scale NASA Cloud Datasets with Coiled

Application: Regularly Scheduled Jobs

Sometimes you want to process new data as it comes in every day/hour/minute. Fortunately it is easy to combine Coiled with workflow managers like Prefect, Dagster, or Airflow to perform this orchestration.

For a fully worked example, see Example: Scheduling Satellite Processing with Prefect.

Cost Optimization and Controls

With Coiled you’re going to be able to process a lot of data very cheaply (processing a terabyte typically costs around $0.10 if you’re operating at peak efficiency). It’s also easy to only pay for cloud resources while you’re using them with features like adaptive scaling and idle shutdowns by default. Still, it’s good to set bounds on how much you and your teammates can spend.

 

With Coiled it’s easy to add members of your team to your account, track their costs, and limit how much they can spend.

Best Practices and Common Issues

Run computations close to your data

Cloud computing can be surprisingly cheap when done well. When processing large amounts of cloud-hosted data, it’s important to have your compute VMs in the same cloud region where your data is hosted. This avoids expensive data transfer costs that scale with the size of your data.

Avoid data transfer costs by using the region= option to provision your cloud VMs in the same region as the data you’re processing.

# Create Dask cluster in `us-west-2`
import coiled
cluster = coiled.Cluster(region="us-west-2", ...)
client = cluster.get_client()

# Load dataset that lives in `us-west-2`
import xarray as xr
ds = xr.open_zarr("s3://mybucket/all-data.zarr")
Third-party platform authentication

Commonly used platforms like NASA Earthdata, etc. have their own authentication systems for accessing assets. When using these platforms on Coiled, cloud VMs also need to be authenticated. Most platforms support authentication through environment variables (e.g. Earthaccess uses EARTHDATA_USERNAME / EARTHDATA_PASSWORD). We recommend using the environ= keyword in @coiled.function or coiled.Cluster to securely pass authentication information to Coiled cloud VMs.

import coiled
import earthaccess

@coiled.function(environ={
    "EARTHDATA_USERNAME": <your-username>,
    "EARTHDATA_PASSWORD": <your-password>,
})
def process(file):
    data = earthaccess.open(file)
    ...
    return result