Geospatial Workloads with Coiled#
Coiled makes it easy to process large volumes of data in the cloud scalably and efficiently with tools like Xarray, Dask, Zarr, GeoTIFF, and others. For example, in the video below we process 250 TiB of simulation data from NOAA for $25.
There are many ways to use Coiled for geospatial analysis. This page collects a few common patterns and points you to useful documentation and examples. It may be helpful if:
You want to…
Scale Xarray to TiB datasets
Churn through thousands of files from observational data
Rechunk and regrid massive datasets
You’re pained by…
Running out of memory
Confusing and uncontrollable costs
Managing cloud VMs or Kubernetes clusters
People typically use Coiled with geospatial workloads in a few ways:
Xarray at scale for complex but intuitive data manipulation
Serverless functions to efficiently churn through thousands of cloud files
Regularly scheduled jobs to run processing tasks as new data arrives
Inviting and managing users to constrain costs while empowering a larger team
import xarray as xr
import coiled

cluster = coiled.Cluster(
    n_workers=500,
)
client = cluster.get_client()

ds = xr.open_zarr("s3://mybucket/all-my-data.zarr")
import coiled
import s3fs

s3 = s3fs.S3FileSystem()
files = s3.glob("s3://my-bucket/*.nc")  # List all files

@coiled.function(region="us-west-2")    # Create function to process each file
def process(file):
    ...
    return result

results = process.map(files)            # Process all files
plot(results)                           # Analyze results locally
Application: Xarray at scale#
If you use Xarray today then you may already know Dask; it's what lets Xarray operate on datasets larger than memory. You may also know that Dask can run on a distributed cluster to give you more computational power, and maybe you've even tried to set this up yourself. If so, you know that it's hard to do well.
Often we find that Xarray users run on single large VMs in the cloud.
Unfortunately, this approach:
Is expensive: Big machines are costly, and people leave them running for a long time.
Doesn't scale: You can get a lot done with one big machine, but geospatial datasets today just keep getting bigger.
Fortunately, using Xarray with Dask clusters on Coiled is really easy and gives you as much computing power and memory as you might want. Once you’ve set up Coiled you can ask for a cluster from within Python:
import coiled

cluster = coiled.Cluster(
    n_workers=100,                     # Ask for as many workers as you want
    worker_memory="16 GiB",            # Ask for any kind of worker
    region="us-west-2",                # Run where your data is
    spot_policy="spot_with_fallback",  # Cost saving measure
    arm=True,                          # Cost saving measure
)
client = cluster.get_client()
Then all of your Xarray code works as before, assuming that you’re targeting some cloud data store like S3 or GCS.
import xarray as xr
ds = xr.open_zarr("s3://mybucket/all-data.zarr")
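From there, computations run across the cluster's distributed memory. As a minimal sketch, assuming a hypothetical precipitation variable with a time dimension (substitute whatever your dataset actually contains):

# The variable name "precip" and the monthly resampling frequency are illustrative
monthly_mean = (
    ds["precip"]
    .resample(time="1MS")   # group into monthly bins
    .mean()                 # average within each month
    .compute()              # run on the Coiled cluster and return the result
)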
For more fully worked examples, see Example: National Water Model or Example: 250 TiB for $25
Application: Churn through cloud files#
Satellites and simulations produce a lot of data today. Fortunately, cloud storage and VMs have the scale to handle this load easily, and simple workloads where you process data file by file are easy for anyone to scale.
After you've connected Coiled to your cloud account (this takes about five minutes), you can define a function that processes one S3 file of data and get a list of all your files. Running this locally would be slow and costly (cloud egress is expensive), but it's easy to scale out with Coiled serverless functions.
import s3fs

def process(filename):
    # download data
    # process data locally
    # return or upload result
    ...

s3 = s3fs.S3FileSystem()
filenames = s3.glob("s3://my-bucket/all-my-files.*.nc")

results = []
for filename in filenames:
    result = process(filename)
    results.append(result)
import s3fs
import coiled

@coiled.function(region=...)
def process(filename):
    # download data
    # process data locally
    # return or upload result
    ...

s3 = s3fs.S3FileSystem()
filenames = s3.glob("s3://my-bucket/all-my-files.*.nc")

results = list(process.map(filenames))
You can annotate your Python functions to run on cloud machines. Coiled will handle provisioning those machines, installing all of the software you need on them, scaling out to many machines if you have many files, and then finally shutting them down when you’re done.
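Serverless functions accept the same kinds of hardware options as clusters, so each file gets processed on a VM sized for the job. A minimal sketch (the region and sizes below are illustrative):

import coiled

@coiled.function(
    region="us-west-2",                 # run next to your data (illustrative)
    memory="16 GiB",                    # pick the VM size you need
    spot_policy="spot_with_fallback",   # cost saving measure
    arm=True,                           # cost saving measure
)
def process(filename):
    ...                                 # process one file, as above
    return result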
For a more fully worked example, see Processing Terabyte-Scale NASA Cloud Datasets with Coiled.
Application: Regularly Scheduled Jobs#
Sometimes you want to process new data as it comes in every day/hour/minute. Fortunately it is easy to combine Coiled with workflow managers like Prefect, Dagster, or Airflow to perform this orchestration.
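As a minimal sketch, assuming Prefect: a flow can fan work out through a Coiled-backed function on a schedule. The file-listing helper, region, and interval below are placeholders, not part of any real API.

import coiled
from prefect import flow, task

@coiled.function(region="us-west-2")      # illustrative region
def process(filename):
    ...                                   # process one newly arrived file
    return filename

@task
def process_batch(filenames):
    return list(process.map(filenames))   # fan out across cloud VMs

@flow
def handle_new_data():
    filenames = list_new_files()          # placeholder: however you discover new files
    process_batch(filenames)

if __name__ == "__main__":
    # Check for new data every hour; Prefect also supports cron schedules
    handle_new_data.serve(name="process-new-data", interval=3600)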
For a fully worked example, see Example: Scheduling Satellite Processing with Prefect.
Cost Optimization and Controls#
With Coiled you can process a lot of data very cheaply (processing a terabyte typically costs around $0.10 when you're operating at peak efficiency). Features like adaptive scaling and idle shutdowns, which are on by default, also make it easy to pay for cloud resources only while you're using them.
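A minimal sketch of both features (the worker bounds and timeout below are illustrative):

import coiled

cluster = coiled.Cluster(
    n_workers=10,
    idle_timeout="20 minutes",           # shut the cluster down automatically when unused
)
cluster.adapt(minimum=10, maximum=200)   # grow and shrink with the workload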
Still, it's good to set bounds on how much you and your teammates can spend. With Coiled it's easy to add members of your team to your account, track their costs, and limit how much they can spend.
Best Practices and Common Issues#
Run computations close to your data
Cloud computing can be surprisingly cheap when done well. When processing large amounts of cloud-hosted data, it’s important to have your compute VMs in the same cloud region where your data is hosted. This avoids expensive data transfer costs that scale with the size of your data.
Avoid data transfer costs by using the region= option to provision your cloud VMs in the same region as the data you're processing.
# Create Dask cluster in `us-west-2`
import coiled
cluster = coiled.Cluster(region="us-west-2", ...)
client = cluster.get_client()
# Load dataset that lives in `us-west-2`
import xarray as xr
ds = xr.open_zarr("s3://mybucket/all-data.zarr")
Third-party platform authentication
Commonly used platforms like NASA Earthdata have their own authentication systems for accessing assets. When using these platforms on Coiled, cloud VMs also need to be authenticated. Most platforms support authentication through environment variables (e.g. earthaccess uses EARTHDATA_USERNAME / EARTHDATA_PASSWORD). We recommend using the environ= keyword in @coiled.function or coiled.Cluster to securely pass authentication information to Coiled cloud VMs.
import coiled
import earthaccess

@coiled.function(environ={
    "EARTHDATA_USERNAME": "<your-username>",
    "EARTHDATA_PASSWORD": "<your-password>",
})
def process(file):
    data = earthaccess.open(file)
    ...
    return result