All Posts

Choosing an AWS Batch alternative

05 August 2025

Read more ...

Long-running Lambda workloads: Challenges and alternatives

16 July 2025

Read more ...

uv and Coiled for easy Python scripts on the cloud

30 June 2025

James Bourbeau

Read more ...

How to run a Python script on EC2

25 June 2025

Read more ...

Coiled: A GPU-accelerated serverless alternative to AWS Lambda

11 June 2025

Read more ...

Python is annoying to run on EMR, but you have options

08 May 2025

May 7, 2025

Read more ...

Reducing Memory Pressure for Xarray + Dask Workloads

17 March 2025

Read more ...

Coiled 2024 in Review

02 January 2025

Read more ...

Faster Xarray Quantile Computations with Dask

17 December 2024

Read more ...

Improving GroupBy.map with Dask and Xarray

21 November 2024

Read more ...

SLURM-Style Job Arrays on the Cloud with Coiled

19 November 2024

Read more ...

Airflow, Dask, & Coiled: Adding Big Data Processing to Your Cloud Toolkit

12 November 2024

Read more ...

Large Scale Geospatial Benchmarks: First Pass

16 October 2024

We implement several large-scale geo benchmarks. Most break. Fun!

Read more ...

Scaling AI-Based Data Processing with Hugging Face + Dask

09 October 2024

Read more ...

Large Scale Geospatial Benchmarks

09 September 2024

Read more ...

DataFrames at Scale Comparison: TPC-H

14 May 2024

Read more ...

Dask vs. Spark

19 April 2024

Bar chart comparing the relative difference in TPC-H query runtime for Dask vs. PySpark when executed on a M1 MacBook Pro with 8 cores. Orange represents queries where Dask is faster and blue where PySpark is faster.

Read more ...

Easy Scalable Production ETL

08 April 2024

We show a lightweight scalable data pipeline that runs large Python jobs on a schedule on the cloud.

Scalable data pipeline example that runs regularly scheduled jobs on the cloud.

Read more ...

One Trillion Row Challenge

05 February 2024

Read more ...

Real-world Grocery Demand Forecasting

31 January 2024

Line graph of forecasted sales and actual sales over time.

Read more ...

Schedule Python Jobs with Prefect and Coiled

23 January 2024

Read more ...

One Billion Row Challenge (1BRC) in Python with Dask

16 January 2024

Read more ...

Xarray at Large Scale: A Beginner’s Guide

21 December 2023

Read more ...

Polars queries in the cloud with Coiled

17 November 2023

Code snippet of using the coiled.function decorator to run a query with Polars on a large VM in the cloud

Read more ...

Processing Terabyte-Scale NASA Cloud Datasets with Coiled

01 November 2023

We show how to run existing NASA data workflows on the cloud, in parallel, with minimal code changes using Coiled. We also discuss cost optimization.

Comparing cost and duration between running the same workflow locally on a laptop, running on AWS, and running with cost optimizations on AWS.

Read more ...

Run Jupyter Notebooks on a GPU on the Cloud

10 October 2023

Read more ...

Ten Cents Per Terabyte

06 October 2023

Read more ...

TPC-H Benchmarks for Query Optimization with Dask Expressions

05 October 2023

Dask-expr is an ongoing effort to add a logical query optimization layer to Dask DataFrames. We now have the first benchmark results to share that were run against the current DataFrame implementation.

Read more ...

Coiled observability wins: Chunksize

19 September 2023

Distributed computing is hard; distributed debugging is even harder. Dask tries to simplify this process as much as possible. Coiled adds observability features for your Dask clusters and processes them to help users understand their workflows better.

Read more ...

Parallel Serverless Functions at Scale

07 September 2023

Comparing code runtime between a laptop, single cloud VM, and multiple cloud VMs in parallel

Read more ...

Processing a 250 TB dataset with Coiled, Dask, and Xarray

05 September 2023

We processed 250TB of geospatial cloud data in twenty minutes on the cloud with Xarray, Dask, and Coiled. We do this to demonstrate scale and to think about costs.

County-level heat map of the continental US showing mean depth to soil saturation (in meters) in 2020.

Read more ...

Reduce training time for CPU intensive models with scikit-learn and Coiled Functions

01 September 2023

Read more ...

Fine Performance Metrics and Spans

23 August 2023

While it’s trivial to measure the end-to-end runtime of a Dask workload, the next logical step - breaking down this time to understand if it could be faster - has historically been a much more arduous task that required a lot of intuition and legwork, for novice and expert users alike. We wanted to change that.

Populated Fine Performance Metrics dashboard

Read more ...

Data-proximate Computing with Coiled Functions

10 August 2023

Coiled Functions make it easy to improve performance and reduce costs by moving your computations next to your cloud data.

Read more ...

Dask, Dagster, and Coiled for Production Analysis at OnlineApp

09 August 2023

We show a simple integration between Dagster and Dask+Coiled. We discuss how this made a common problem, processing a large set of files every month, really easy.

Conceptual diagram showing how to use Dagster with Coiled and Dask.

Read more ...

Process Hundreds of GB of Data with DuckDB in the Cloud

07 August 2023

Read more ...

High Level Query Optimization in Dask

04 August 2023

Dask DataFrame doesn’t currently optimize your code for you (like Spark or a SQL database would). This means that users waste a lot of computation. Let’s look at a common example which looks ok at first glance, but is actually pretty inefficient.

Read more ...

Easy Heavyweight Serverless Functions

01 August 2023

What is the easiest way to run Python code in the cloud, especially for compute jobs?

Read more ...

How to Train a Neural Network on a GPU in the Cloud with coiled functions

24 July 2023

Read more ...

Dask performance benchmarking put to the test: Fixing a pandas bottleneck

23 June 2023

Read more ...

Coiled notebooks

14 June 2023

We recently pushed out a new, experimental notebooks feature for easily launching Jupyter servers in the cloud from your local machine. We’re excited about Coiled notebooks because they:

Read more ...

Utilizing PyArrow to improve pandas and Dask workflows

05 June 2023

Read more ...

Distributed printing

18 May 2023

Dask makes it easy to print whether you’re running code locally on your laptop, or remotely on a cluster in the cloud.

Read more ...

Observability for Distributed Computing with Dask

16 May 2023

Read more ...

GIL monitoring in Dask

15 May 2023

Read more ...

Performance testing at Coiled

05 May 2023

At Coiled we develop Dask and automatically deploy it to large clusters of cloud workers (sometimes 1000+ EC2 instances at once!). In order to avoid surprises when we publish a new release, Dask needs to be covered by a comprehensive battery of tests — both for functionality and performance.

Read more ...

How well does Dask run on Graviton?

05 May 2023

Bar chart of AWS cost vs. processor type.

Read more ...

Upstream testing in Dask

18 April 2023

Dask has deep integrations with other libraries in the PyData ecosystem like NumPy, pandas, Zarr, PyArrow, and more. Part of providing a good experience for Dask users is making sure that Dask continues to work well with this community of libraries as they push out new releases. This post walks through how Dask maintainers proactively ensure Dask continuously works with its surrounding ecosystem.

Read more ...

Burstable vs non-burstable AWS instance types for data engineering workloads

04 April 2023

Read more ...

Shuffling large data at constant memory in Dask

15 March 2023

Read more ...

Just in time Python environments

23 February 2023

Read more ...

How many PEPs does it take to install a package?

17 January 2023

A few months ago we released package sync, a feature that takes your Python environment and replicates it in the cloud with zero effort.

Read more ...

Scaling Hyperparameter Optimization With XGBoost, Optuna, and Dask

06 January 2023

XGBoost is one of the most well-known libraries among data scientists, having become one of the top choices among Kaggle competitors. It is performant in a wide array of supervised machine learning problems, implements scalable training through the rabit library, and integrates with many big data processing tools, including Dask.

Read more ...

Handling Unexpected AWS IAM Changes

06 January 2023

The cloud is tricky! You might think the rules that determine which IAM permissions are required for which actions will continue to apply in the same way. You might think they’d apply the same way to different AWS accounts. Or that if these things aren’t true, at least AWS will let you know. (I did.) You’d be wrong!

Read more ...

AWS Cost Explorer Tips and Tricks

06 January 2023

Read more ...

Automated Data Pipelines On Dask With Coiled & Prefect

19 December 2022

Dask is widely used among data scientists and engineers proficient in Python for interacting with big data, doing statistical analysis, and developing machine learning models. Operationalizing this work has traditionally required lengthy code rewrites, which makes moving from development to production hard. This gap slows business progress and increases risk for data science and data engineering projects in an enterprise setting. The need to remove this bottleneck has prompted the emergence of production deployment solutions that allow code written by data scientists and engineers to be directly deployed to production, unlocking the power of continuous deployment for pure Python data science and engineering.

Read more ...

Reading CSV files into Dask DataFrames with read_csv

09 February 2022

Read more ...

Writing Parquet Files with Dask using to_parquet

01 January 2022

This blog post explains how to write Parquet files with Dask using the to_parquet method.

Read more ...

Why we passed on Kubernetes

01 January 2022

Kubernetes is great if you need to organize many always-on services and have in-house expertise, but can add an extra burden and abstraction when deploying a single bursty service like Dask, especially in a user environment with quickly changing needs.

Read more ...

Use Mambaforge to Conda Install PyData Stack on your Apple M1 Silicon Machine

01 January 2022

Running PyData libraries on an Apple M1 machine requires you to use an ARM64-tailored version of conda-forge. This article provides a step-by-step guide to setting that up on your machine using mambaforge.

Read more ...

Understanding Managed Dask (Dask as a Service)

01 January 2022

Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for Dask users, addressing:

Read more ...

Tackling unmanaged memory with Dask

01 January 2022

TL;DR: unmanaged memory is RAM that the Dask scheduler is not directly aware of. It can cause workers to run out of memory and computations to hang or crash.

Read more ...

Speed up a pandas query 10x with these 6 Dask DataFrame tricks

01 January 2022

Updated April 18th, 2024: For Dask versions >= 2024.3.0, Dask will perform a number of the optimizations discussed in this blog post automatically. See this demo for more details.

Read more ...

Spark to Dask: The Good, Bad, and Ugly of Moving from Spark to Dask

01 January 2022

Apache Spark has long been a popular tool for handling petabytes of data in analytics applications. It’s often used for big data and machine learning, and most organizations use it with cloud infrastructure to run models and build algorithms. Spark is no doubt a fast analytical tool that provides high-speed queries for large datasets, but recent client testimonials tell us that Dask is even faster. So, what should you keep in mind when moving from Spark to Dask?

Read more ...

Snowflake and Dask: a Python Connector for Faster Data Transfer

01 January 2022

Snowflake is a leading cloud data platform and SQL database. Many companies store their data in a Snowflake database.

Read more ...

Seven Stages of Open Software

01 January 2022

This post lays out the different stages of openness in Open Source Software (OSS) and the benefits and costs of each.

Read more ...

Setting a Dask DataFrame index

01 January 2022

This post demonstrates how to change a DataFrame index with set_index and explains when you should perform this operation.

Read more ...

Search at Grubhub and User Intent

01 January 2022

Alex Egg, Senior Data Scientist at Grubhub, joins Matt Rocklin and Hugo Bowne-Anderson to talk and code about how Dask and distributed compute are used throughout the user intent classification pipeline at Grubhub!

Read more ...

Scikit-learn in the Cloud with Coiled

01 January 2022

You can train scikit-learn models in parallel using the scikit-learn joblib interface. This allows scikit-learn to take full advantage of the multiple cores in your machine (or, spoiler alert, on your cluster) and speed up training.

Read more ...

Scale your data science workflows with Python and Dask

01 January 2022

Data scientists are increasingly using Python and the Python ecosystem of tools for their analysis. Combined with the growing popularity of big data, this brings the challenge of scaling data science workflows. Dask is a library built for this exact purpose: making it easy to scale your Python code and serving as a toolbox for distributed computing!

Read more ...

Save Money with Spot

01 January 2022

The cloud is wonderful but expensive.

Read more ...

Repartitioning Dask DataFrames

01 January 2022

This article explains how to redistribute data among partitions in a Dask DataFrame with repartitioning…

Read more ...

Reducing memory usage in Dask workloads by 80%

01 January 2022

There’s a saying in emergency response: “slow is smooth, smooth is fast”.

Read more ...

Reduce memory usage with Dask dtypes

01 January 2022

Columns in Dask DataFrames are typed, which means they can only hold certain values (e.g. integer columns can’t hold string values). This post gives an overview of DataFrame datatypes (dtypes), explains how to set dtypes when reading data, and shows how to change column types.

Read more ...

PyArrow Strings in Dask DataFrames

01 January 2022

pandas 2.0 has been released! 🎉

Read more ...

Prioritizing Pragmatic Performance for Dask

01 January 2022

Many people say the following to me:

Read more ...

Perform a Spatial Join in Python

01 January 2022

This blog explains how to perform a spatial join in Python. Knowing how to perform a spatial join is an important asset in your data-processing toolkit: it enables you to join two datasets based on spatial predicates. For example, you can join a point-based dataset with a polygon-based dataset based on whether the points fall within the polygon.

Read more ...

Introducing the Dask Active Memory Manager

01 January 2022

Historically, out-of-memory errors and excessive memory requirements have frequently been a pain point for Dask users. Two of the main causes of memory-related headaches are data duplication and imbalance between workers.

Read more ...

How to Merge Dask DataFrames

01 January 2022

This post demonstrates how to merge Dask DataFrames and discusses important considerations when making large joins.

Read more ...

How to Convert a pandas Dataframe into a Dask Dataframe

01 January 2022

In this post, we will cover:

Read more ...

How Coiled sets memory limit for Dask workers

01 January 2022

While running workloads to test Dask reliability, we noticed that some workers were freezing or dying when the OS stepped in and started killing processes when the system ran out of memory.

Read more ...

Filtering Dask DataFrames with loc

01 January 2022

This post explains how to filter Dask DataFrames based on the DataFrame index and on column values using loc.

Read more ...

Enterprise Dask Support

01 January 2022

Along with the Cloud SaaS product, Coiled sells enterprise support for Dask. Mostly people buy this for these three things:

Read more ...

Easily Run Python Functions in Parallel

01 January 2022

When you search for how to run a Python function in parallel, one of the first things that comes up is the multiprocessing module. The documentation describes parallelism in terms of processes versus threads and mentions it can side-step the infamous Python GIL (Global Interpreter Lock).

Read more ...

Dask on GCP

01 January 2022

Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on Google Cloud, addressing:

Read more ...

Dask on Azure

01 January 2022

Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on Azure, addressing:

Read more ...

Dask on AWS

01 January 2022

Dask is a flexible Python library for parallel and distributed computing. There are a number of ways you can create Dask clusters, each with their own benefits. In this article, we explore how Coiled provides a managed cloud infrastructure solution for deploying Dask on AWS, addressing:

Read more ...

Dask for Parallel Python

01 January 2022

Dask is a general purpose library for parallel computing. Dask can be used on its own to parallelize Python code, or with integrations to other popular libraries to scale out common workflows.

Read more ...

Dask and the PyData Stack

01 January 2022

The PyData stack for scientific computing in Python is an interoperable collection of tools for data analysis, visualization, statistical inference, machine learning, interactive computing and more that is used across all types of industries and academic research. Dask, the open source package for scalable data science that was developed to meet the needs of modern data professionals, was born from the PyData community and is considered foundational in this computational stack. This post describes a schematic for thinking about the PyData stack, along with detailing the technical, cultural, and community forces that led to Dask becoming the go-to package for scalable analytics in Python, including its interoperability with packages such as NumPy and pandas, among many others.

Read more ...

Dask Read Parquet Files into DataFrames with read_parquet

01 January 2022

This blog post explains how to read Parquet files into Dask DataFrames. Parquet is a columnar, binary file format that has multiple advantages when compared to a row-based file format like CSV. Luckily Dask makes it easy to read Parquet files into Dask DataFrames with read_parquet.

Read more ...

Creating Disk Partitioned Lakes with Dask using partition_on

01 January 2022

This post explains how to create disk-partitioned Parquet lakes using partition_on and how to read disk-partitioned lakes with read_parquet and filters. Disk partitioning can significantly improve performance when used correctly.

Read more ...

Cost Savings with Dask and Coiled

01 January 2022

Coiled can often save money for an organization running Dask. This article goes through the most common ways in which we see that happen.

Read more ...

Convert Large JSON to Parquet with Dask

01 January 2022

You can use Coiled, the cloud-based Dask platform, to easily convert large JSON data into a tabular DataFrame stored as Parquet in a cloud object-store. Start off by iterating with Dask locally first to build and test your pipeline, then transfer the same workflow to Coiled with minimal code changes. We demonstrate a JSON to Parquet conversion for a 75GB dataset that runs without downloading the dataset to your local machine.

Read more ...

Coiled, one year in

01 January 2022

Coiled, a Dask company, is about one year old. We’ll have a more official celebration in mid-February (official date of incorporation), but I wanted to take this opportunity to talk a little bit about the journey over the last year, where that has placed us today, and what I think comes next.

Read more ...

Coiled Cloud Architecture

01 January 2022

Running Dask in the cloud is easy.

Read more ...

Code Formatting Jupyter Notebooks with Black

01 January 2022

Black is an amazing Python code formatting tool that automatically corrects your code.

Read more ...

Cloud ETL automation with Github Actions and Coiled

01 January 2022

GitHub Actions let you launch automated jobs from your GitHub repository. Coiled lets you run your Python code in the cloud. Combining the two gives you lightweight workflow orchestration for heavy ETL (extract-transform-load) jobs that can run in the cloud without any complicated infrastructure provisioning or DevOps.

Read more ...

Better Shuffling in Dask: a Proof-of-Concept

01 January 2022

Updated May 16th, 2023: With release 2023.2.1, dask.dataframe introduced this shuffling method called P2P, making sorts, merges, and joins faster and using constant memory. Benchmarks show impressive improvements. See our blog post.

Read more ...

Accelerating Microstructural Analytics with Dask and Coiled

01 January 2022

In this article, we will discuss an interesting use case of Dask and Coiled: accelerating volumetric X-ray microstructural analytics using distributed and high-performance computing. This blog post is inspired by the article published on the Kitware Blog and the corresponding research.

Read more ...

Abalone Bio: Accelerating Antibody Discovery

01 January 2022

Read more ...

Pandas parallel apply and map with Dask DataFrame

22 November 2021

Read more ...

Converting a Dask DataFrame to a pandas DataFrame

01 October 2021

Read more ...