How Coiled Relates to …#

This page compares Coiled to many related technologies in the data infrastructure space. Some of these technologies compete with Coiled, some complement it. We try to give an honest accounting below:

Dask#

Coiled makes it easy to deploy Dask on the cloud. Coiled grew out of the Dask project.

However, a system that runs Dask well on the cloud also happens to run other things well too! Read on…

Pandas / Xarray / NumPy / XGBoost#

Dask provides parallel versions of these libraries, and Coiled makes it easy to deploy Dask on the cloud.

Together, Coiled and Dask make it easy to use these libraries on large volumes (100+ TiB) of cloud data. Here are some relevant examples:

Polars / DuckDB / PyTorch#

These libraries aren’t automatically parallelized by Dask, but Coiled can still give you machines to run them (or any other library). If you’re used to using Polars/DuckDB/PyTorch/… on your laptop, and now want to run it on VMs in the cloud, Coiled makes that easy. If you then want to run that same program thousands of times, Coiled makes that easy too.

Coiled will not automatically scale up a single Polars/DuckDB/PyTorch/… program or function (that’s hard), but often all you need is many machines running variants of the same code (like running some routine on each file, for thousands of files).

Here are some relevant examples:

Jupyter#

Coiled can spin up a machine in your cloud account to run Jupyter. This can be helpful if you want to interact with data in the cloud from a machine close to that data.

See the documentation on Coiled Notebooks for more details.

Ray#

Ray and Dask are both OSS frameworks designed to give Python developers the ability to create rich and sophisticated distributed applications. You can think of them as more modern successors to Hadoop/Spark, notable especially for the extra flexibility they provide.

Ray and Dask made different technical design decisions, and those decisions affect the kinds of applications for which each is optimal. Foremost among these is that Ray uses distributed scheduling while Dask uses a centralized scheduler. That is, when a new task needs to be placed on a machine, in Ray that decision is made by any worker, while in Dask it is made by a single all-knowing scheduler. This has positive and negative implications.

Robustness

  • Ray’s distributed scheduling is more robust without a single point of failure. This is important if you have calculations that might last weeks at a time.

  • Dask’s centralized scheduler could theoretically go down in the middle of a calculation. This would cause you to recompute that result.

  • Our take: in practice this doesn’t happen much. The average Dask calculation lasts minutes or hours, and the average cloud VM stays alive for months or years. High availability is often provided by the system supporting a tool like Dask or Spark, such as Kubernetes or Coiled, which provide the necessary fallback mechanisms.

Task Throughput

  • Ray is able to generate lots of tasks, most of which stay on the same machine.

  • Dask is limited to processing around 5,000 tasks per second. If each task processes, say, 1 GB, then this is the equivalent of around 5 TB/s, or roughly 18 PB/hour.

  • Our take: for applications like distributed training of deep learning models across hundreds of machines, Ray’s approach is nicer. However, 99% of the use cases we see don’t require this level of throughput. Additionally, Dask’s centralized scheduling gives a bit more thought to each task, which can have a tremendous impact on efficiency and enables more intelligent coordination of work.
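The throughput figure above is simple back-of-envelope arithmetic (using decimal units), which we can check directly:

```python
# Back-of-envelope check of the task-throughput numbers above
tasks_per_second = 5_000            # approximate Dask scheduler ceiling
bytes_per_task = 1e9                # say, 1 GB processed per task

tb_per_second = tasks_per_second * bytes_per_task / 1e12
pb_per_hour = tb_per_second * 3600 / 1000

print(f"{tb_per_second} TB/s, {pb_per_hour} PB/hour")   # 5.0 TB/s, 18.0 PB/hour
```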

Worker Coordination

  • Ray uses Redis to coordinate tasks between workers. No worker has full instant visibility into everything that’s going on, but they can query a Redis database to learn more.

  • Dask uses a centralized all-knowing scheduler to coordinate work. That scheduler knows the exact state of every worker, and so can make rapid and near-optimal decisions. This is especially useful to do things like minimize memory use, spread critical data around, reduce costs, and more.

  • Our take: Dask’s intelligent scheduling has proven invaluable for the applications we see in practice (data science / data engineering / most machine learning) but may get in the way of applications requiring high volumes of tasks, like training foundation models from scratch.
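As a small local illustration of that all-knowing scheduler (a sketch assuming `dask[distributed]` is installed), you can ask the scheduler for its full view of every worker:

```python
# Sketch: Dask's centralized scheduler knows the state of every worker.
# Assumes `pip install "dask[distributed]"`; runs entirely on one machine.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, processes=False)  # two in-process workers
client = Client(cluster)

info = client.scheduler_info()   # the scheduler's complete cluster snapshot
print(len(info["workers"]))      # one entry per worker: memory, tasks, address, ...

client.close()
cluster.close()
```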

Ray and Dask were both designed to be general purpose, but each with certain applications in mind. In Ray’s case that was reinforcement learning, which is a good proxy for distributed training of deep learning models. In contrast, Dask grew up in a general consulting environment, and so got used in lots of business-critical data science/data engineering workloads. Each made technical decisions that optimized for its environment.

If your workload looks more like something you’d submit to a machine learning conference like NeurIPS then Ray might be a better choice.

If your workload looks more like traditional data science/engineering/development, then Dask might be a better choice.

Anyscale#

Anyscale is the for-profit company behind Ray. It produces a platform to host Ray, much like how Coiled hosts Dask.

To be honest we don’t know much about the Anyscale platform. It’s not something that we come across much in the wild and so aren’t well positioned to judge or compare it.

Spark#

Spark is the de facto standard for large-scale open source SQL/dataframe processing. It’s a solid, tried-and-true tool. It’s also a little old and crufty (although that’s subjective).

Dask is less focused and battle-hardened on dataframes (although it outperforms Spark on TPC-H benchmarks, especially at large scale) and is instead more focused on …

  • Python developer experience

  • Things that aren’t dataframes

Dask is a bit lower level than Spark. While Spark provides a big dataframe and SQL interface, Dask gets combined with many other Python libraries to make parallel versions of them. This includes pandas (which then competes with Spark) but also many others (numpy, xgboost, prefect, dagster, xarray, …), each of which creates a Spark-like distributed application but with wildly different data and compute semantics.

Many people also use Dask to parallelize arbitrary for-loopy Python code that doesn’t fit into any “big data” abstraction.
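For instance, ordinary for-loopy code can be parallelized with `dask.delayed` (a standard Dask API; this sketch assumes `dask` is installed and uses a stand-in function for the real work):

```python
# Sketch: parallelizing plain for-loop Python with dask.delayed
import dask

@dask.delayed
def expensive_step(x):          # stand-in for some real per-item work
    return x + 1

# An ordinary for loop builds a task graph; nothing runs yet
results = [expensive_step(i) for i in range(10)]
total = dask.delayed(sum)(results)

# Executing the graph runs the steps in parallel
# (here with the local threaded scheduler)
print(total.compute(scheduler="threads"))
```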

Dask does more things than Spark.

So if you mostly want to do dataframe things, and perhaps prefer SQL/Spark syntax over Python, then go for Spark. It’s a safe choice.

If you like Python, or need to do things that aren’t just dataframes (ML, arrays, arbitrary code, complex task graphs) then Dask is likely a better bet.

Databricks#

Bluntly, Databricks is more mature and far more profitable than Coiled. If you have to make a quick decision on a data platform for a Fortune 50 company, Databricks or Snowflake is probably the way to go.

Databricks also does more stuff. Databricks offers a full suite of the following:

  1. Managed Spark

  2. Managed notebooks

  3. Storage with Delta Lake (and now Iceberg)

  4. Machine learning experiment tracking with MLflow

  5. Data Catalogs with Unity

If all of these things sound good to you, then Databricks is probably a good fit.

In contrast, Coiled does the following:

  1. Managed Python in the cloud

Coiled does less. We do it really well though. Coiled is often seen as a “Lightweight alternative to Databricks with a nicer developer experience”. Coiled makes it easy to turn on a bunch of machines, run your code on them, and then turn them off, all from anywhere you can run Python.

Where Databricks is a Swiss-army knife, Coiled is a scalpel.

Snowflake#

Snowflake does SQL really well.
Coiled+Dask are pretty bad at SQL.

Snowflake does everything else pretty poorly (sorry, Snowpark team).
Coiled+Dask do everything else pretty well.

They’re a good pairing, like burgers and beer.

AWS Batch / GCP Cloud Run / Azure Batch#

These systems make it possible to run any application (not just Python code) on many machines in the cloud. This is a simple yet common pattern with broad applicability.

The Coiled equivalent is coiled batch. Coiled batch runs a CLI command or script on many machines. Coiled batch is like AWS Batch / GCP Cloud Run / Azure Batch, but with a much nicer developer experience.

Coiled Batch

Coiled batch is easy to use. Save the following as hello.sh:

    #!/bin/bash

    #COILED ntasks 100
    #COILED memory 8 GiB
    #COILED container ubuntu:latest

    echo Hello from $COILED_BATCH_ARRAY_ID !

Then run it across many VMs with:

    coiled batch run hello.sh

AWS Batch

AWS Batch requires many steps and is hard.

AWS Lambda / GCP Cloud Functions / Azure Functions#

These tools let you run serverless Python functions in the cloud. They offer relatively fast cold-start times of around one second, which makes them appropriate for user-facing web endpoints, or for coordination tasks in other systems (like triggering some task whenever a new file arrives in object storage). They’re great for these tasks.

Coiled provides a serverless Python API, Coiled Functions, which differs in a few ways:

  • Coiled functions are easier to use

    No need to pre-register functions, build Docker images, etc. You just define a function, decorate it, and call it immediately.

    import coiled

    @coiled.function(memory="128 GiB")
    def myfunc(filename):
        ...

    result = myfunc(filename)
    
  • Coiled functions take longer to spin up

    Coiled uses raw VMs for execution, which means that cold start times are about a minute, rather than a second. This makes Coiled inappropriate for backing user-facing web endpoints that need fast response, and more appropriate for bulk execution.

  • Coiled functions are cheaper

    While AWS Lambda and friends only charge you for what you use, the machines are on and costing the cloud providers money regardless. To make up for this gap, they charge about a 5x surcharge on raw cloud costs. Coiled is far cheaper.

If you need rapid responses for small computations then AWS Lambda and friends are a great fit, although you might also consider Modal, which offers a nicer UX at some cost in security. If you’re doing large-scale bulk computation then Coiled is a much better fit, both for ease of use and for efficiency.

EC2#

EC2 machines are great! Coiled is built on EC2 (or GCP or Azure equivalents). We believe strongly in the simple efficacy of the Raw VM here. You can get any number of raw VMs anywhere in the world in about 30s, all under your control, and all for just pennies. It’s incredible.

Unfortunately, EC2 (and GCP and Azure equivalents) is not very intuitive for most data science/engineering professionals to quickly manipulate, and so they don’t get used as often as they should, and when they do get used they often get used inefficiently (say, by turning one on and leaving it on for the work-week).

Coiled is effectively a delightful wrapper around EC2, making it trivial to summon thousands of machines to do your bidding one minute and cast them aside the next, all within a line or two of Python code.

Dask-Gateway / Dask-Kubernetes / Dask-Cloudprovider#

Dask has many great open source ways to deploy itself in the cloud. These are free to use and entirely under your control.

Coiled is more mature than these solutions, and generally offers a better experience on the cloud in a number of ways:

  • Ease of use

  • Ease of setup and zero-maintenance

  • User controls and quotas

  • Cost optimization

  • Improved debugging and alerting

  • Security

  • … and much more

For more information, see Dask Deployment Options.

Airflow / Prefect / Dagster#

Workflow managers like Airflow, Prefect, and Dagster are great at triggering work at the right times. They’re useful if you want something to happen every hour, or whenever a new file arrives.

Coiled complements these tools by providing the cloud hardware at those times. Coiled lets you easily augment a workflow with cloud hardware, at optimal cost and with minimal maintenance.

To learn more, consider the following links: