DataFrames at Scale Comparison: TPC-H

Hendrik Makait, Sarah Johnson, Matthew Rocklin

2024-05-14

14 min read

We run benchmarks derived from the TPC-H benchmark suite on a variety of scales, hardware architectures, and dataframe projects, notably Apache Spark, Dask, DuckDB, and Polars. No project wins.

This post analyzes results within each project and between projects. To start, here is the raw data of the results:

There is too much data here to understand all at once, so this post breaks down results by project and by comparing different projects.

When to use DuckDB vs. Polars vs. Dask vs. Spark

Deciding which tool is right for you depends on many factors, some of which are objectively quantifiable (speed, scale), some of which are not (language preference, workload type, deployment pain).

In this section we’ll go project by project, explaining how each did.

First though, we’ll include an arbitrary table and some general subjective takes.

DuckDB and Dask are the most robust choices today, although Polars is developing rapidly.

Spark

Dask

DuckDB

Polars

Fast Locally

🤔

Fast on Cloud (1 TB)

Fast on Cloud (10 TB)

Scales Out

SQL

🤔

🤔

More than SQL

Sensible Defaults

Local + small? Choose DuckDB or Polars. For small data (10 GB or less) Polars and DuckDB (or pandas) are great. Conversely, Dask and Spark leave a lot of performance on the table in this configuration.

Local + big? Choose Dask or DuckDB. For larger data, Polars isn’t yet fully stable. Polars is improving quickly, although, and is the most dynamic project in this group. DuckDB is more reliable (so is Dask, but it’s slower). Spark fails here pretty easily. If you want to perform SQL queries on 100 GB of local data, DuckDB is the clear winner here.

Cloud at 1 TB or less? Choose Dask or DuckDB. For datasets that fit on a single cloud VM, Dask and DuckDB usually outperform Spark and Polars, respectively. Performance between the fast-local systems (Polars/DuckDB) and the distributed systems (Dask/Spark) converge a lot when we move to the cloud, largely due to S3 access bottlenecks (it doesn’t matter if you’re lightning fast if you’re waiting on S3 for data access). DuckDB does well across scales for the same hardware, but can’t scale beyond a single VM.

Cloud at large scale? Choose Dask or Spark. DuckDB can be as efficient as Dask or Spark on a single big VM, but if you want to process large volumes quickly, you eventually need a scalable system. Dask and Spark can outperform DuckDB and Polars on larger problems by adding more hardware. For large scale, Dask often outperforms Spark and scales well beyond DuckDB and Polars.

How does Spark do on TPC-H Benchmarks?

Spark is the de-facto standard technology in this space today, it is also the least performant in general. Spark’s days seem to be numbered.

Spark is able to complete queries at all scales, but is out-performed at large scale by Dask (by about 2x), and very out-performed at small scale by everyone (by about 10x+). Spark also performs poorly when given far less memory than the data it has to process, as was the case in the 10 TiB / 128 core case (which we included mostly so we could directly compare Spark against DuckDB, which is restricted to run on one machine).

Anecdotally, using Spark was also more painful than using the other systems. We spent more time configuring Spark than we spent on all other systems combined. Even then, we’re uncertain that we have it running optimally. Spark’s performance was highly sensitive to non-default settings (like explicitly setting an output number of shuffle partitions). Additionally we found that Spark was surprisingly inefficient. It had the highest CPU occupancy rates of any of the projects, while also some of the slowest performance.

Still, Spark’s history makes it a sane choice for any organization looking for a solid and well-known solution. We would recommend it mostly for inertia, and for its broad support among enterprise data software systems. We’re also curious about the various projects that rewrite Spark internals, like Photon from Databricks, and the open source Comet system, which were not evaluated here.

How does Dask do on TPC-H Benchmarks?

Disclaimer: We like Dask and know how to use it better than the other frameworks.

Locally, Dask is outperformed at small scale by DuckDB and Polars, but performs reliably across all scales. Dask performs universally better than Spark, and is more robust than Polars and DuckDB even if it’s slower at small scales with fast disk. Dask consistently outperforms other technologies at scale in the cloud, particularly as we increase the amount of hardware available.

Dask was the most robust system among the configurations tested (although again, there’s bias here because we’re better at Dask than at other systems).

Compared to Spark, Dask was easy to configure. Dask automatically inspected local hardware and configured itself appropriately. Additionally Dask’s out-of-core shuffling algorithms were stable even when data sizes were far larger than memory (compare against the 10 TiB 128 core case).

How does DuckDB do on TPC-H Benchmarks?

DuckDB performs commendably at all scales, both locally and in the cloud. It can be occasionally outperformed locally (by Polars, see comparisons below) and in the cloud (by Dask) but is a robust choice across all platforms.

However, two things stop us from recommending DuckDB universally (aside from self-interest):

  1. Scale Limitations: If you need to do fast queries on 10+TB-scale data, DuckDB’s inability to use multiple machines is a limitation. Dask and Spark can easily get 5x speed increases over DuckDB at a 10x hardware cost difference.

  2. SQL-only: Often people need more than just SQL (arguably this explain’s Python’s popularity today) and DuckDB is just SQL.

Using DuckDB was easy (it was pretty much plug-and-play). It is the only framework where we did not feel the urge to communicate with experts of the software, whereas we collaborated with Polars and Spark developers in producing these numbers (see the experiment details section below). It was however mildly inconvenient to add configuration to access data on S3, but documentation and LLMs sufficed to point us in the right direction. There are some small and straightforward things DuckDB can do to reduce friction for lazy users (and honestly, we’re all pretty lazy).

Kudos to the DuckDB team.

How does Polars do on TPC-H Benchmarks?

Edit: this post benchmarks the Polars Streaming system, which is intended for larger-than-memory workloads (very often these benchmarks operate in constrained space) and not the standard in-memory system. The streaming engine isn’t as fast or as mature as the in-memory system. It may also be entirely rewritten in the near-future by the Polars team.

Polars does fantastically well at modest scale (10 GB) locally, where it can use fast local disk and fit tables in memory. The only contender at this scale is DuckDB. However, Polars struggles in a few situations:

  1. Its cloud data readers are not yet very performant

  2. Its unable to handle large multi-table joins well, which are unfortunately common in this benchmark suite

Additionally, when Polars fails, it tends to fail hard, often resulting in machines grinding to a halt. We unreservedly recommend Polars for the 10 GB scale on local machines in memory, and are excited to see how Polars develops in the future (the project is moving quickly).

Using Polars was an interesting case. It both broke frequently, but was also rapidly fixed by the Polars maintainers. This gave the impression that it’s not yet ready (again, this is with the experimental streaming backend, not in-memory), but moving very fast. This was especially true when deploying on the cloud, where Polars has not historically seen as much use. Note that we have not run benchmarks with Polars on the cloud for scales above 100 GB because initial attempts showed poor scaling behavior.

Comparisons

In this section we consider subsets of the data to show head-to-head matchups between similar technologies. We compare Dask vs Spark, Polars vs DuckDB, and so on. Restricting the comparison space allows us to get a bit more insight into our data.

Dask vs. Spark

Today TPC-H data shows that Dask is usually faster than Spark, and more robustly performant across scales.

Additionally, users tend to like Dask for non-performance reasons like the following:

  1. Dask is easier to use and debug (for Python devs). Dask is well-integrated in the larger Python ecosystem and doesn’t require digging through JVM logs.

  2. Dask is easier to deploy. Deploying Spark is painful and most users rely on Databricks or EMR. You can deploy Dask easily with Kubernetes, HPC job managers, or on the cloud with Coiled.

  3. Dask is highly flexible. You can use Dask DataFrame with tabular datasets and it also extends to other applications like training ML models (XGBoost, PyTorch), workflow management (Prefect, Dagster), and highly custom parallelism.

For more information see Dask vs. Spark.

Polars vs. DuckDB

Our benchmarks show that typically DuckDB outperforms Polars on single-machine local computations, such as a Macbook Pro. This differs from other benchmarks we’ve read from other vendors, so it’s important to mention that local-hardware benchmarks are highly hardware specific. In particular, we were using the Polars streaming system, due to being memory constrained in these benchmarks, which differs from the system used in typical Polars benchmarks.

As we increase scale and move to the cloud, DuckDB’s lead becomes more prominent. In general, DuckDB feels like a more mature project today.

However, DuckDB also requires SQL knowledge (though there is an integration with Ibis Dataframes). Python users who prefer DataFrames may prefer Polars. Subjectively, Polars also “feels more Pythonic” which is admittedly squishy and hard to define, but also pretty important.

While we would recommend DuckDB from a performance standpoint, when given a novel data problem to solve we find ourselves reaching for Polars (or just pandas) more often.

Polars vs. Pandas and Dask

Polars beats the crap out of Dask locally on smallish data. Dask beat the crap out of Polars in the cloud on large data.

However, today when we say “Dask” in this context we mean “Dask DataFrame” which is a Dask + pandas project. In an ideal world, Polars and Dask would work together, much like how pandas and Dask work together today. Dask handles distributed coordination, but leaves in-memory computing to other projects (in this case Pandas). It would be very feasible to swap Polars in instead. They do very different things well, and other things less well (polars works great on a single machine, Dask helps coordinate many machines).

Dask vs. DuckDB

We mostly view a competition between Dask and Spark (scalable systems) and DuckDB and Polars (single-machine-but-fast systems) but Dask and DuckDB seem to win competitions often enough that we wanted to see how they compared against each other.

DuckDB generally does better for smaller datasets. Dask does a bit better on larger datasets. They’re both pretty solid though.

These results aren’t quite fair in two ways:

  1. We’re not looking at local systems, where DuckDB will be faster

  2. We’re constraining Dask to run at 128 cores or less, because DuckDB can’t go higher than that for a fair comparison

In practice we expect the application and language preference to have a higher impact on choice between these two. If you have an application that uses SQL then you’ll probably use DuckDB, even when it’s kinda slow at large scale. If you have an application which requires more generic Python logic you’ll probably choose Dask, even when it’s kinda slow at small scale.

TPC-H Experimental Details

We ran the TPC-H benchmarks with Dask, Spark, DuckDB and Polars on a variety of scales locally and on the cloud. Implementation details and instructions on how to run the benchmarks yourself are in the public coiled/benchmarks GitHub repository.

import coiled

# Get Dask Cluster
cluster = coiled.Cluster(
    n_workers=32,
    worker_vm_types=["m6i.xlarge"]
)
client = cluster.get_client()

Learn more

import coiled

# Get Spark Cluster
cluster = coiled.Cluster(
    n_workers=32,
    worker_vm_types=["m6i.xlarge"]
)
spark = cluster.get_spark()

Learn more

import coiled

# Run DuckDB query
@coiled.function(vm_types=["m6i.32xlarge"])
def query():
    connection = ...

Learn more

import coiled

# Run Polars Query
@coiled.function(vm_types=["m6i.32xlarge"])
def query():
    df = pl.read.parquet(...)
    ...

Learn more

While infrastructure setup was easy, no project “just worked” efficiently and everything required some degree of looking up documentation, tweaking configuration parameters to get decent performance at scale, passing credentials around, etc., but some more than others (gah! Spark.) Some projects degraded decently even when not optimally configured (DuckDB and Dask, nice job) while others failed ungracefully, grinding machines to a halt (Spark and Polars).

Bias

We’ve done our best to avoid intentionally tuning for our favorite project. We’ve chosen benchmarks chosen by other groups, run everything on standard hardware, and have asked the following experts in other systems to help ensure good performance for the different systems:

  • Polars: Ritchie Vink

  • Spark: Jeffrey Chou and Vinoo Ganesh of Sync Computing

  • DuckDB: no one, but this didn’t seem necessary (they do pretty well!)

Although of course everyone’s time is limited, and asking experts for input makes no guarantee that we’re still not missing important opportunities for performance in those projects.

Dataset Layout

The datasets used in our benchmarks are derived from the datasets generated for the TPC-H benchmarks at the corresponding scale factors stored in Parquet format. The scale reflects the total size of the entire dataset. To generate the data, we used the DuckDB TPC-H extension to generate the data and PyArrow to write partitions of ~100 MB each to disk. The resulting partitions are partially inhomogenous, which reflects the datasets we encounter in practice. All files are compressed with Snappy. See generate_data.py for details.

Measurements

We report three main metrics:

Runtime: This is the end-to-end runtime from generating the query to receiving the result (in seconds). This is our primary measurement.

Relative Difference in Runtime: This metric shows how large the performance difference between two systems is (in percent). We calculated the relative difference in runtime betwen two systems by dividing the absolute runtime difference by the runtime of the faster system:

\[\text{Relative Difference in Runtime} = abs(\frac{\text{Runtime}_X - \text{Runtime}_Y}{min(\text{Runtime}_X, \text{Runtime}_Y)})\]

Calculated this way, the measurement is symmetric, i.e., the value does not change if we compare X to Y or Y to X, and it is linear, i.e., each multiple of absolute difference will result in a multiple of the relative difference.

Query Completion: This boolean measurement shows whether the system completed the query by returning a result. If the system did not complete the query, it either encountered a hard failure (like processes being killed because they ran out of memory) or it timed out after several hours.

Local Configuration

The local benchmarks were run for scales 10 (10 GB) and 100 (100 GB). We executed them on a MacBook Pro with an M1 (8-core) Apple silicon CPU and 16 GB of RAM and the built-in SSD used for storage. We did our best to create reproducible benchmarks by closing noisy background processes where possible, but your mileage may vary here depending on what background processes you still have running on your machine. The software was installed with conda/mamba with builds from conda-forge. The datasets were stored on local SSD for fast access.

Cloud Configuration

The benchmarks were run on AWS on the cloud for scales 10 (10 GB), 100 (100 GB), 1,000 (1 TB), and 10,000 (10 TB) using EC2 instances from the M6i family (more on why we prefer non-burstable instances). DuckDB and Polars ran on a single, large machine with Coiled serverless functions while Dask and Spark ran on clusters. For scale 10,000 (10 TB), we performed two experiments and scaled the amount of vCPUs from 320 by a factor of 10 to 320. Since there is no instance type with 320 vCPUs available, this experiment was only run on the scale-out capable systems, Dask and PySpark.

Cloud hardware used for running benchmarks at various scales (see the configuration for more details).

Scale

vCPUs

Memory

Instances (single)

Instances (cluster)

100 GB

32

128 GiB

1 x m6i.8xlarge

16 x m6i.large

1 TB

128

512 GiB

1 x m6i.32xlarge

32 x m6i.xlarge

10 TB

128

512 GiB

1 x m6i.32xlarge

32 x m6i.xlarge

10 TB

1280

5 TiB

n/a

320 x m6i.xlarge

The datasets are stored on S3 in the public requester-pays bucket with subdirectories for each scale (e.g. s3://coiled-runtime-ci/tpc-h/snappy/scale-10/).

Library-Specific Configuration

Spark

For Spark, we manually configured memory limits in order to utilize existing memory while avoiding OOM errors. For cloud-based benchmarks, we configured spark.executor.memory to 80% of the available memory after tweaking for several iterations. For local benchmarks, we configured spark.driver.memory to 10g to avoid collect getting stuck. Locally, we also had to manually set spark.driver.host and spark.driver.bindAddress to 127.0.0.1 since the default does not work for M1 MacBooks. We used SparkSQL to run the benchmarks.

Dask

For Dask, we enabled Copy-on-Write (COW), which will become the default in pandas 3, which will be released soon. We also disabled task queueing. Since Dask is a dataframe-based system, we had to translate the reference SQL queries to idiomatic Dask queries. As we currently lack join order optimization (see dask-expr#1065), we had to manually order joins. For this, we diverged from the order in which tables were introduced in the SQL query to a manually optimized order that a careful user would typically choose when following the dimensions and facts defined in the TPC-H schema.

DuckDB

For DuckDB, we did not perform any configuration apart from passing security credentials.

Polars

Since Polars is a dataframe-based system, the reference SQL queries had to be translated to idiomatic Polars queries. For this, we reused the reference implementations published by Polars. To allow Polars to process larger-than-memory datasets, we executed all queries in streaming mode.

Learn More

This article covers quantitative results on the TPC-H Benchmark across common DataFrame libraries. For information on how to run these projects in the cloud yourself, consider looking at the Coiled documentation for each project:

We (Coiled) care about making data processing in the cloud easy, which involves running the right tool for the right job. That being said, we’re also biased towards Dask. If you’d like to read more about recent Dask performance, or how it compares to Spark today, we invite you to consider the following: