May 14, 2024

14m read

DataFrames at Scale Comparison: TPC-H#

Hendrik Makait, Sarah Johnson, Matthew Rocklin

We run benchmarks derived from the TPC-H benchmark suite on a variety of scales, hardware architectures, and dataframe projects, notably Apache Spark, Dask, DuckDB, and Polars. No project wins.

This post analyzes results within each project and between projects. To start, here is the raw data of the results:

There is too much data here to understand all at once, so this post breaks down results by project and by comparing different projects.

Project results: How each project performs individually
- Spark
- Dask
- DuckDB
- Polars
Comparisons between Projects: Head-to-head comparisons
Experimental Details: How we ran these benchmarks

When to use DuckDB vs. Polars vs. Dask vs. Spark#

Deciding which tool is right for you depends on many factors, some of which are objectively quantifiable (speed, scale), some of which are not (language preference, workload type, deployment pain).

In this section we’ll go project by project, explaining how each did.

Spark
Dask
DuckDB
Polars

First though, we’ll include an arbitrary table and some general subjective takes.

*DuckDB and Dask are the most robust choices today. Polars is developing rapidly, though, and we recommend their benchmarks for the latest performance details.*#
	Spark	Dask	DuckDB	Polars
Fast Locally	❌	🤔	✅	✅
Fast on Cloud (1 TB)	✅	✅	✅	❌
Fast on Cloud (10 TB)	❌	✅	✅	❌
Scales Out	✅	✅	❌	❌
SQL	✅	🤔	✅	🤔
More than SQL	✅	✅	❌	✅
Sensible Defaults	❌	✅	✅	✅

Local + small? Choose DuckDB or Polars. For small data (10 GB or less) Polars and DuckDB (or pandas) are great. Conversely, Dask and Spark leave a lot of performance on the table in this configuration.

Local + big? Choose Dask or DuckDB. For larger data, Polars isn’t yet fully stable. Polars is improving quickly, although, and is the most dynamic project in this group. DuckDB is more reliable (so is Dask, but it’s slower). Spark fails here pretty easily. If you want to perform SQL queries on 100 GB of local data, DuckDB is the clear winner here.

Cloud at 1 TB or less? Choose Dask or DuckDB. For datasets that fit on a single cloud VM, Dask and DuckDB usually outperform Spark and Polars, respectively. Performance between the fast-local systems (Polars/DuckDB) and the distributed systems (Dask/Spark) converge a lot when we move to the cloud, largely due to S3 access bottlenecks (it doesn’t matter if you’re lightning fast if you’re waiting on S3 for data access). DuckDB does well across scales for the same hardware, but can’t scale beyond a single VM.

Cloud at large scale? Choose Dask or Spark. DuckDB can be as efficient as Dask or Spark on a single big VM, but if you want to process large volumes quickly, you eventually need a scalable system. Dask and Spark can outperform DuckDB and Polars on larger problems by adding more hardware. For large scale, Dask often outperforms Spark and scales well beyond DuckDB and Polars.

How does Spark do on TPC-H Benchmarks?#

Spark is the de-facto standard technology in this space today, it is also the least performant in general. Spark’s days seem to be numbered.

Spark is able to complete queries at all scales, but is out-performed at large scale by Dask (by about 2x), and very out-performed at small scale by everyone (by about 10x+). Spark also performs poorly when given far less memory than the data it has to process, as was the case in the 10 TiB / 128 core case (which we included mostly so we could directly compare Spark against DuckDB, which is restricted to run on one machine).

Anecdotally, using Spark was also more painful than using the other systems. We spent more time configuring Spark than we spent on all other systems combined. Even then, we’re uncertain that we have it running optimally. Spark’s performance was highly sensitive to non-default settings (like explicitly setting an output number of shuffle partitions). Additionally we found that Spark was surprisingly inefficient. It had the highest CPU occupancy rates of any of the projects, while also some of the slowest performance.

Still, Spark’s history makes it a sane choice for any organization looking for a solid and well-known solution. We would recommend it mostly for inertia, and for its broad support among enterprise data software systems. We’re also curious about the various projects that rewrite Spark internals, like Photon from Databricks, and the open source Comet system, which were not evaluated here.

How does Dask do on TPC-H Benchmarks?#

Disclaimer: We like Dask and know how to use it better than the other frameworks.

Locally, Dask is outperformed at small scale by DuckDB and Polars, but performs reliably across all scales. Dask performs universally better than Spark, and is more robust than Polars and DuckDB even if it’s slower at small scales with fast disk. Dask consistently outperforms other technologies at scale in the cloud, particularly as we increase the amount of hardware available.

Dask was the most robust system among the configurations tested (although again, there’s bias here because we’re better at Dask than at other systems).

Compared to Spark, Dask was easy to configure. Dask automatically inspected local hardware and configured itself appropriately. Additionally Dask’s out-of-core shuffling algorithms were stable even when data sizes were far larger than memory (compare against the 10 TiB 128 core case).

How does DuckDB do on TPC-H Benchmarks?#

DuckDB performs commendably at all scales, both locally and in the cloud. It can be occasionally outperformed locally (by Polars, see comparisons below) and in the cloud (by Dask) but is a robust choice across all platforms.

However, two things stop us from recommending DuckDB universally (aside from self-interest):

Scale Limitations: If you need to do fast queries on 10+TB-scale data, DuckDB’s inability to use multiple machines is a limitation. Dask and Spark can easily get 5x speed increases over DuckDB at a 10x hardware cost difference.
SQL-only: Often people need more than just SQL (arguably this explain’s Python’s popularity today) and DuckDB is just SQL.

Using DuckDB was easy (it was pretty much plug-and-play). It is the only framework where we did not feel the urge to communicate with experts of the software, whereas we collaborated with Polars and Spark developers in producing these numbers (see the experiment details section below). It was however mildly inconvenient to add configuration to access data on S3, but documentation and LLMs sufficed to point us in the right direction. There are some small and straightforward things DuckDB can do to reduce friction for lazy users (and honestly, we’re all pretty lazy).

Kudos to the DuckDB team.

How does Polars do on TPC-H Benchmarks?#

Updated June 17, 2025: The Polars streaming engine has undergone significant rewrites since these benchmarks were run. We recommend the Polars benchmarks for the latest performance details.

Polars does fantastically well at modest scale (10 GB) locally, where it can use fast local disk and fit tables in memory. The only contender at this scale is DuckDB. However, Polars struggles in a few situations:

Its cloud data readers are not yet very performant
Its unable to handle large multi-table joins well, which are unfortunately common in this benchmark suite

Additionally, when Polars fails, it tends to fail hard, often resulting in machines grinding to a halt. We unreservedly recommend Polars for the 10 GB scale on local machines in memory, and are excited to see how Polars develops in the future (the project is moving quickly).

Using Polars was an interesting case. It both broke frequently, but was also rapidly fixed by the Polars maintainers. This gave the impression that it’s not yet ready (again, this is with the experimental streaming backend, not in-memory), but moving very fast. This was especially true when deploying on the cloud, where Polars has not historically seen as much use. Note that we have not run benchmarks with Polars on the cloud for scales above 100 GB because initial attempts showed poor scaling behavior.

TPC-H Experimental Details#

We ran the TPC-H benchmarks with Dask, Spark, DuckDB and Polars on a variety of scales locally and on the cloud. Implementation details and instructions on how to run the benchmarks yourself are in the public coiled/benchmarks GitHub repository.

Dask

import coiled

# Get Dask Cluster
cluster = coiled.Cluster(
    n_workers=32,
    worker_vm_types=["m6i.xlarge"]
)
client = cluster.get_client()

Learn more

PySpark

import coiled

# Get Spark Cluster
cluster = coiled.Cluster(
    n_workers=32,
    worker_vm_types=["m6i.xlarge"]
)
spark = cluster.get_spark()

Learn more

DuckDB

import coiled

# Run DuckDB query
@coiled.function(vm_types=["m6i.32xlarge"])
def query():
    connection = ...

Learn more

Polars

import coiled

# Run Polars Query
@coiled.function(vm_types=["m6i.32xlarge"])
def query():
    df = pl.read.parquet(...)
    ...

Learn more

While infrastructure setup was easy, no project “just worked” efficiently and everything required some degree of looking up documentation, tweaking configuration parameters to get decent performance at scale, passing credentials around, etc., but some more than others (gah! Spark.) Some projects degraded decently even when not optimally configured (DuckDB and Dask, nice job) while others failed ungracefully, grinding machines to a halt (Spark and Polars).

Bias#

We’ve done our best to avoid intentionally tuning for our favorite project. We’ve chosen benchmarks chosen by other groups, run everything on standard hardware, and have asked the following experts in other systems to help ensure good performance for the different systems:

Polars: Ritchie Vink
Spark: Jeffrey Chou and Vinoo Ganesh of Sync Computing
DuckDB: no one, but this didn’t seem necessary (they do pretty well!)

Although of course everyone’s time is limited, and asking experts for input makes no guarantee that we’re still not missing important opportunities for performance in those projects.

Dataset Layout#

The datasets used in our benchmarks are derived from the datasets generated for the TPC-H benchmarks at the corresponding scale factors stored in Parquet format. The scale reflects the total size of the entire dataset. To generate the data, we used the DuckDB TPC-H extension to generate the data and PyArrow to write partitions of ~100 MB each to disk. The resulting partitions are partially inhomogenous, which reflects the datasets we encounter in practice. All files are compressed with Snappy. See generate_data.py for details.

Measurements#

We report three main metrics:

Runtime: This is the end-to-end runtime from generating the query to receiving the result (in seconds). This is our primary measurement.

Relative Difference in Runtime: This metric shows how large the performance difference between two systems is (in percent). We calculated the relative difference in runtime betwen two systems by dividing the absolute runtime difference by the runtime of the faster system:

\[\text{Relative Difference in Runtime} = abs(\frac{\text{Runtime}_X - \text{Runtime}_Y}{min(\text{Runtime}_X, \text{Runtime}_Y)})\]

Calculated this way, the measurement is symmetric, i.e., the value does not change if we compare X to Y or Y to X, and it is linear, i.e., each multiple of absolute difference will result in a multiple of the relative difference.

Query Completion: This boolean measurement shows whether the system completed the query by returning a result. If the system did not complete the query, it either encountered a hard failure (like processes being killed because they ran out of memory) or it timed out after several hours.

Local Configuration#

The local benchmarks were run for scales 10 (10 GB) and 100 (100 GB). We executed them on a MacBook Pro with an M1 (8-core) Apple silicon CPU and 16 GB of RAM and the built-in SSD used for storage. We did our best to create reproducible benchmarks by closing noisy background processes where possible, but your mileage may vary here depending on what background processes you still have running on your machine. The software was installed with conda/mamba with builds from conda-forge. The datasets were stored on local SSD for fast access.

Cloud Configuration#

The benchmarks were run on AWS on the cloud for scales 10 (10 GB), 100 (100 GB), 1,000 (1 TB), and 10,000 (10 TB) using EC2 instances from the M6i family (more on why we prefer non-burstable instances). DuckDB and Polars ran on a single, large machine with Coiled serverless functions while Dask and Spark ran on clusters. For scale 10,000 (10 TB), we performed two experiments and scaled the amount of vCPUs from 320 by a factor of 10 to 320. Since there is no instance type with 320 vCPUs available, this experiment was only run on the scale-out capable systems, Dask and PySpark.

*Cloud hardware used for running benchmarks at various scales (see the configuration for more details).*#
Scale	vCPUs	Memory	Instances (single)	Instances (cluster)
100 GB	32	128 GiB	1 x `m6i.8xlarge`	16 x `m6i.large`
1 TB	128	512 GiB	1 x `m6i.32xlarge`	32 x `m6i.xlarge`
10 TB	128	512 GiB	1 x `m6i.32xlarge`	32 x `m6i.xlarge`
10 TB	1280	5 TiB	n/a	320 x `m6i.xlarge`

The datasets are stored on S3 in the public requester-pays bucket with subdirectories for each scale (e.g. s3://coiled-runtime-ci/tpc-h/snappy/scale-10/).

Library-Specific Configuration#

Spark#

For Spark, we manually configured memory limits in order to utilize existing memory while avoiding OOM errors. For cloud-based benchmarks, we configured spark.executor.memory to 80% of the available memory after tweaking for several iterations. For local benchmarks, we configured spark.driver.memory to 10g to avoid collect getting stuck. Locally, we also had to manually set spark.driver.host and spark.driver.bindAddress to 127.0.0.1 since the default does not work for M1 MacBooks. We used SparkSQL to run the benchmarks.

Dask#

For Dask, we enabled Copy-on-Write (COW), which will become the default in pandas 3, which will be released soon. We also disabled task queueing. Since Dask is a dataframe-based system, we had to translate the reference SQL queries to idiomatic Dask queries. As we currently lack join order optimization (see dask-expr#1065), we had to manually order joins. For this, we diverged from the order in which tables were introduced in the SQL query to a manually optimized order that a careful user would typically choose when following the dimensions and facts defined in the TPC-H schema.

DuckDB#

For DuckDB, we did not perform any configuration apart from passing security credentials.

Polars#

Since Polars is a dataframe-based system, the reference SQL queries had to be translated to idiomatic Polars queries. For this, we reused the reference implementations published by Polars. To allow Polars to process larger-than-memory datasets, we executed all queries in streaming mode.

Learn More#

This article covers quantitative results on the TPC-H Benchmark across common DataFrame libraries. For information on how to run these projects in the cloud yourself, consider looking at the Coiled documentation for each project:

We (Coiled) care about making data processing in the cloud easy, which involves running the right tool for the right job. That being said, we’re also biased towards Dask. If you’d like to read more about recent Dask performance, or how it compares to Spark today, we invite you to consider the following:

Coiled

DataFrames at Scale Comparison: TPC-H#

When to use DuckDB vs. Polars vs. Dask vs. Spark#

How does Spark do on TPC-H Benchmarks?#

How does Dask do on TPC-H Benchmarks?#

How does DuckDB do on TPC-H Benchmarks?#

How does Polars do on TPC-H Benchmarks?#

Comparisons#

Dask vs. Spark#

Polars vs. DuckDB#

Polars vs. Pandas and Dask#

Dask vs. DuckDB#

TPC-H Experimental Details#

Bias#

Dataset Layout#

Measurements#

Local Configuration#

Cloud Configuration#

Library-Specific Configuration#

Spark#

Dask#

DuckDB#

Polars#

Learn More#