DuckDB queries in the cloud with Coiled#
DuckDB is an in-process database for processing and storing tabular data in a highly efficient way. In this guide, you'll learn how to leverage the multithreaded nature of DuckDB with Coiled serverless functions by running your query on a large VM in the cloud.
You can download this Jupyter notebook to follow along.
Before you start#
You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.
conda create -n coiled-duckdb-example -c conda-forge python=3.10 coiled s3fs python-duckdb jupyterlab
conda activate coiled-duckdb-example
You could also use pip for everything, or any other package manager you prefer; conda isn't required.
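For example, a pip-based setup would look like this (note that the DuckDB package is published on PyPI as duckdb rather than python-duckdb):
python -m pip install coiled s3fs duckdb jupyterlab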
When you run your code with Coiled Functions, Coiled will automatically replicate your local coiled-duckdb-example environment on the cloud VM.
About the data#
In this example, we will use a dataset that the Coiled team created by pre-processing the Uber/Lyft records from the NYC Taxi and Limousine Commission's High Volume For-Hire Services data and joining them with the NYC Taxi Zone Lookup Table. The result is a dataset with roughly 1.4 billion rows.
About the query#
We will run a simple GROUP BY followed by an aggregation that computes the percentage of tipped rides per operator. As a preprocessing step, we will load the data into a DuckDB table so that the query operates on in-memory data.
Dispatch the computation to a VM in the cloud#
We have to decorate the function that does the work with the Coiled-specific @coiled.function decorator, which offloads the computation to the cloud:
import coiled

@coiled.function(
    vm_type="m6i.16xlarge",  # 64 vCPUs, 256 GiB of RAM
    region="us-east-2",      # region of our data
    keepalive="5 minutes",   # keep the VM alive to run multiple queries if necessary
)
def query_results():
    ...  # defined in full below
This will offload our workload to an m6i.16xlarge EC2 instance with 64 vCPUs and 256 GiB of RAM, which is enough to hold our dataset in memory and gives DuckDB plenty of threads to parallelize over.
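DuckDB uses all available cores by default, so no extra configuration is needed. If you want to make the limits explicit, you can set them yourself; a minimal sketch using DuckDB's standard configuration options (the specific values here are illustrative):
import duckdb

conn = duckdb.connect()
conn.execute("SET threads TO 64")           # one thread per vCPU on m6i.16xlarge
conn.execute("SET memory_limit = '200GB'")  # leave headroom under the 256 GiB of RAM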
Query the data#
We will load the data into memory before executing our query.
def load_data(conn):
    # Load the data into an in-memory table so that subsequent queries are fast
    conn.execute(
        """
        CREATE TABLE test AS
        SELECT * FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')
        """
    )
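If you want to sanity-check the load before running the real query, a quick row count works; count_rows here is a hypothetical helper, not part of the original example:
def count_rows(conn):
    # Number of rows in the freshly created in-memory table
    return conn.execute("SELECT count(*) FROM test").fetchone()[0]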
def compute_percentage_of_tipped_rides(conn):
    # Run the actual query
    return conn.execute(
        """
        SELECT hvfhs_license_num, sum(tipped) / count(tipped)
        FROM (
            SELECT
                *,
                CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
            FROM test
        )
        GROUP BY hvfhs_license_num
        """
    ).fetchall()
def create_conn():
    # Import inside the function so the import is resolved on the cloud VM
    import duckdb

    conn = duckdb.connect()
    # Pick up the AWS credentials available on the VM so we can read from S3
    conn.execute("CALL load_aws_credentials()")
    return conn
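The load_aws_credentials function comes from DuckDB's aws extension. Recent DuckDB versions autoload it, but if yours does not, a variant of create_conn that installs and loads the S3-related extensions explicitly would look like this (a sketch, using DuckDB's standard httpfs and aws extensions):
def create_conn():
    import duckdb

    conn = duckdb.connect()
    # Install and load the extensions needed to read from S3,
    # in case your DuckDB version does not autoload them
    conn.execute("INSTALL httpfs")
    conn.execute("LOAD httpfs")
    conn.execute("INSTALL aws")
    conn.execute("LOAD aws")
    conn.execute("CALL load_aws_credentials()")
    return conn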
Let’s put all of this together to run the actual queries:
import coiled

@coiled.function(
    vm_type="m6i.16xlarge",  # 64 vCPUs, 256 GiB of RAM
    region="us-east-2",      # region of our data
    keepalive="5 minutes",   # keep the VM alive to run multiple queries if necessary
)
def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)
query_results()
[('HV0003', 0.1498555901186066),
 ('HV0005', 0.1912300216459857),
 ('HV0002', 0.08440046492889111),
 ('HV0004', 0.09294857737045926)]
This is where the decorator comes in: Coiled Functions offloads the computation to AWS and returns the result, which is a small list of tuples. The provisioned instance stays alive for five minutes, so we can run another query without waiting for a new VM to boot.
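If you expect to run several different queries against the same dataset, you could parameterize the decorated function; a minimal sketch, where run_query is a hypothetical helper built from the pieces above:
@coiled.function(
    vm_type="m6i.16xlarge",
    region="us-east-2",
    keepalive="5 minutes",
)
def run_query(sql):
    # Each call reloads the table, but within the keepalive window
    # subsequent calls at least skip provisioning a fresh VM.
    conn = create_conn()
    load_data(conn)
    return conn.execute(sql).fetchall()

run_query("SELECT count(*) FROM test")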