DuckDB with Coiled Functions

DuckDB is a library for processing and storing tabular data in a highly efficient way. In this guide, you'll learn how to leverage the multithreaded nature of DuckDB with Coiled serverless functions, running your query on a large VM in the cloud.

You can download this Jupyter notebook to follow along.

Before you start

You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.

conda create -n coiled-duckdb-example -c conda-forge python=3.10 coiled s3fs python-duckdb jupyterlab
conda activate coiled-duckdb-example

You could also use pip for everything, or any other package manager you prefer; conda isn't required.
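
For example, the pip equivalent of the conda command above (note that the PyPI package is named duckdb rather than python-duckdb):

pip install coiled s3fs duckdb jupyterlab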

When you run your code with Coiled Functions, Coiled will automatically replicate your local coiled-duckdb-example software environment on the cloud VM.
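
If this is your first time using Coiled from this machine, you'll also need to authenticate (this assumes you already have a Coiled account):

coiled login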

About the Data

In this example, we will use a dataset that the Coiled team created by pre-processing the Uber/Lyft dataset from the High-Volume For-Hire Services and joining it with the NYC Taxi Zone Lookup Table. The result is a dataset with roughly 1.4 billion rows.

About the query

We will run a simple GROUP BY followed by an aggregation, which computes the percentage of tipped rides per operator. As a preprocessing step, we will load the data into memory so the query operates on in-memory data.
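
Before running this at scale, here is a minimal sketch of the pattern on a few rows of hypothetical toy data (the column names mirror the real dataset):

import duckdb

# Toy illustration of the tipped-percentage pattern; the VALUES rows are made up
duckdb.sql("""
    SELECT
        hvfhs_license_num,
        SUM(CASE WHEN tips > 0.0 THEN 1 ELSE 0 END) / COUNT(*) AS tipped_pct
    FROM (VALUES
        ('HV0003', 5.0),
        ('HV0003', 0.0),
        ('HV0005', 2.0)
    ) AS rides(hvfhs_license_num, tips)
    GROUP BY hvfhs_license_num
""").show()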

Dispatch the computation to a VM in the cloud

We decorate the function that does the work with a Coiled-specific decorator that offloads the computation to the cloud.

import coiled

@coiled.function(
    vm_type="m6i.16xlarge", # 256 GB of RAM
    region="us-east-2",     # region of our data
    keepalive="5 minutes",  # keep alive to run multiple queries if necessary
)
def query_results():
    ...  # the full function body is shown below

This will offload our workload to an EC2 instance with 256 GB of RAM, which is enough to hold our dataset in memory.
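
To see the offloading itself, here is a minimal sketch with a hypothetical helper (not part of this example) that reports the CPU count of whatever machine it runs on:

import coiled

@coiled.function(vm_type="m6i.16xlarge", region="us-east-2")
def cpu_count_on_vm():
    import os

    # This body executes on the cloud VM, not on your local machine
    return os.cpu_count()

cpu_count_on_vm()  # 64, the vCPU count of an m6i.16xlarge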

Query the data

We will load the data into memory before executing our query.

def load_data(conn):
    # Load the dataset into an in-memory table so subsequent queries are fast
    conn.execute(
        """
        CREATE TABLE test AS
        SELECT * FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')
        """
    )


def compute_percentage_of_tipped_rides(conn):
    # Run the actual query
    return conn.execute(
        """
        SELECT hvfhs_license_num, SUM(tipped) / COUNT(tipped)
        FROM (
            SELECT *,
                CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
            FROM test
        )
        GROUP BY hvfhs_license_num
        """
    ).fetchall()


def create_conn():
    import duckdb

    conn = duckdb.connect()
    # Pick up AWS credentials from the VM's environment so we can read from S3
    conn.execute("CALL load_aws_credentials()")
    return conn
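
DuckDB uses all available cores by default, but you can also size it to the VM explicitly. Here is a minimal sketch of an alternative connection helper; the SET values are assumptions tuned to an m6i.16xlarge, not required settings:

def create_conn_with_limits():
    import duckdb

    conn = duckdb.connect()
    # Explicitly size DuckDB to the VM: 64 threads and most of the 256 GB of RAM
    conn.execute("SET threads TO 64")
    conn.execute("SET memory_limit = '200GB'")
    conn.execute("CALL load_aws_credentials()")
    return conn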

Let’s put all of this together to run the actual queries:

import coiled


@coiled.function(
    vm_type="m6i.16xlarge", # 256 GB of RAM
    region="us-east-2",     # region of our data
    keepalive="5 minutes",  # keep alive to run multiple queries if necessary
)
def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)

query_results()
[('HV0003', 0.1498555901186066),
 ('HV0005', 0.1912300216459857),
 ('HV0002', 0.08440046492889111),
 ('HV0004', 0.09294857737045926)]

This is where the decorator comes in: Coiled Functions offloads the computation to AWS and sends back the result, a small list of tuples. The provisioned instance stays alive for 5 minutes, so we can reconnect if we have to run another query.
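
For example, a follow-up query within that window can reuse the warm VM. A minimal sketch, assuming Coiled matches the new function to the still-running instance (average_tip is a hypothetical addition, not part of the original example):

@coiled.function(
    vm_type="m6i.16xlarge",
    region="us-east-2",
    keepalive="5 minutes",
)
def average_tip():
    # With the same VM specification, this can land on the instance kept
    # alive by the previous call, skipping provisioning time.
    conn = create_conn()
    load_data(conn)
    return conn.execute(
        "SELECT hvfhs_license_num, AVG(tips) FROM test GROUP BY hvfhs_license_num"
    ).fetchall()

average_tip()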