DuckDB with Coiled Functions#

DuckDB is a library for processing and storing tabular data in a highly efficient way. In this guide, you’ll learn how to combine the multithreaded execution of DuckDB with Coiled Functions to run queries on a large VM in the cloud.

Before you start#

You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.

conda create -n coiled-duckdb-example -c conda-forge python=3.10 coiled s3fs python-duckdb jupyterlab
conda activate coiled-duckdb-example

You could also use pip for everything, or any other package manager you prefer; conda isn’t required.
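For example, the pip equivalent looks like this (note that the package is published on PyPI as duckdb, not python-duckdb):

python -m pip install coiled s3fs duckdb jupyterlab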

When you run your code with Coiled Functions, Coiled will automatically replicate your local coiled-duckdb-example environment on the cloud VM.

About the Data#

In this example, we will use a dataset that the Coiled team created by pre-processing the Uber/Lyft dataset from the High-Volume For-Hire Services and joining it with the NYC Taxi Zone Lookup Table. The result is a dataset with ~1.4 billion rows.
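If you’d like a quick look at the underlying files before spinning anything up, a minimal sketch with s3fs (installed above) lists them, assuming the public bucket permits anonymous reads:

import s3fs

# List a few of the Parquet files that make up the dataset
fs = s3fs.S3FileSystem(anon=True)
print(fs.ls("coiled-datasets/uber-lyft-tlc")[:5])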

About the query#

We will run a simple GROUP BY followed by an aggregation that computes the percentage of tipped rides per operator. As a preprocessing step, we will load the data into a table so that the query operates on in-memory data.
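To make the query shape concrete, here is a toy version you can run locally against a few hand-written rows; the column names hvfhs_license_num and tips match the real dataset, but the values are invented:

import duckdb

# A tiny in-memory stand-in for the real table: one row per ride
conn = duckdb.connect()
conn.execute("""
    CREATE TABLE rides AS SELECT * FROM (VALUES
        ('HV0003', 2.5), ('HV0003', 0.0), ('HV0005', 1.0), ('HV0005', 3.0)
    ) AS t(hvfhs_license_num, tips)
""")

# Flag each ride as tipped (1) or not (0), then average the flag per operator
print(conn.execute("""
    SELECT hvfhs_license_num, sum(tipped) / count(tipped) AS tip_rate
    FROM (SELECT hvfhs_license_num,
                 CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
          FROM rides)
    GROUP BY hvfhs_license_num
""").fetchall())
# e.g. [('HV0003', 0.5), ('HV0005', 1.0)] (row order may vary)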

Dispatch the computation to a VM in the cloud#

We have to decorate the function that does the work with a Coiled-specific decorator that offloads the computation to the cloud.

import coiled

@coiled.function(
    vm_type="m6i.16xlarge", # 256 GB of RAM
    region="us-east-2",     # region of our data
    keepalive="5 minutes",  # keep alive to run multiple queries if necessary
)

This decorator will offload our workload to an EC2 instance with 256 GB of RAM, which is enough to hold our dataset in memory.

Query the data#

We will load the data into memory before executing our query.

import os

def load_data(conn):
    # Load the data into memory so that subsequent queries are fast
    conn.execute(
        """
        CREATE TABLE test AS
        SELECT * FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')
        """
    )

def compute_percentage_of_tipped_rides(conn):
    # Run the actual query
    return conn.execute(
        """
        SELECT hvfhs_license_num, sum(tipped) / count(tipped)
        FROM (
            SELECT hvfhs_license_num,
                   CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
            FROM test
        )
        GROUP BY hvfhs_license_num
        """
    ).fetchall()

def create_conn():
    import duckdb
    conn = duckdb.connect()
    conn.execute("INSTALL httpfs")
    conn.execute("LOAD httpfs")
    conn.execute(f"SET s3_region='us-east-2'")
    conn.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}'")
    conn.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}'")
    return conn
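These helpers also run locally, which is handy as a smoke test before paying for a large VM. A sketch, assuming your AWS credentials are set as environment variables and adding a LIMIT so only a small slice of the data is pulled:

# Local smoke test on a small sample (the LIMIT is only for this sketch;
# the real run below loads all ~1.4 billion rows)
conn = create_conn()
conn.execute(
    """
    CREATE TABLE test AS
    SELECT * FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')
    LIMIT 10000
    """
)
print(compute_percentage_of_tipped_rides(conn))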

Let’s put all of this together to run the actual queries:

import coiled

@coiled.function(
    vm_type="m6i.16xlarge", # 256 GB of RAM
    region="us-east-2",     # region of our data
    keepalive="5 minutes",  # keep alive to run multiple queries if necessary
)
def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)

query_results()

[('HV0003', 0.1498555901186066),
 ('HV0005', 0.1912300216459857),
 ('HV0004', 0.09294857737045926),
 ('HV0002', 0.08440046492889111)]

This is where the decorator comes in: Coiled Functions offloads the computation to AWS and sends back the results, which here amount to a small list of tuples. The provisioned instance will stay alive for 5 minutes so that we can reconnect if we have to run another query.
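For example, a follow-up call issued within that keepalive window reuses the warm VM rather than provisioning a new one:

# Runs on the same VM if issued within 5 minutes of the previous call;
# only the first call pays the instance boot time (load_data still re-runs)
query_results()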