DuckDB with Coiled Functions#
DuckDB is a library for processing and storing tabular data in a highly efficient way. In this guide, you’ll learn how to combine the multithreaded execution of DuckDB with Serverless Functions to run queries on a large VM in the cloud.
You can download this Jupyter notebook to follow along.
Before you start#
You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.
conda create -n coiled-duckdb-example -c conda-forge python=3.10 coiled s3fs python-duckdb jupyterlab
conda activate coiled-duckdb-example
You could also use pip for everything, or any other package manager you prefer; conda isn’t required.
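For example, the equivalent setup with a plain virtual environment and pip would look roughly like this (note that the PyPI package is named duckdb, while the conda-forge package is python-duckdb):

```shell
# Create and activate a fresh virtual environment
python -m venv coiled-duckdb-example
source coiled-duckdb-example/bin/activate

# Same dependencies as the conda command above
pip install coiled s3fs duckdb jupyterlab
```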
When you run your code with Coiled Functions, Coiled will automatically replicate your local coiled-duckdb-example
environment to your cluster.
About the Data#
In this example, we will use a dataset that the Coiled team created by pre-processing the Uber/Lyft dataset from the High-Volume For-Hire Services and joining it with the NYC Taxi Zone Lookup Table. This results in a dataset with ~1.4 billion rows.
About the query#
We will do a simple group by followed by an aggregation that computes the percentage of tipped rides per operator. As a preprocessing step, we will load the data into memory so that the query operates on in-memory data.
Dispatch the computation to a VM in the cloud#
We decorate the function that does the work with a Coiled-specific decorator that offloads the computation to the cloud.
import coiled

@coiled.function(
    vm_type="m6i.16xlarge",  # 256 GB of RAM
    region="us-east-2",      # region of our data
    keepalive="5 minutes",   # keep alive to run multiple queries if necessary
)
This will offload our workload to an EC2 instance with 256 GB of RAM, which is enough to hold our dataset in memory.
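Conceptually, the decorated function is still called like any other Python function; Coiled intercepts the call, runs it on the remote VM, and ships the result back. A toy local stand-in (purely illustrative, not Coiled’s implementation) shows the call/return shape:

```python
import functools

def remote_function(**vm_kwargs):
    """Toy stand-in for coiled.function: here the 'remote' call just
    runs locally, but the calling convention is the same."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # With Coiled, the arguments would be serialized, the function
            # would run on the VM described by vm_kwargs, and the return
            # value would be sent back to the caller.
            return func(*args, **kwargs)
        return wrapper
    return decorator

@remote_function(vm_type="m6i.16xlarge", region="us-east-2")
def add(x, y):
    return x + y

add(2, 3)  # called like a normal function; returns 5
```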
Query the data#
We will load the data into memory before executing our query.
def load_data(conn):
    # Load data into memory so that subsequent queries are fast
    conn.execute(
        """
        CREATE TABLE test AS
        SELECT * FROM read_parquet("s3://coiled-datasets/uber-lyft-tlc/*")
        """
    )
def compute_percentage_of_tipped_rides(conn):
    # Run the actual query
    return conn.execute(
        """
        SELECT hvfhs_license_num, SUM(tipped) / COUNT(tipped)
        FROM (
            SELECT *,
                   CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
            FROM test
        )
        GROUP BY hvfhs_license_num
        """
    ).fetchall()
def create_conn():
    import duckdb

    conn = duckdb.connect()
    conn.execute("CALL load_aws_credentials()")
    return conn
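To sanity-check the query logic on a small sample without provisioning anything, the same CASE WHEN / GROUP BY pattern can be run against Python’s built-in sqlite3 module (a stand-in for DuckDB here; the table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (hvfhs_license_num TEXT, tips REAL)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [("HV0003", 2.5), ("HV0003", 0.0), ("HV0005", 1.0), ("HV0005", 0.0)],
)

rows = conn.execute(
    """
    SELECT hvfhs_license_num, 1.0 * SUM(tipped) / COUNT(tipped)
    FROM (
        SELECT *,
               CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
        FROM test
    )
    GROUP BY hvfhs_license_num
    ORDER BY hvfhs_license_num
    """
).fetchall()
# Each operator tipped on 1 of its 2 rides, so both ratios are 0.5
```

One difference: SQLite performs integer division on integer operands, hence the `1.0 *` factor; DuckDB’s `/` on integers already returns a float, so the query above doesn’t need it.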
Let’s put all of this together to run the actual queries:
import coiled

@coiled.function(
    vm_type="m6i.16xlarge",  # 256 GB of RAM
    region="us-east-2",      # region of our data
    keepalive="5 minutes",   # keep alive to run multiple queries if necessary
)
def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)

query_results()
[('HV0003', 0.1498555901186066),
('HV0005', 0.1912300216459857),
('HV0002', 0.08440046492889111),
('HV0004', 0.09294857737045926)]
This is where the actual decorator comes in. Coiled Functions offloads the computation to AWS and returns the results, which here is a very small list of tuples. The provisioned instance stays alive for 5 minutes so that we can reconnect if we need to run another query.