DuckDB queries in the cloud with Coiled#
DuckDB is an in-process database for processing and storing tabular data in a highly efficient way. In this guide, you'll learn how to leverage the multithreaded nature of DuckDB with Coiled serverless functions by running your query on a large VM in the cloud.
You can download this Jupyter notebook to follow along.
Before you start#
You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.
conda create -n coiled-duckdb-example -c conda-forge python=3.10 coiled s3fs python-duckdb jupyterlab
conda activate coiled-duckdb-example
You could also use pip for everything, or any other package manager you prefer; conda isn't required.
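For example, a pip-based setup would look like this (note that the DuckDB package is published on PyPI as duckdb rather than python-duckdb):
python -m pip install coiled s3fs duckdb jupyterlab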
When you run your code with Coiled Functions, Coiled will automatically replicate your local coiled-duckdb-example environment on the cloud VM.
About the data#
In this example, we will use a dataset that the Coiled team created by pre-processing the Uber/Lyft records from the NYC Taxi and Limousine Commission's High Volume For-Hire Services data and joining them with the NYC Taxi Zone Lookup Table. The result is a dataset with roughly 1.4 billion rows.
About the query#
We will run a simple GROUP BY followed by an aggregation that computes the percentage of tipped rides per operator. As a preprocessing step, we will load the data into a DuckDB table so that the query operates on in-memory data.
Dispatch the computation to a VM in the cloud#
We have to decorate the function that does the work with the Coiled-specific @coiled.function decorator, which offloads the computation to the cloud:
import coiled

@coiled.function(
    vm_type="m6i.16xlarge",  # 64 vCPUs, 256 GiB of RAM
    region="us-east-2",      # region of our data
    keepalive="5 minutes",   # keep the VM alive to run multiple queries if necessary
)
def query_results():
    ...  # defined in full below
This will offload our workload to an m6i.16xlarge EC2 instance with 64 vCPUs and 256 GiB of RAM, which is enough to hold our dataset in memory and gives DuckDB plenty of threads to parallelize over.
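DuckDB uses all available cores by default, so no extra configuration is needed. If you want to make the limits explicit, you can set them yourself; a minimal sketch using DuckDB's standard configuration options (the specific values here are illustrative):
import duckdb

conn = duckdb.connect()
conn.execute("SET threads TO 64")           # one thread per vCPU on m6i.16xlarge
conn.execute("SET memory_limit = '200GB'")  # leave headroom under the 256 GiB of RAM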
Query the data#
We will load the data into memory before executing our query.
def load_data(conn):
    # Load the data into an in-memory table so that subsequent queries are fast
    conn.execute(
        """
        CREATE TABLE test AS
        SELECT * FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')
        """
    )
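If you want to sanity-check the load before running the real query, a quick row count works; count_rows here is a hypothetical helper, not part of the original example:
def count_rows(conn):
    # Number of rows in the freshly created in-memory table
    return conn.execute("SELECT count(*) FROM test").fetchone()[0]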
def compute_percentage_of_tipped_rides(conn):
    # Run the actual query
    return conn.execute(
        """
        SELECT hvfhs_license_num, sum(tipped) / count(tipped)
        FROM (
            SELECT
                *,
                CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
            FROM test
        )
        GROUP BY hvfhs_license_num
        """
    ).fetchall()
def create_conn():
    # Import inside the function so the import is resolved on the cloud VM
    import duckdb

    conn = duckdb.connect()
    # Pick up the AWS credentials available on the VM so we can read from S3
    conn.execute("CALL load_aws_credentials()")
    return conn
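The load_aws_credentials function comes from DuckDB's aws extension. Recent DuckDB versions autoload it, but if yours does not, a variant of create_conn that installs and loads the S3-related extensions explicitly would look like this (a sketch, using DuckDB's standard httpfs and aws extensions):
def create_conn():
    import duckdb

    conn = duckdb.connect()
    # Install and load the extensions needed to read from S3,
    # in case your DuckDB version does not autoload them
    conn.execute("INSTALL httpfs")
    conn.execute("LOAD httpfs")
    conn.execute("INSTALL aws")
    conn.execute("LOAD aws")
    conn.execute("CALL load_aws_credentials()")
    return conn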
Let’s put all of this together to run the actual queries:
import coiled

@coiled.function(
    vm_type="m6i.16xlarge",  # 64 vCPUs, 256 GiB of RAM
    region="us-east-2",      # region of our data
    keepalive="5 minutes",   # keep the VM alive to run multiple queries if necessary
)
def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)
query_results()
[('HV0003', 0.1498555901186066),
 ('HV0005', 0.1912300216459857),
 ('HV0002', 0.08440046492889111),
 ('HV0004', 0.09294857737045926)]
This is where the decorator comes in: Coiled Functions offloads the computation to AWS and returns the result, which is a small list of tuples. The provisioned instance stays alive for five minutes, so we can run another query without waiting for a new VM to boot.
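If you expect to run several different queries against the same dataset, you could parameterize the decorated function; a minimal sketch, where run_query is a hypothetical helper built from the pieces above:
@coiled.function(
    vm_type="m6i.16xlarge",
    region="us-east-2",
    keepalive="5 minutes",
)
def run_query(sql):
    # Each call reloads the table, but within the keepalive window
    # subsequent calls at least skip provisioning a fresh VM.
    conn = create_conn()
    load_data(conn)
    return conn.execute(sql).fetchall()

run_query("SELECT count(*) FROM test")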