Process Hundreds of GB of Data with DuckDB in the Cloud

Code snippet of using the coiled.function decorator to run a query with DuckDB on a large VM in the cloud.

DuckDB is a great tool for running efficient queries on large datasets. When you want to be close to your data in the cloud or need more RAM than your workstation has, Coiled makes it easy to run your Python function on a cloud VM. In this post we’ll use the Coiled Functions API to process the 150 GB Uber-Lyft dataset on a single machine with DuckDB.

Query Parquet files with DuckDB

We start by writing the SQL queries that we want to run against the data. We’ll use the DuckDB Python API.

def load_data(conn):
    # Load data into memory so that subsequent queries are fast
    conn.execute(
        '''
        CREATE TABLE test AS
        SELECT * FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')
        '''
    )


def compute_percentage_of_tipped_rides(conn):
    # Run the actual query
    return conn.execute(
        '''
        SELECT hvfhs_license_num, sum(tipped) / count(tipped)
        FROM (
            SELECT
                *,
                CASE WHEN tips > 0.0 THEN 1 ELSE 0 END AS tipped
            FROM test
        )
        GROUP BY hvfhs_license_num
        '''
    ).fetchall()


def create_conn():
    import duckdb

    return duckdb.connect()


def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)

These queries aren’t particularly fancy; they are meant to illustrate how we can process these files. If we executed them as is, DuckDB would pull all of the data onto our local machine. The whole dataset won’t fit in memory on most workstations, and even if it did, loading it over the network would take a long time. Let’s look at how Coiled can make this work.
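
Before renting a big VM, it can be worth sanity-checking the SQL on a small sample. Here is a minimal local sketch, assuming your AWS credentials are already configured (credential setup is covered in the next section); the LIMIT keeps DuckDB from pulling the full 150 GB over the network:

import duckdb

conn = duckdb.connect()
conn.execute("CALL load_aws_credentials()")  # credential setup, explained below

# Only pull a handful of rows to validate the schema and the tipped logic
sample = conn.execute(
    '''
    SELECT hvfhs_license_num, tips
    FROM read_parquet('s3://coiled-datasets/uber-lyft-tlc/*')
    LIMIT 5
    '''
).fetchall()
print(sample)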

How do I query data stored in S3 using DuckDB?

This is where Coiled Functions come in: we need access to machines that have enough resources and are also close to our data. Coiled can connect to your AWS account and use the resources that are available there; the same steps also work for data in Google Cloud Storage or Azure Blob Storage. We will go through the steps necessary to execute these queries on a VM that is in the same region as our data and has enough memory available.

The dataset is stored in the S3 bucket s3://coiled-datasets/uber-lyft-tlc in the AWS region us-east-2. We’ll have to adapt our create_conn function to call load_aws_credentials, which comes from DuckDB’s aws extension; this extension builds on the httpfs extension to add authentication. Coiled forwards your credentials as short-lived (refreshable) STS tokens, which is more secure than shipping long-lived AWS access keys (an access key ID and secret access key).

def create_conn():
    import duckdb
    
    conn = duckdb.connect()
    conn.execute("CALL load_aws_credentials()")
    return conn
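
Recent DuckDB versions can autoload the httpfs and aws extensions on first use. If you prefer to make the moving parts explicit (or are on an older version), a more verbose sketch of create_conn could look like this; the shorter version above works as well:

def create_conn():
    import duckdb

    conn = duckdb.connect()
    # httpfs provides S3 access, the aws extension adds credential handling
    conn.execute("INSTALL httpfs; LOAD httpfs;")
    conn.execute("INSTALL aws; LOAD aws;")
    conn.execute("CALL load_aws_credentials()")
    # Point DuckDB at the bucket's region explicitly
    conn.execute("SET s3_region='us-east-2'")
    return conn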

The next step is adding the @coiled.function decorator to the function that executes our queries. The decorator tells Coiled to spin up a large VM on AWS, run the query there, and then return the result back to our local machine.

@coiled.function(
    vm_type="m6i.16xlarge", # 256 GB of RAM
    region="us-east-2",     # region of our data
    keepalive="5 minutes",  # keep alive to run multiple queries if necessary
)
def query_results():
    conn = create_conn()
    load_data(conn)
    return compute_percentage_of_tipped_rides(conn)

Let’s execute our queries and pull the results back to our local machine:

result = query_results()

print(result)
[
  ('HV0005', 0.1912300216459857), 
  ('HV0003', 0.1498555901186066), 
  ('HV0004', 0.09294857737045926), 
  ('HV0002', 0.08440046492889111),
]

The data are now all in memory on our VM in the cloud:

Plot from the Coiled dashboard: memory usage goes up to 150 GB as we load the dataset on a big VM in the cloud.

There is no need to adjust the other functions. Coiled will run our query on a VM in the cloud with enough resources and close to our data.

Let’s take a brief look at the arguments to coiled.function():

  • vm_type: This specifies the type of AWS EC2 instance. We are looking for an instance with enough memory to hold our data; an m6i.16xlarge has 256 GB, which should be sufficient.

  • region: The AWS region to start our VM in. Our data are also stored in "us-east-2", so we avoid moving data between regions.

  • keepalive: Keeps the VM alive so that we can run multiple queries against the data in memory.

coiled.function() will now start a VM in AWS with the specified EC2 instance type. The VM is normally up and running within 1-2 minutes. Coiled scans our local environment and replicates the same dependencies on that machine, so we don’t have to specify an explicit Python environment. The inputs to our function are serialized and sent to the VM as well, and the results are sent back to our local machine.
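
To see the input serialization in action, here is a hypothetical variant (not part of the original example) that takes the license number as an argument; the argument is serialized locally and shipped to the VM along with the function:

@coiled.function(
    vm_type="m6i.16xlarge",
    region="us-east-2",
    keepalive="5 minutes",
)
def tipped_fraction_for(license_num):
    # license_num travels from the local machine to the cloud VM with the function
    conn = create_conn()
    load_data(conn)
    return conn.execute(
        '''
        SELECT sum(CASE WHEN tips > 0.0 THEN 1 ELSE 0 END) / count(*)
        FROM test
        WHERE hvfhs_license_num = ?
        ''',
        [license_num],
    ).fetchall()[0][0]


print(tipped_fraction_for("HV0003"))  # roughly 0.15, matching the result above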

Coiled would normally shut the VM down right after the Python interpreter finishes, mostly to reduce costs. We specified keepalive="5 minutes" to keep the VM alive for a few minutes after our Python interpreter finishes. This ensures that new local runs can connect to the same VM, avoiding the boot time of up to 2 minutes; we call this a warm start.
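
For example, calling the decorated function again shortly after the first run connects to the same warm VM instead of booting a new one:

result = query_results()   # cold start: boots a new VM, ~1-2 minutes
# ... the interpreter exits; within 5 minutes we run the script again ...
result = query_results()   # warm start: reuses the still-running VM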

Conclusion

You can use Coiled serverless functions to run queries on a machine with as much memory as you want. This gives you access to computational resources that can be very close to your data, which makes data processing in the cloud very easy. Check out the docs for more details.

Want to run this example yourself? Get started with Coiled for free at coiled.io/start. This example runs comfortably within the free tier.

Additional use cases: