Process Hundreds of GB of Data in the Cloud with Polars

Code snippet of using the coiled.function decorator to run a query with Polars on a large VM in the cloud

Local machines can struggle to process large datasets because of memory and network limitations. Coiled Functions provide a cloud-based way to handle such datasets efficiently and cost-effectively, sidestepping the constraints of local hardware for heavy data processing tasks. Pairing them with a library like Polars adds optimized, multi-threaded query execution, so the data gets processed faster still.

In this post we’ll use Coiled Functions to process the 150 GB Uber-Lyft dataset on a single cloud machine with Polars.

Query S3 data locally with Polars

We start by writing our queries just as we would for local execution against the dataset.

import polars as pl

def load_data():
    # Define the S3 path and storage options and lazily scan the Parquet files
    s3_path = 's3://coiled-datasets/uber-lyft-tlc/*'
    storage_options = {"aws_region": "us-east-2"}
    return pl.scan_parquet(s3_path, storage_options=storage_options)

def compute_percentage_of_tipped_rides(lazy_df):
    # Compute the percentage of rides with a tip for each license number
    result = (lazy_df
              .with_columns((pl.col("tips") > 0).alias("tipped"))
              .group_by("hvfhs_license_num")
              .agg([
                  (pl.col("tipped").sum() / pl.col("tipped").count()).alias("percentage_tipped")
              ]))
    return result.collect()

def query_results():
    # Load and process the dataset to compute the results
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)

This query is pretty basic; it is meant as a simple example of how we can process these large files. If we execute this query locally, we would pull all of the data onto the machine, process it, and return a small fraction of the data. Unfortunately for me, at least, this would hit the physical memory limits of my machine. It would also take a long time, as I would have to download the whole dataset and be limited by my network bandwidth. In other situations this would also be an expensive operation as I would have to pay egress costs on the data. Let’s see how Coiled can help solve these issues.

Query S3 data in the cloud with Coiled serverless functions

Coiled Functions allow us to use cloud machines that are large enough to process this dataset. Coiled connects to AWS or GCP, so we can access any virtual machine (VM) that is available there. We can also co-locate the VM with the data, speeding up processing and avoiding egress costs.

All we need to do is add a @coiled.function decorator to our main function. The decorator tells Coiled that it should spin up a large VM on AWS, run the query there, and then return the result locally.

import coiled

@coiled.function(
    vm_type="m6i.4xlarge",  # VM with 64GB RAM
    region="us-east-2",     # AWS region to match the data location
    keepalive="5 minutes",  # VM keepalive time for potential subsequent queries
)
def query_results():
    # Load and process the dataset to compute the desired metric
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)

coiled.function() takes a variety of arguments. Let’s take a look at a few of the important ones:

  • vm_type: This specifies the type of VM that we want the computation to run on.

  • region: This specifies the AWS or GCP region that we want to start the VM in.

  • keepalive: This keeps the VM alive so that we can run multiple queries against the data in memory.
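
As an aside, you don't have to name an exact instance type: coiled.function also accepts resource hints such as memory and cpu and lets Coiled pick a matching VM. Here is a minimal sketch of that alternative (we stick with the vm_type version for the rest of this post; the exact accepted values may vary by Coiled version):

import coiled

# Alternative: describe the resources instead of naming an EC2 instance type.
# Coiled picks an instance with at least this much RAM and this many vCPUs.
@coiled.function(
    memory="64 GiB",
    cpu=16,
    region="us-east-2",
    keepalive="5 minutes",
)
def query_results_by_size():
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)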

Let’s execute our queries and pull the results back to our local machine:

result = query_results()
print(result.rows())
[
  ('HV0002', 0.08440046492889111),
  ('HV0003', 0.1498555901186066),
  ('HV0004', 0.09294857737045926),
  ('HV0005', 0.1912300216459857),
]

That’s all you need to do to run this function in the cloud. There is no need to adjust the other functions or manually spin up a VM.

What just happened

coiled.function() first scanned my local environment for the installed packages and replicated those on the VM that I requested. We didn't need to specify a Docker container or a specific Python environment (though we could). Coiled then started a VM in AWS of the specified EC2 instance type. The VM took about two minutes to start up, and then downloading the dataset and running the computations in the cloud took another five minutes. Not too bad for processing 150 GB! Finally, Coiled returned the results back to our local machine.
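
If you do want to pin the runtime explicitly rather than rely on automatic package synchronization, you can point Coiled at an environment of your own. A minimal sketch, assuming the container argument described in the Coiled docs and a hypothetical image name:

import coiled

# Pin the runtime explicitly instead of relying on automatic package sync.
# "my-registry/polars-runtime:latest" is a hypothetical Docker image name.
@coiled.function(
    container="my-registry/polars-runtime:latest",
    vm_type="m6i.4xlarge",
    region="us-east-2",
)
def query_results():
    lazy_df = load_data()
    return compute_percentage_of_tipped_rides(lazy_df)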

If we look at the Coiled Dashboard, we can see that all the data was pulled into the VM in the cloud, processed, and returned.

Plot from the Coiled dashboard showing the memory increase on our VM

VM Memory Utilization Increasing as the Computation Runs

Coiled would normally shut down the VM immediately after the Python interpreter exits in order to reduce costs. We specified keepalive="5 minutes" to keep the VM running in the cloud so that we can reuse it for other queries. This avoids the normal two-minute boot time; we call this a "warm start."
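
For example, a follow-up query submitted within that window reuses the warm VM, so only the query itself has to run. The second function below is just an illustrative sketch that reuses load_data and the same decorator pattern; the specific aggregation is our own example, not part of the original query:

import coiled
import polars as pl

@coiled.function(
    vm_type="m6i.4xlarge",
    region="us-east-2",
    keepalive="5 minutes",
)
def average_tip_per_license():
    # A second, illustrative query; running it shortly after the first one
    # reuses the VM that keepalive left running, skipping the ~2 minute boot.
    lazy_df = load_data()
    return (
        lazy_df
        .group_by("hvfhs_license_num")
        .agg(pl.col("tips").mean().alias("average_tip"))
        .collect()
    )

second_result = average_tip_per_license()  # warm start: no new VM boot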

Conclusion

Coiled Functions and Polars provide a solution for processing large datasets in the cloud. This approach allows data practitioners to overcome local machine limitations and reduce operational costs. This example demonstrates how simple it is to execute large-scale queries using Coiled. By adding a @coiled.function decorator, we can swiftly transition from local to cloud processing, gaining the benefits of powerful cloud computing resources. The outcome is a faster, more scalable data processing workflow that lets us focus on data analysis rather than infrastructure.

Want to run this example yourself?

  • Get started with Coiled for free at coiled.io/start. This example runs comfortably within the free tier.

  • Copy and paste the first code snippet and add the @coiled.function decorator.

You can check out the docs or take a look at how to utilize Coiled Functions to run the same query using DuckDB.