DataFrames and ETL

Dask extends pandas to larger-than-memory datasets by splitting a DataFrame into many smaller pandas DataFrames and operating on them in parallel.

Dask DataFrames make it easy to load large datasets from cloud object stores or databases, perform feature engineering, and prepare data for machine learning.

Conceptual diagram of a single Dask DataFrame composed of four smaller pandas DataFrames.
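
To make this concrete, here is a minimal local sketch (no cluster required) that splits a small pandas DataFrame into four partitions; the data and column names are made up for illustration:

import pandas as pd
import dask.dataframe as dd

# A small pandas DataFrame with made-up example data
pdf = pd.DataFrame({
    "carrier": ["uber", "lyft"] * 4,
    "tips": [1.0, 0.0, 2.5, 0.0, 1.2, 0.0, 0.0, 3.0],
})

# One Dask DataFrame backed by four smaller pandas DataFrames
df = dd.from_pandas(pdf, npartitions=4)

# The familiar pandas API runs lazily across the partitions
df.groupby("carrier").tips.mean().compute()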

First, create and activate a software environment with the required packages:

conda create -n coiled-dataframe -c conda-forge python=3.10 coiled dask s3fs
conda activate coiled-dataframe

You can then run this example on the NYC Taxi High Volume For-Hire Vehicle (HVFHV) dataset:

import coiled
import dask.dataframe as dd

cluster = coiled.Cluster(
    n_workers=30,
    region="us-east-2",  # region close to our data to avoid extra costs
)
client = cluster.get_client()

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)
df["tipped"] = df.tips != 0  # flag rides where the rider tipped
df.groupby(
    "hvfhs_license_num"  # different carriers (e.g., Uber, Lyft)
).tipped.mean().compute()  # proportion of rides with a tip
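
An ETL job typically ends by writing the transformed data back out to object storage. Here is a minimal sketch of that last step, assuming you have write access to a bucket of your own (the s3://your-bucket/... path below is a placeholder, not a real location):

# Write the transformed dataset back to S3 as Parquet
# (s3://your-bucket/... is a placeholder; point it at a bucket you can write to)
df.to_parquet("s3://your-bucket/uber-lyft-tipped/")

# Release the cloud resources when finished
client.close()
cluster.shutdown()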

More Examples

For more in-depth examples, consider the following: