DataFrames and ETL
First, install the following packages:
conda create -n coiled-dataframe -c conda-forge python=3.10 coiled dask s3fs
conda activate coiled-dataframe
You can run this example using the NYC Taxi High Volume For-Hire Vehicle services dataset:
import coiled
import dask.dataframe as dd
cluster = coiled.Cluster(
    n_workers=30,
    region="us-east-2",  # region close to our data to avoid extra costs
)
client = cluster.get_client()
df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)
df["tipped"] = df.tips != 0
df.groupby(
    "hvfhs_license_num"  # different carriers (e.g. Uber, Lyft)
).tipped.mean().compute()  # proportion of rides with a tip
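To see what the aggregation computes without starting a cluster, the same logic can be sketched locally with pandas. This is only an illustration: the rows and the license codes below are invented placeholders, not values taken from the dataset. Only the column names match the example above.

```python
import pandas as pd

# Hypothetical sample rows: the column names mirror the example above,
# but the license codes and tip amounts are made up for illustration.
sample = pd.DataFrame(
    {
        "hvfhs_license_num": ["HV0003", "HV0003", "HV0005", "HV0005"],
        "tips": [0.0, 2.5, 0.0, 0.0],
    }
)

# Boolean flag: True when the ride has a non-zero tip.
sample["tipped"] = sample.tips != 0

# The mean of a boolean column is the fraction of True values,
# so this gives the share of tipped rides per carrier.
tip_rate = sample.groupby("hvfhs_license_num").tipped.mean()
print(tip_rate)
```

Because the mean of a boolean column equals the fraction of `True` values, no separate count and division step is needed. The Dask version relies on the same property, with `.compute()` triggering the work on the cluster.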
For more in-depth examples, consider the following: