Visualize 1,000,000,000 Points

In this notebook we process more than one billion points (about 1.7 billion after cleaning) and set them up for interactive visualization. You can download this Jupyter notebook to follow along.

Before you start

You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.

conda create -n coiled-datashader -c conda-forge python=3.10 coiled dask s3fs pyarrow datashader hvplot jupyter_bokeh
conda activate coiled-datashader

You could also use pip for everything, or any other package manager you prefer; conda isn’t required.

When you later create a Coiled cluster, your local coiled-datashader environment will be automatically replicated on your cluster.

import dask.dataframe as dd
import datashader
import hvplot.dask
import coiled
from dask.distributed import Client, wait

Create Cluster

cluster = coiled.Cluster(
    region="us-east-2",  # start workers close to data to minimize costs
)

client = cluster.get_client()

Load data


df = dd.read_parquet(
    ...,  # placeholder: path to the Parquet dataset goes here
    columns=["dropoff_longitude", "dropoff_latitude", "pickup_longitude", "pickup_latitude"],
)

# clean data to limit to lat-longs near nyc
df = df.loc[
    (df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) &
    (df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9) &
    (df.pickup_longitude > -74.1) & (df.pickup_longitude < -73.7) &
    (df.pickup_latitude > 40.6) & (df.pickup_latitude < 40.9)
]
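
To make the filter's logic concrete, here is a minimal plain-Python sketch of the same bounding-box test applied to individual coordinate pairs. The helper names `in_nyc_box` and `keep_trip` are hypothetical and exist only for illustration; the bounds are the ones used in the Dask filter. The comparisons are strict, so points exactly on the boundary are dropped.

```python
# Bounds copied from the Dask filter (degrees).
LON_MIN, LON_MAX = -74.1, -73.7
LAT_MIN, LAT_MAX = 40.6, 40.9


def in_nyc_box(lon, lat):
    """Return True if the point lies strictly inside the bounding box."""
    return LON_MIN < lon < LON_MAX and LAT_MIN < lat < LAT_MAX


def keep_trip(pickup, dropoff):
    """A trip survives only if both endpoints are inside the box.

    `pickup` and `dropoff` are (longitude, latitude) tuples.
    """
    return in_nyc_box(*pickup) and in_nyc_box(*dropoff)


# Illustrative coordinates, not taken from the dataset.
midtown = (-73.98, 40.75)
far_away = (-75.0, 40.75)

print(keep_trip(midtown, midtown))   # both endpoints inside the box
print(keep_trip(midtown, far_away))  # dropoff outside the box
```

Because a single out-of-range endpoint removes the whole trip, the pickup and dropoff point sets that come out of this filter always have the same size.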

# now we have to get a DataFrame with just dropoff locations
df_drop = df[["dropoff_longitude", "dropoff_latitude"]]
df_drop["journey_type"] = "dropoff"
df_drop = df_drop.rename(columns={'dropoff_longitude': 'long', 'dropoff_latitude': 'lat'})

# now do the same for pickups
df_pick = df[["pickup_longitude", "pickup_latitude"]]
df_pick["journey_type"] = "pickup"
df_pick = df_pick.rename(columns={'pickup_longitude': 'long', 'pickup_latitude': 'lat'})

# concatenate two dask dataframes
df_plot = dd.concat([df_drop, df_pick])

df_plot = df_plot.astype({"journey_type": "category"})
df_plot["journey_type"] = df_plot["journey_type"].cat.set_categories(["dropoff", "pickup"])
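
The reshaping steps turn every trip into two points, one for the dropoff and one for the pickup, so the number of records to plot is twice the number of trips that survive the filter. A minimal plain-Python sketch of the same wide-to-long step follows; `trips_to_points` is a hypothetical helper and the coordinates are made up for illustration.

```python
def trips_to_points(trips):
    """Turn wide trip records into long (long, lat, journey_type) records.

    Mirrors the order of dd.concat([df_drop, df_pick]):
    all dropoffs first, then all pickups.
    """
    dropoffs = [
        {
            "long": t["dropoff_longitude"],
            "lat": t["dropoff_latitude"],
            "journey_type": "dropoff",
        }
        for t in trips
    ]
    pickups = [
        {
            "long": t["pickup_longitude"],
            "lat": t["pickup_latitude"],
            "journey_type": "pickup",
        }
        for t in trips
    ]
    return dropoffs + pickups


# Made-up coordinates, for illustration only.
trips = [
    {"pickup_longitude": -73.99, "pickup_latitude": 40.73,
     "dropoff_longitude": -73.95, "dropoff_latitude": 40.78},
    {"pickup_longitude": -73.97, "pickup_latitude": 40.76,
     "dropoff_longitude": -74.00, "dropoff_latitude": 40.72},
]

points = trips_to_points(trips)
print(len(trips), "trips ->", len(points), "points")
```

Every output record has the same three keys regardless of where it came from, which is what lets a single plot color points by `journey_type`.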

# partitions are small, so repartition into larger ones
df_plot = df_plot.persist()
df_plot = df_plot.repartition(partition_size="256MiB").persist()

print("Number of records:", len(df_plot))
Number of records: 1693136554
CPU times: user 5.65 s, sys: 1.27 s, total: 6.92 s
Wall time: 1min 3s
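
As a rough back-of-the-envelope check on the repartitioning step, the sketch below estimates how many 256 MiB partitions the persisted data might occupy. It assumes 8 bytes for each coordinate (float64) and 1 byte per row for the categorical code, and it ignores the index and any other overhead, so treat the result as an order-of-magnitude estimate rather than what Dask will actually report.

```python
import math

n_records = 1_693_136_554        # record count printed by the notebook
bytes_per_row = 8 + 8 + 1        # assumed: two float64 columns + one int8 category code
partition_bytes = 256 * 1024**2  # 256 MiB, the target passed to repartition

total_bytes = n_records * bytes_per_row
n_partitions = math.ceil(total_bytes / partition_bytes)

print(f"approx. {total_bytes / 1024**3:.1f} GiB in about {n_partitions} partitions")
```

Under these assumptions the data fits in on the order of a hundred partitions, which is far fewer tasks to schedule than many small partitions would need.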


import holoviews as hv

color_key = {"pickup": "#EF1561", "dropoff": "#1F5AFF"}