Visualize 1,000,000,000 Points#

In this notebook we process well over a billion points (about 1.7 billion taxi pickup and dropoff locations) and set them up for interactive visualization. You can download this Jupyter notebook to follow along.

Before you start#

You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new conda environment, but you could also install them in whatever environment you’re already using for your project.

conda create -n coiled-datashader -c conda-forge python=3.10 coiled dask s3fs pyarrow datashader hvplot jupyter_bokeh
conda activate coiled-datashader

You could also use pip for everything, or any other package manager you prefer; conda isn’t required.
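For example, a rough pip equivalent might look like the following (assuming the same package names are available on PyPI):

pip install coiled "dask[complete]" s3fs pyarrow datashader hvplot jupyter_bokeh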

When you later create a Coiled cluster, your local coiled-datashader environment will be automatically replicated on your cluster.

import dask.dataframe as dd
import datashader
import hvplot.dask  # registers the .hvplot accessor on Dask DataFrames
import coiled
from dask.distributed import Client, wait

Create Cluster#

cluster = coiled.Cluster(
    n_workers=20,
    name="datashader",
    region="us-east-2",  # start workers close to data to minimize costs
) 

client = cluster.get_client()
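If you’d like to watch the computation as it runs, the Dask client exposes a link to the cluster’s dashboard:

print(client.dashboard_link)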

Load data#

%%time

df = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/",
    columns=["dropoff_longitude", "dropoff_latitude", "pickup_longitude", "pickup_latitude"]
)

# clean data to limit to lat-longs near nyc
df = df.loc[
    (df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) & 
    (df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9) &
    (df.pickup_longitude > -74.1) & (df.pickup_longitude < -73.7) &
    (df.pickup_latitude > 40.6) & (df.pickup_latitude < 40.9)
]

# build a DataFrame with just the dropoff locations
df_drop = df[["dropoff_longitude", "dropoff_latitude"]]
df_drop["journey_type"] = "dropoff"
df_drop = df_drop.rename(columns={'dropoff_longitude': 'long', 'dropoff_latitude': 'lat'})


# now do the same for pickups
df_pick = df[["pickup_longitude", "pickup_latitude"]]
df_pick["journey_type"] = "pickup"
df_pick = df_pick.rename(columns={'pickup_longitude': 'long', 'pickup_latitude': 'lat'})

# concatenate two dask dataframes
df_plot = dd.concat([df_drop, df_pick])

# make journey_type a categorical column with known categories,
# which datashader's categorical aggregation needs
df_plot = df_plot.astype({"journey_type": "category"})
df_plot["journey_type"] = df_plot["journey_type"].cat.set_categories(["dropoff", "pickup"])

# the filtered partitions are small, so persist the data on the cluster,
# then repartition into larger ~256 MiB chunks and persist again
df_plot = df_plot.persist()
df_plot = df_plot.repartition(partition_size="256MiB").persist()

print("Number of records:", len(df_plot))
Number of records: 1693136554
CPU times: user 5.65 s, sys: 1.27 s, total: 6.92 s
Wall time: 1min 3s
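If you want to confirm that the repartitioning produced reasonably sized chunks, you can check the partition count and per-partition memory usage (optional, and it triggers a small extra computation):

print("Number of partitions:", df_plot.npartitions)
print(df_plot.memory_usage_per_partition(deep=True).compute().describe())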

Visualize#

import holoviews as hv
hv.extension('bokeh')

color_key = {"pickup": "#EF1561", "dropoff": "#1F5AFF"}

df_plot.hvplot.scatter(
    x="long", 
    y="lat", 
    aggregator=datashader.by("journey_type"), 
    datashade=True, 
    cnorm="eq_hist",
    frame_width=700, 
    aspect=1.33, 
    color_key=color_key
)
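The same aggregation can also be rendered as a static image with datashader directly, which is handy if you want a snapshot rather than the interactive Bokeh plot. This is a minimal sketch; the canvas resolution and background color are arbitrary choices:

import datashader.transfer_functions as tf

# rasterize the points, counting pickups and dropoffs separately per pixel
canvas = datashader.Canvas(plot_width=800, plot_height=600)
agg = canvas.points(df_plot, "long", "lat", agg=datashader.by("journey_type"))

# shade with the same color key and histogram-equalized normalization
img = tf.shade(agg, color_key=color_key, how="eq_hist")
tf.set_background(img, "black")

When you’re done exploring, shut down the cluster so you stop paying for the workers:

cluster.shutdown()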