Visualize 1,000,000,000 Points
In this notebook we process roughly one billion points and prepare them for interactive visualization. You can download this Jupyter notebook to follow along.
Before you start
You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.
conda create -n coiled-datashader -c conda-forge python=3.10 coiled dask s3fs pyarrow datashader hvplot jupyter_bokeh
conda activate coiled-datashader
You could also use pip for everything, or any other package manager you prefer; conda isn’t required.
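If you go the pip route, a rough equivalent could look like the following (an illustrative sketch, not part of the original setup; dask[complete] pulls in the DataFrame and distributed-scheduler dependencies):

python -m venv coiled-datashader
source coiled-datashader/bin/activate
pip install coiled "dask[complete]" s3fs pyarrow datashader hvplot jupyter_bokeh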
When you later create a Coiled cluster, your local coiled-datashader environment will be automatically replicated on your cluster.
import dask.dataframe as dd
import datashader
import hvplot.dask  # registers the .hvplot accessor on Dask DataFrames
import coiled
from dask.distributed import Client, wait
Create Cluster
cluster = coiled.Cluster(
n_workers=20,
name="datashader",
region="us-east-2", # start workers close to data to minimize costs
)
client = cluster.get_client()
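If you want to watch the computation as it runs, you can open the Dask dashboard for the cluster (optional; this just prints a URL you can open in your browser):

print(client.dashboard_link)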
Load data
%%time
df = dd.read_parquet(
"s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/",
columns=["dropoff_longitude", "dropoff_latitude", "pickup_longitude", "pickup_latitude"]
)
# clean data: keep only points with latitude/longitude near NYC
df = df.loc[
(df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) &
(df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9) &
(df.pickup_longitude > -74.1) & (df.pickup_longitude < -73.7) &
(df.pickup_latitude > 40.6) & (df.pickup_latitude < 40.9)
]
# build a DataFrame with just the dropoff locations
df_drop = df[["dropoff_longitude", "dropoff_latitude"]]
df_drop["journey_type"] = "dropoff"
df_drop = df_drop.rename(columns={'dropoff_longitude': 'long', 'dropoff_latitude': 'lat'})
# now do the same for pickups
df_pick = df[["pickup_longitude", "pickup_latitude"]]
df_pick["journey_type"] = "pickup"
df_pick = df_pick.rename(columns={'pickup_longitude': 'long', 'pickup_latitude': 'lat'})
# concatenate two dask dataframes
df_plot = dd.concat([df_drop, df_pick])
# datashader's categorical aggregation needs a categorical dtype with known categories
df_plot = df_plot.astype({"journey_type": "category"})
df_plot["journey_type"] = df_plot["journey_type"].cat.set_categories(["dropoff", "pickup"])
# partitions are small, so repartition into larger ~256 MiB chunks;
# persist first so repartitioning works from data already in cluster memory
df_plot = df_plot.persist()
df_plot = df_plot.repartition(partition_size="256MiB").persist()
print("Number of records:", len(df_plot))
Number of records: 1693136554
CPU times: user 5.65 s, sys: 1.27 s, total: 6.92 s
Wall time: 1min 3s
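As an optional sanity check (not part of the original workflow), you can confirm that repartitioning produced a reasonable number of roughly 256 MiB partitions:

# number of partitions after repartitioning
print("Partitions:", df_plot.npartitions)

# approximate in-memory size of each partition, in MiB
partition_sizes = df_plot.map_partitions(
    lambda part: part.memory_usage(deep=True).sum()
).compute()
print((partition_sizes / 2**20).describe())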
Visualize
import holoviews as hv
hv.extension('bokeh')
color_key = {"pickup": "#EF1561", "dropoff": "#1F5AFF"}
df_plot.hvplot.scatter(
x="long",
y="lat",
aggregator=datashader.by("journey_type"),
datashade=True,
cnorm="eq_hist",
frame_width=700,
aspect=1.33,
color_key=color_key
)
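Once you’re done exploring the plot, you can shut the cluster down so the workers stop running (Coiled also shuts down idle clusters automatically after a timeout):

client.close()
cluster.shutdown()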