Visualize 1,000,000,000 Points#

In this notebook we process and interactively visualize over one billion points. You can download this jupyter notebook to follow along.

Before you start#

You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.

conda create -n coiled-datashader -c conda-forge python=3.10 coiled dask s3fs pyarrow datashader hvplot jupyter_bokeh
conda activate coiled-datashader

You also could use pip for everything, or any other package manager you prefer; conda isn’t required.

When you later create a Coiled cluster, your local coiled-datashader environment will be automatically replicated on your cluster.

Create Cluster#

import coiled

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",  # Start workers in same region as data to minimize costs
) 

client = cluster.get_client()


Load and prep data#

To start, let’s use Dask DataFrame’s read_parquet functionality to load our dataset from S3.

import dask.dataframe as dd

df = dd.read_parquet("s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/")
df.head()
vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude payment_type fare_amount surcharge tip_amount tolls_amount total_amount
0 VTS 2009-01-04 02:52:00 2009-01-04 03:02:00 1 2.63 -73.991957 40.721567 -73.993803 40.695922 CASH 8.9 0.5 0.00 0.0 9.40
1 VTS 2009-01-04 03:31:00 2009-01-04 03:38:00 3 4.55 -73.982102 40.736290 -73.955850 40.768030 Credit 12.1 0.5 2.00 0.0 14.60
2 VTS 2009-01-03 15:43:00 2009-01-03 15:57:00 5 10.35 -74.002587 40.739748 -73.869983 40.770225 Credit 23.7 0.0 4.74 0.0 28.44
3 DDS 2009-01-01 20:52:58 2009-01-01 21:14:00 1 5.00 -73.974267 40.790955 -73.996558 40.731849 CREDIT 14.9 0.5 3.05 0.0 18.45
4 DDS 2009-01-24 16:18:23 2009-01-24 16:24:56 1 0.40 -74.001580 40.719382 -74.008378 40.720350 CASH 3.7 0.0 0.00 0.0 3.70

Next, we’ll select ride data that’s near NYC.

df = df.loc[
    (df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) & 
    (df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9) &
    (df.pickup_longitude > -74.1) & (df.pickup_longitude < -73.7) &
    (df.pickup_latitude > 40.6) & (df.pickup_latitude < 40.9)
]

We now have all the information we need, but still need to format our data so it can be easily visualized. We do this by splitting each ride into two rows, one with the pickup location and one with the dropoff location.

# Dropoff locations
df_drop = df[["dropoff_longitude", "dropoff_latitude"]]
df_drop["journey_type"] = "dropoff"
df_drop = df_drop.rename(columns={'dropoff_longitude': 'lon', 'dropoff_latitude': 'lat'})

# Pickup locations
df_pick = df[["pickup_longitude", "pickup_latitude"]]
df_pick["journey_type"] = "pickup"
df_pick = df_pick.rename(columns={'pickup_longitude': 'lon', 'pickup_latitude': 'lat'})

# Combine into single DataFrame
df = dd.concat([df_drop, df_pick])

# Convert `journey_type` column to categorical (needed for plotting below)
df = df.astype({"journey_type": "category"})
df["journey_type"] = df["journey_type"].cat.set_categories(["dropoff", "pickup"])

Finally, we repartition for efficient follow-up computations and load the dataset onto our cluster with persist().

%%time

df = df.repartition(partition_size="256 MiB").persist()
print(f"Number of records: {len(df)}")
Number of records: 1693136554
CPU times: user 3.63 s, sys: 580 ms, total: 4.21 s
Wall time: 1min 26s

We’re now ready to visualize this data.

Visualize#

import datashader
import holoviews as hv
import hvplot.dask
hv.extension('bokeh')

df.hvplot.scatter(
    x="lon", 
    y="lat", 
    aggregator=datashader.by("journey_type"), 
    datashade=True, 
    cnorm="eq_hist",
    frame_width=700, 
    aspect=1.33, 
    color_key={"pickup": "#EF1561", "dropoff": "#1F5AFF"},
)

Conclusion#

Dask DataFrame and Datashader together make it easy to visualize very large scale data. Coiled makes it easy to scale up computing resources in the cloud.

If you have a cloud account it’s easy to try this out yourself. The data is public and this computation costs less than $0.20 in AWS cloud costs. You can sign up under the Coiled free tier here (setup takes a couple minutes).