Visualize 1,000,000,000 Points#
In this notebook we process and interactively visualize over one billion points. You can download this Jupyter notebook to follow along.
Before you start#
You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.
conda create -n coiled-datashader -c conda-forge python=3.10 coiled dask s3fs pyarrow datashader hvplot jupyter_bokeh
conda activate coiled-datashader
You could also use pip for everything, or any other package manager you prefer; conda isn't required.
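For reference, a roughly equivalent pip-based setup might look like the following (a sketch; the PyPI package names are assumptions based on the conda-forge names above):
python -m venv coiled-datashader
source coiled-datashader/bin/activate
pip install coiled "dask[complete]" s3fs pyarrow datashader hvplot jupyter_bokeh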
When you later create a Coiled cluster, your local coiled-datashader
environment will be automatically replicated on your cluster.
Create Cluster#
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",  # Start workers in the same region as the data to minimize costs
)
client = cluster.get_client()
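If you want to watch the work as it runs, the address of the Dask dashboard is available on the client. This step is optional and purely for monitoring:
# Optional: print the Dask dashboard URL to monitor tasks, memory, and workers
print(client.dashboard_link)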
Load and prep data#
To start, let’s use Dask DataFrame’s read_parquet
functionality to load our dataset from S3.
import dask.dataframe as dd
df = dd.read_parquet("s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/")
df.head()
|   | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | tip_amount | tolls_amount | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VTS | 2009-01-04 02:52:00 | 2009-01-04 03:02:00 | 1 | 2.63 | -73.991957 | 40.721567 | -73.993803 | 40.695922 | CASH | 8.9 | 0.5 | 0.00 | 0.0 | 9.40 |
| 1 | VTS | 2009-01-04 03:31:00 | 2009-01-04 03:38:00 | 3 | 4.55 | -73.982102 | 40.736290 | -73.955850 | 40.768030 | Credit | 12.1 | 0.5 | 2.00 | 0.0 | 14.60 |
| 2 | VTS | 2009-01-03 15:43:00 | 2009-01-03 15:57:00 | 5 | 10.35 | -74.002587 | 40.739748 | -73.869983 | 40.770225 | Credit | 23.7 | 0.0 | 4.74 | 0.0 | 28.44 |
| 3 | DDS | 2009-01-01 20:52:58 | 2009-01-01 21:14:00 | 1 | 5.00 | -73.974267 | 40.790955 | -73.996558 | 40.731849 | CREDIT | 14.9 | 0.5 | 3.05 | 0.0 | 18.45 |
| 4 | DDS | 2009-01-24 16:18:23 | 2009-01-24 16:24:56 | 1 | 0.40 | -74.001580 | 40.719382 | -74.008378 | 40.720350 | CASH | 3.7 | 0.0 | 0.00 | 0.0 | 3.70 |
Next, we’ll select ride data that’s near NYC.
df = df.loc[
    (df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) &
    (df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9) &
    (df.pickup_longitude > -74.1) & (df.pickup_longitude < -73.7) &
    (df.pickup_latitude > 40.6) & (df.pickup_latitude < 40.9)
]
We now have all the information we need, but still need to format our data so it can be easily visualized. We do this by splitting each ride into two rows, one with the pickup location and one with the dropoff location.
# Dropoff locations
df_drop = df[["dropoff_longitude", "dropoff_latitude"]]
df_drop["journey_type"] = "dropoff"
df_drop = df_drop.rename(columns={'dropoff_longitude': 'lon', 'dropoff_latitude': 'lat'})
# Pickup locations
df_pick = df[["pickup_longitude", "pickup_latitude"]]
df_pick["journey_type"] = "pickup"
df_pick = df_pick.rename(columns={'pickup_longitude': 'lon', 'pickup_latitude': 'lat'})
# Combine into single DataFrame
df = dd.concat([df_drop, df_pick])
# Convert `journey_type` column to categorical (needed for plotting below)
df = df.astype({"journey_type": "category"})
df["journey_type"] = df["journey_type"].cat.set_categories(["dropoff", "pickup"])
Finally, we repartition for efficient follow-up computations and load the dataset onto our cluster with persist().
%%time
df = df.repartition(partition_size="256 MiB").persist()
print(f"Number of records: {len(df)}")
Number of records: 1693136554
CPU times: user 3.63 s, sys: 580 ms, total: 4.21 s
Wall time: 1min 26s
We’re now ready to visualize this data.
Visualize#
import datashader
import holoviews as hv
import hvplot.dask
hv.extension('bokeh')
df.hvplot.scatter(
    x="lon",
    y="lat",
    aggregator=datashader.by("journey_type"),
    datashade=True,
    cnorm="eq_hist",
    frame_width=700,
    aspect=1.33,
    color_key={"pickup": "#EF1561", "dropoff": "#1F5AFF"},
)
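The scatter call returns an interactive Bokeh plot that Datashader re-renders as you pan and zoom. If you want to keep a standalone copy, one option (a sketch; the variable name and output filename are arbitrary) is to assign the plot to a variable and write it out with hvplot.save. Note that the saved HTML file is a static snapshot and won't re-aggregate on zoom the way the live notebook view does.
# Sketch: save a static snapshot of the datashaded plot to a standalone HTML file
plot = df.hvplot.scatter(
    x="lon",
    y="lat",
    aggregator=datashader.by("journey_type"),
    datashade=True,
    cnorm="eq_hist",
    color_key={"pickup": "#EF1561", "dropoff": "#1F5AFF"},
)
hvplot.save(plot, "nyc_taxi_pickups_dropoffs.html")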
Conclusion#
Dask DataFrame and Datashader together make it easy to visualize data at very large scale, and Coiled makes it easy to scale up computing resources in the cloud.
If you have a cloud account you can try this out yourself: the data is public and this computation costs less than $0.20 in AWS cloud costs. You can sign up for the Coiled free tier here (setup takes a couple of minutes).
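When you're finished, shutting the cluster down stops the cloud instances (and the associated costs); Coiled also cleans up idle clusters automatically after a timeout. A minimal cleanup step looks like this:
# Shut down the workers and scheduler when you're done to stop incurring costs
client.close()
cluster.shutdown()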