Posts by Sarah Johnson

DataFrames at Scale Comparison: TPC-H

Hendrik Makait, Sarah Johnson, Matthew Rocklin

Read more ...


Dask vs. Spark

Sarah Johnson, Florian Jetter

Bar chart comparing the relative difference in TPC-H query runtime for Dask vs. PySpark when executed on an M1 MacBook Pro with 8 cores. Orange represents queries where Dask is faster and blue where PySpark is faster.

Read more ...


One Trillion Row Challenge

Sarah Johnson

Read more ...


One Billion Row Challenge (1BRC) in Python with Dask

Sarah Johnson

Read more ...


How to Run Jupyter Notebooks on a GPU on the Cloud

Sarah Johnson

Read more ...


Processing a 250 TB dataset with Coiled, Dask, and Xarray

We processed 250 TB of geospatial cloud data in twenty minutes on the cloud with Xarray, Dask, and Coiled. We did this to demonstrate scale and to think about costs.

County-level heat map of the continental US showing mean depth to soil saturation (in meters) in 2020.

Read more ...


How well does Dask run on Graviton?

Sarah Johnson, Nat Tabris

Bar chart of AWS cost vs. processor type.

Read more ...


Just in time Python environments

Docker is a great tool for creating portable software environments, but we’ve found it too slow for interactive exploration. Clusters that depend on Docker images often take 5+ minutes to launch. Ouch.


Read more ...