Deploy web applications with Streamlit#

Streamlit is an open-source Python library that makes it easy to create and share custom web apps for machine learning and data science. Coiled helps scale Python workloads by provisioning cloud-hosted Dask clusters on demand.

Coiled and Streamlit work great together: Streamlit handles the frontend layout and interactivity of your web application, while Coiled handles the backend infrastructure for demanding computations. Since Coiled works anywhere you can run Python, you can use it while developing Streamlit apps on your laptop or while interacting with a hosted Streamlit application, all without having to download your data or change the way you work.

[Screenshot: the example Coiled and Streamlit app]

Before you start#

You’ll first need to create consistent local and remote software environments with dask, coiled, and the necessary dependencies installed. If you are unfamiliar with creating software environments, you can first follow the tutorial on setting up a custom software environment.

First, you will install folium, streamlit, streamlit-folium (which the example script imports), and coiled-runtime, a Dask meta-package. Save the following file as environment.yml, replacing <x.x.x> with the versions you would like to use. You can get the most up-to-date version of coiled-runtime from the latest tag in the public coiled-runtime repository.

channels:
  - conda-forge
dependencies:
  - folium=<x.x.x>
  - streamlit=<x.x.x>
  - streamlit-folium=<x.x.x>
  - coiled-runtime=<x.x.x>
  - python=3.9

Next, create a local software environment using the environment.yml file:

$ conda env create -f environment.yml -n streamlit-example
$ conda activate streamlit-example

Lastly, create a remote software environment using the environment.yml file:

$ coiled env create -n streamlit-example --conda environment.yml
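
To confirm that the remote environment was created, you can list the software environments in your Coiled account from Python. This is an optional sanity check; it assumes coiled.list_software_environments is available in your version of the coiled client:

import coiled

# List the software environments registered in your Coiled account;
# "streamlit-example" should appear if the step above succeeded.
print(coiled.list_software_environments())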

Coiled + Streamlit#

The example below uses Coiled and Streamlit to read more than 146 million records from the NYC Taxi dataset and visualize locations of taxi pickups and dropoffs. In this guide, you’ll learn how to use Coiled to:

  1. Read in Parquet files from an Amazon S3 bucket.

  2. Filter the dataset based on user inputs.

  3. Display the results on a folium map.

You’ll start the Streamlit app locally, but the computations to load, filter, and generate the folium map will run on a Dask cluster in the cloud via Coiled.

import coiled
import dask
import dask.dataframe as dd
import folium
import streamlit as st
from dask.distributed import Client
from folium.plugins import HeatMap
from streamlit_folium import folium_static

# Text in Streamlit
st.header("Coiled and Streamlit")
st.subheader("Analyzing Large Datasets with Coiled and Streamlit")
st.write(
    """
    The computations for this Streamlit app are powered by Coiled, which
    provides on-demand, hosted Dask clusters in the cloud. Change the options
    below to view different visualizations of transportation pickups/dropoffs,
    then let Coiled handle all of the infrastructure and compute.
    """
)

# Interactive widgets in Streamlit
taxi_mode = st.selectbox("Taxi pickup or dropoff?", ("Pickups", "Dropoffs"))
num_passengers = st.slider("Number of passengers", 0, 9, (0, 9))

# Start and connect to Coiled cluster
cluster_state = st.empty()


@st.cache(allow_output_mutation=True)
def get_client():
    cluster_state.write("Starting or connecting to Coiled cluster...")
    cluster = coiled.Cluster(
        n_workers=5, name="coiled-streamlit", software="streamlit-example"
    )
    client = Client(cluster)
    return client


client = get_client()
if client.status == "closed":
    # In a long-running Streamlit app,
    # the cluster could have shut down from idleness.
    # If so, clear the Streamlit cache to restart it.
    st.caching.clear_cache()
    client = get_client()
cluster_state.write(f"Coiled cluster is up! ({client.dashboard_link})")


# Load data (runs on Coiled)
@st.cache(hash_funcs={dd.DataFrame: dask.base.tokenize})
def load_data():
    return (
        dd.read_parquet(
            "s3://coiled-datasets/nyc-tlc/2019",
            columns=[
                "passenger_count",
                "pickup_longitude",
                "pickup_latitude",
                "dropoff_longitude",
                "dropoff_latitude",
            ],
            storage_options={"anon": True},
            blocksize="16 MiB",
        )
        .dropna()
        .persist()
    )


df = load_data()

# Filter data based on inputs (runs on Coiled)
with st.spinner("Calculating map data..."):
    map_data = df[
        (df["passenger_count"] >= num_passengers[0])
        & (df["passenger_count"] <= num_passengers[1])
    ]

    if taxi_mode == "Pickups":
        map_data = map_data.iloc[:, [2, 1]]
    elif taxi_mode == "Dropoffs":
        map_data = map_data.iloc[:, [4, 3]]

    map_data.columns = ["lat", "lon"]
    map_data = map_data.loc[~(map_data == 0).all(axis=1)]
    map_data = map_data.head(500)

# Display map in Streamlit
st.subheader("Map of selected rides")
m = folium.Map([40.76, -73.95], tiles="cartodbpositron", zoom_start=12)
HeatMap(map_data).add_to(folium.FeatureGroup(name="Heat Map").add_to(m))
folium_static(m)

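To run the app yourself, save the script above locally (the filename streamlit_app.py below is just an example) and launch it with the Streamlit CLI:

$ streamlit run streamlit_app.py

Streamlit will print a local URL where you can interact with the app while Coiled runs the computations remotely.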

How Coiled helps#

Coiled comes into play in the following parts of the example, allowing you to easily scale the backend resources available to the Streamlit app.

First, create your Coiled cluster:

@st.cache(allow_output_mutation=True)
def get_client():
    cluster_state.write("Starting or connecting to Coiled cluster...")
    cluster = coiled.Cluster(
        n_workers=5, name="coiled-streamlit", software="streamlit-example"
    )
    client = Client(cluster)
    return client


client = get_client()
if client.status == "closed":
    # In a long-running Streamlit app,
    # the cluster could have shut down from idleness.
    # If so, clear the Streamlit cache to restart it.
    st.caching.clear_cache()
    client = get_client()

Because the cluster-creating function is decorated with @st.cache(allow_output_mutation=True), the Streamlit app reuses the same connection to the cluster instead of reconnecting every time the app’s state changes (see the caching section in the Streamlit documentation). Additionally, by passing a name to your Coiled cluster in cluster = coiled.Cluster(name="coiled-streamlit"), you can reconnect to an existing cluster as viewers of your app come and go (see Reusing clusters).
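
For example, here is a minimal sketch of name-based reuse: connecting with the same name from any session attaches to the running cluster rather than provisioning a new one.

import coiled
from dask.distributed import Client

# Because a cluster named "coiled-streamlit" is already running,
# Coiled reconnects to it instead of creating a new cluster.
cluster = coiled.Cluster(name="coiled-streamlit")
client = Client(cluster)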

Note

Coiled will shut down your cluster after 20 minutes of inactivity by default to save on compute costs when your Streamlit app is not in use. However, if you use @st.cache on the Coiled cluster and expect your Streamlit app to run for a long time, you should add an if client.status == "closed" check as shown in the code example above, which will recreate the cluster if it has shut down. If you don’t use @st.cache, then this check is not necessary, since a new Coiled cluster will be created automatically when another user visits your Streamlit app.
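
If the default idle window is too short for your app, recent versions of coiled.Cluster accept an idle_timeout argument. The following is a sketch under that assumption; check the Coiled API reference for your version:

import coiled

# Assumption: your version of coiled supports idle_timeout.
# This keeps the cluster alive through longer gaps between viewers.
cluster = coiled.Cluster(
    n_workers=5,
    name="coiled-streamlit",
    software="streamlit-example",
    idle_timeout="2 hours",
)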

Next, load the dataset using Dask on your Coiled cluster:

# Load data (runs on Coiled)
@st.cache(hash_funcs={dd.DataFrame: dask.base.tokenize})
def load_data():
    return (
        dd.read_parquet(
            "s3://coiled-datasets/nyc-tlc/2019",
            columns=[
                "passenger_count",
                "pickup_longitude",
                "pickup_latitude",
                "dropoff_longitude",
                "dropoff_latitude",
            ],
            storage_options={"anon": True},
            blocksize="16 MiB",
        )
        .dropna()
        .persist()
    )


Here, you used .persist() to store the dataset in memory on the Coiled cluster. This helps optimize the performance of the Streamlit app by preventing expensive computations from rerunning each time the app is updated. You also used @st.cache to optimize performance when calling functions that preload or precompute data with Dask, or other computations that only need to run once. Note that @st.cache does not know how to tell whether two Dask collections are identical. Therefore, to cache functions that return Dask collections, use @st.cache(hash_funcs={dd.DataFrame: dask.base.tokenize}), replacing dd.DataFrame with the datatype that the function actually returns.
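
As an illustration of that last point, here is a minimal sketch of the same caching pattern for a function that returns a Dask Series instead of a DataFrame (the function name and column choice are illustrative):

import dask
import dask.dataframe as dd
import streamlit as st


@st.cache(hash_funcs={dd.Series: dask.base.tokenize})
def load_passenger_counts():
    df = dd.read_parquet(
        "s3://coiled-datasets/nyc-tlc/2019",
        columns=["passenger_count"],
        storage_options={"anon": True},
    )
    # The function returns a dd.Series, so hash_funcs maps dd.Series
    # (rather than dd.DataFrame) to dask.base.tokenize.
    return df["passenger_count"].persist()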

Finally, filter the dataset using Dask on your Coiled cluster:

df = load_data()

# Filter data based on inputs (runs on Coiled)
with st.spinner("Calculating map data..."):
    map_data = df[
        (df["passenger_count"] >= num_passengers[0])
        & (df["passenger_count"] <= num_passengers[1])
    ]

    if taxi_mode == "Pickups":
        map_data = map_data.iloc[:, [2, 1]]
    elif taxi_mode == "Dropoffs":
        map_data = map_data.iloc[:, [4, 3]]

    map_data.columns = ["lat", "lon"]
    map_data = map_data.loc[~(map_data == 0).all(axis=1)]

By using st.spinner("Calculating map data..."), you display a helpful message while the filtering computations run on the cluster.

Next steps#

To learn more about how to use Streamlit with Coiled, check out this Coiled blog post.