scikit-learn with Coiled Functions

scikit-learn is a library for predictive data analysis. In this guide, you'll learn how to take advantage of scikit-learn's built-in parallelism by training a model on a cloud VM with many cores using Coiled Serverless Functions.

Before you start

You’ll first need to install the necessary packages. For the purposes of this example, we’ll do this in a new virtual environment, but you could also install them in whatever environment you’re already using for your project.

conda create -n coiled-sklearn-example -c conda-forge python=3.10 coiled scikit-learn
conda activate coiled-sklearn-example

You could also use pip for everything, or any other package manager you prefer; conda isn't required.
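For example, a pip-based setup might look like the following (the virtual environment name is just illustrative):

python -m venv coiled-sklearn-example
source coiled-sklearn-example/bin/activate
pip install coiled scikit-learn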

When you run your code with Coiled Functions, Coiled will automatically replicate your local coiled-sklearn-example environment on the cloud VM that runs your function.

About the Data

This example uses randomly generated data so that you can easily reproduce the results.
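As a quick illustration, this is the kind of dataset the training function below generates; the smaller n_samples here is only to keep a local preview fast:

from sklearn.datasets import make_classification

# Small local preview of the synthetic classification dataset
X, y = make_classification(n_samples=1_000, n_features=30, random_state=0, shuffle=False)
print(X.shape, y.shape)  # (1000, 30) (1000,)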

About the query

We will train a RandomForestClassifier and return the trained model to our local machine.

Dispatch the computation to a VM in the cloud

We decorate the function that does the work with a Coiled-specific decorator, which offloads the computation to the cloud.

import coiled

@coiled.function(
    vm_type="c6i.32xlarge", # 128 cores, 256 GB of RAM
    keepalive="5 minutes",  # keep alive to run multiple queries if necessary
)
def train():
    ...

This will offload our workload to an EC2 instance with 128 cores. scikit-learn can train many of its models on multiple cores (controlled through the n_jobs parameter), so we can benefit from a bigger machine.

Train the model

We will offload the training step to a cloud-hosted VM.

import coiled
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

@coiled.function(
    vm_type="c6i.32xlarge", # 128 cores, compute optimized
    keepalive="5 minutes",  # keep alive to train more models if necessary
)
def train():
    # Generate a large synthetic classification dataset on the cloud VM
    X, y = make_classification(n_samples=2_000_000, n_features=30, random_state=0, shuffle=False)
    # n_jobs=-1 uses all of the VM's cores for training
    clf = RandomForestClassifier(random_state=0, n_jobs=-1)
    clf.fit(X, y)
    return clf

clf = train()

This is where the decorator comes in: Coiled Functions offloads the computation to AWS and returns the trained model. We can then use the model locally for inference after offloading the expensive training step. The provisioned instance stays alive for 5 minutes so that we can reconnect if we need to train another model.
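For example, once the trained model is back on your local machine, you can run predictions without touching the cloud again; the input below is purely illustrative, and any array with the same 30 features will work:

import numpy as np

# Run inference locally with the model returned by train();
# X_new is illustrative data with the same 30 features used in training
X_new = np.random.default_rng(0).standard_normal((5, 30))
print(clf.predict(X_new))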