How to Run Jupyter Notebooks on a GPU on the Cloud

Training machine learning models is often significantly faster on GPUs than on CPUs. In this step-by-step tutorial, we'll use PyTorch to train a neural network on a GPU in the cloud using Coiled notebooks.

Start your Jupyter Notebook on a GPU

You’ll first have to install Coiled locally. You can use pip or conda to install coiled alongside additional notebook dependencies like jupyterlab.

# with conda
$ conda install -c conda-forge coiled jupyterlab jupyter-server-proxy

# or with pip
$ pip install "coiled[notebook]"

Then, run the following command to start a JupyterLab instance on a GPU-enabled VM on the cloud:

coiled notebook start \
    --vm-type g5.xlarge \
    --container coiled/gpu-examples:latest \
    --region us-west-2 
Screencast: starting a Jupyter notebook on a GPU with `coiled notebook start`.

We used a few different arguments:

  • --vm-type g5.xlarge to request a g5.xlarge AWS EC2 instance, which has a single NVIDIA A10G GPU with 24 GiB of GPU memory.

  • --container coiled/gpu-examples:latest to use this publicly available Docker image, which has the necessary packages like the CUDA toolkit, PyTorch, and Optuna already installed (see the Dockerfile for details). Alternatively, you can use pip or conda to create a Python environment with your necessary dependencies.

  • --region us-west-2 to start the VM in the US West (Oregon) AWS region. We find GPUs are usually easier to get there.

See our documentation for more details.
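
Once the notebook is running, it's worth confirming that PyTorch can actually see the GPU before training anything. Here's a minimal sanity check, assuming PyTorch is installed in your environment (it is in the coiled/gpu-examples image):

import torch

# Confirm that a CUDA-capable GPU is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name and total memory of the first (and, on g5.xlarge, only) GPU.
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("GPU memory (GiB):", round(props.total_memory / 1024**3, 1))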

Define the PyTorch neural network

Now that we have a notebook running, we can define the model. We modified this example from the Optuna examples GitHub repo.

In this example, we optimize the validation accuracy of fashion product recognition using PyTorch and the FashionMNIST dataset. We optimize the neural network architecture as well as the optimizer configuration. For demonstration purposes, we use a subset of the FashionMNIST dataset.

import os
import optuna
from optuna.trial import TrialState
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
from torchvision import datasets, transforms


BATCHSIZE = 128
CLASSES = 10
EPOCHS = 10
N_TRAIN_EXAMPLES = BATCHSIZE * 30
N_VALID_EXAMPLES = BATCHSIZE * 10


def define_model(trial):
    # We optimize the number of layers, hidden units and dropout ratio in each layer.
    n_layers = trial.suggest_int("n_layers", 1, 3)
    layers = []

    in_features = 28 * 28
    for i in range(n_layers):
        out_features = trial.suggest_int("n_units_l{}".format(i), 4, 128)
        layers.append(nn.Linear(in_features, out_features))
        layers.append(nn.ReLU())
        p = trial.suggest_float("dropout_l{}".format(i), 0.2, 0.5)
        layers.append(nn.Dropout(p))

        in_features = out_features
    layers.append(nn.Linear(in_features, CLASSES))
    layers.append(nn.LogSoftmax(dim=1))

    return nn.Sequential(*layers)


def get_mnist():
    # Load FashionMNIST dataset.
    train_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST(
            os.getcwd(), train=True, download=True,
            transform=transforms.ToTensor()),
        batch_size=BATCHSIZE,
        shuffle=True,
    )
    valid_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST(os.getcwd(), train=False, transform=transforms.ToTensor()),
        batch_size=BATCHSIZE,
        shuffle=True,
    )

    return train_loader, valid_loader


def objective(trial):
    # requires a GPU to run
    DEVICE = torch.device("cuda")

    # Generate the model.
    model = define_model(trial).to(DEVICE)

    # Generate the optimizers.
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    optimizer = getattr(optim, optimizer_name)(model.parameters(), lr=lr)

    # Get the FashionMNIST dataset.
    train_loader, valid_loader = get_mnist()

    # Training of the model.
    for epoch in range(EPOCHS):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            # Limiting training data for faster epochs.
            if batch_idx * BATCHSIZE >= N_TRAIN_EXAMPLES:
                break

            data, target = data.view(data.size(0), -1).to(DEVICE), target.to(DEVICE)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()

        # Validation of the model.
        model.eval()
        correct = 0
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(valid_loader):
                # Limiting validation data.
                if batch_idx * BATCHSIZE >= N_VALID_EXAMPLES:
                    break
                data, target = data.view(data.size(0), -1).to(DEVICE), target.to(DEVICE)
                output = model(data)
                # Get the index of the max log-probability.
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.view_as(pred)).sum().item()

        accuracy = correct / min(len(valid_loader.dataset), N_VALID_EXAMPLES)

        trial.report(accuracy, epoch)

        # Handle pruning based on the intermediate value.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return accuracy

Optimize with Optuna

We’ll train the model and use Optuna to find the hyperparameters that result in the best model predictions. We train the model five times (n_trials=5), each time with a different set of hyperparameters.

import optuna

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5, timeout=600, show_progress_bar=True)

The runtime is about 25 seconds. We can scale this up and run 100 trials, which takes about 4 minutes 20 seconds.

study.optimize(objective, n_trials=100, timeout=600, show_progress_bar=True)

Now we can analyze the results to find the best set of parameters.

pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])

print("Study statistics: ")
print("  Number of finished trials: ", len(study.trials))
print("  Number of pruned trials: ", len(pruned_trials))
print("  Number of complete trials: ", len(complete_trials))

print("Best trial:")
trial = study.best_trial

print("  Value: ", trial.value)

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

This returns the following output:

Study statistics: 
  Number of finished trials:  100
  Number of pruned trials:  61
  Number of complete trials:  39
Best trial:
  Value:  0.84609375
  Params: 
    n_layers: 1
    n_units_l0: 109
    dropout_l0: 0.3822970315388142
    optimizer: Adam
    lr: 0.007778083042789732

Looks like the best validation accuracy across the 100 trials is about 0.846.
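
If you’d like a visual summary of the search, Optuna also ships plotting helpers. A quick sketch, assuming plotly is installed in your environment (it may not be in the coiled/gpu-examples image):

# Plot how the objective value evolved over the 100 trials.
fig = optuna.visualization.plot_optimization_history(study)
fig.show()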

You can monitor GPU resource usage during the computation in the Coiled UI, which displays metrics like GPU memory and utilization.
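
If you prefer checking from inside the notebook, PyTorch also exposes basic GPU memory counters. A small sketch:

import torch

# GPU memory currently and at peak allocated by this PyTorch process.
print("Allocated (MiB):", torch.cuda.memory_allocated() // 1024**2)
print("Peak allocated (MiB):", torch.cuda.max_memory_allocated() // 1024**2)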

Next steps

In this example, we used Coiled notebooks to run a simple PyTorch model in a Jupyter notebook on a GPU in the cloud. It cost ~$0.10 and took ~4 minutes to train the model 100 times. Though this example uses PyTorch, you could just as easily use another deep learning library like Keras or TensorFlow instead.

If you’d like to run this example yourself, you can get started with Coiled at coiled.io/start. This notebook is available in the coiled/examples repo and runs well within the Coiled free tier (though you’ll still need to pay your cloud provider).