How to Run Jupyter Notebooks on a GPU on the Cloud#
You can often significantly reduce the time it takes to train machine learning models by using GPUs instead of CPUs. In this step-by-step tutorial, we’ll use PyTorch to train a neural network on a GPU in the cloud using Coiled notebooks.
Start your Jupyter Notebook on a GPU#
You’ll first have to install Coiled locally. You can use pip or conda to install coiled alongside additional notebook dependencies like jupyterlab.
$ conda install -c conda-forge coiled jupyterlab jupyter-server-proxy
$ pip install "coiled[notebook]"
Then, run the following command to start a JupyterLab instance on a GPU-enabled VM on the cloud:
coiled notebook start \
    --vm-type g5.xlarge \
    --container coiled/gpu-examples:latest \
    --region us-west-2
We used a few different arguments:

--vm-type g5.xlarge to request a g5.xlarge AWS EC2 instance, which has 1 NVIDIA A10G GPU with 24 GiB of GPU memory.

--container coiled/gpu-examples:latest to use this publicly available Docker image with the necessary packages installed, like the CUDA Toolkit, PyTorch, and Optuna (see the Dockerfile for details). Alternatively, you can use pip or conda to create a Python environment with your necessary dependencies (a sketch follows this list).

--region us-west-2 to start the VM in the US West (Oregon) AWS region, where we find GPUs are usually easier to get.
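If you go the build-your-own-environment route, one possible sketch is below. The environment name and package list are assumptions based on what this example imports, and Coiled can typically replicate a local environment like this on the cloud VM when no container is specified; exact channels, versions, and CUDA-enabled builds are up to you.

$ conda create -n gpu-notebook -c conda-forge coiled jupyterlab optuna pytorch torchvision
$ conda activate gpu-notebook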
See our documentation for more details.
Define the PyTorch neural network#
Now that we have a notebook running, we can define the model. We modified this example from the Optuna examples GitHub repo.
In this example, we optimize the validation accuracy of fashion product recognition using PyTorch and the FashionMNIST dataset. We optimize the neural network architecture as well as the optimizer configuration. For demonstration purposes, we use a subset of the FashionMNIST dataset.
import os
import optuna
from optuna.trial import TrialState
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
from torchvision import datasets, transforms
BATCHSIZE = 128
CLASSES = 10
EPOCHS = 10
N_TRAIN_EXAMPLES = BATCHSIZE * 30
N_VALID_EXAMPLES = BATCHSIZE * 10
def define_model(trial):
    # We optimize the number of layers, hidden units, and dropout ratio in each layer.
    n_layers = trial.suggest_int("n_layers", 1, 3)
    layers = []

    in_features = 28 * 28
    for i in range(n_layers):
        out_features = trial.suggest_int("n_units_l{}".format(i), 4, 128)
        layers.append(nn.Linear(in_features, out_features))
        layers.append(nn.ReLU())
        p = trial.suggest_float("dropout_l{}".format(i), 0.2, 0.5)
        layers.append(nn.Dropout(p))

        in_features = out_features

    layers.append(nn.Linear(in_features, CLASSES))
    layers.append(nn.LogSoftmax(dim=1))

    return nn.Sequential(*layers)
def get_mnist():
    # Load the FashionMNIST dataset.
    train_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST(
            os.getcwd(), train=True, download=True, transform=transforms.ToTensor()
        ),
        batch_size=BATCHSIZE,
        shuffle=True,
    )
    valid_loader = torch.utils.data.DataLoader(
        datasets.FashionMNIST(os.getcwd(), train=False, transform=transforms.ToTensor()),
        batch_size=BATCHSIZE,
        shuffle=True,
    )

    return train_loader, valid_loader
def objective(trial):
    # Requires a GPU to run.
    DEVICE = torch.device("cuda")

    # Generate the model.
    model = define_model(trial).to(DEVICE)

    # Generate the optimizers.
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    optimizer = getattr(optim, optimizer_name)(model.parameters(), lr=lr)

    # Get the FashionMNIST dataset.
    train_loader, valid_loader = get_mnist()

    # Training of the model.
    for epoch in range(EPOCHS):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            # Limiting training data for faster epochs.
            if batch_idx * BATCHSIZE >= N_TRAIN_EXAMPLES:
                break

            data, target = data.view(data.size(0), -1).to(DEVICE), target.to(DEVICE)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()

        # Validation of the model.
        model.eval()
        correct = 0
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(valid_loader):
                # Limiting validation data.
                if batch_idx * BATCHSIZE >= N_VALID_EXAMPLES:
                    break
                data, target = data.view(data.size(0), -1).to(DEVICE), target.to(DEVICE)
                output = model(data)
                # Get the index of the max log-probability.
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.view_as(pred)).sum().item()

        accuracy = correct / min(len(valid_loader.dataset), N_VALID_EXAMPLES)

        trial.report(accuracy, epoch)

        # Handle pruning based on the intermediate value.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return accuracy
Optimize with Optuna#
We’ll train the model and use Optuna to find the hyperparameters that result in the best model predictions. With n_trials=5, we train the model five times, each time with a different set of parameters.
import optuna
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5, timeout=600, show_progress_bar=True)
The runtime is about 25 seconds. We can scale this up and run 100 trials, which takes about 4 minutes 20 seconds.
study.optimize(objective, n_trials=100, timeout=600, show_progress_bar=True)
Now we can analyze the results to find the best set of parameters.
pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])

print("Study statistics: ")
print("  Number of finished trials: ", len(study.trials))
print("  Number of pruned trials: ", len(pruned_trials))
print("  Number of complete trials: ", len(complete_trials))

print("Best trial:")
trial = study.best_trial

print("  Value: ", trial.value)

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Which returns the following output:
Study statistics: 
  Number of finished trials:  100
  Number of pruned trials:  61
  Number of complete trials:  39
Best trial:
  Value:  0.84609375
  Params: 
    n_layers: 1
    n_units_l0: 109
    dropout_l0: 0.3822970315388142
    optimizer: Adam
    lr: 0.007778083042789732
Looks like the best validation accuracy across these 100 trials is about 0.846.
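If you want to carry the winning configuration forward, one option (not part of the original example) is to rebuild the network by replaying define_model with Optuna's FixedTrial, which answers the suggest_* calls with the stored best parameters:

from optuna.trial import FixedTrial

# Rebuild the best architecture found by the study. Note this creates a fresh,
# untrained network; the study stores parameters, not model weights, so you
# would still retrain it (for example, on the full dataset).
best_model = define_model(FixedTrial(study.best_params)).to(torch.device("cuda"))
print(best_model)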
You can monitor usage of GPU resources during the computation in the Coiled UI, which displays metrics like GPU memory and utilization.
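If you also want a quick sanity check from inside the notebook itself, here is a small sketch using PyTorch's built-in CUDA helpers (not part of the original example):

import torch

# Confirm the notebook sees the VM's GPU and check current memory use.
print(torch.cuda.is_available())        # should be True on the g5.xlarge
print(torch.cuda.get_device_name(0))    # e.g. "NVIDIA A10G"
print(f"{torch.cuda.memory_allocated(0) / 1e9:.2f} GB allocated")
print(f"{torch.cuda.memory_reserved(0) / 1e9:.2f} GB reserved")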
Next steps#
In this example, we used Coiled notebooks to run a simple PyTorch model in a Jupyter notebook on a GPU in the cloud. It cost ~$0.10 and took ~4 minutes to train the model 100 times. Though this example uses PyTorch, you could just as easily use another deep learning library like Keras or TensorFlow instead.
If you’d like to run this example yourself, you can get started with Coiled at coiled.io/start. This notebook is available in the coiled/examples repo and runs well within the Coiled free tier (though you’ll still need to pay your cloud provider).