PyTorch on Cloud GPUs with Coiled#

Train PyTorch models on cloud GPUs from anywhere.

Training models on specialized hardware like GPUs often yields significant performance gains. However, GPUs can be difficult to access and set up properly.

Here we show how to train a PyTorch model on a cloud GPU using Coiled. This example can be run from anywhere, including machines that don’t have an NVIDIA GPU (like a MacBook).

Define Model#

Let’s train a GPU-accelerated PyTorch model on the CIFAR10 dataset. To start, we’ll install the packages we need, using either pip or conda:

# Option 1: install with pip
$ pip install torch torchvision coiled

# Option 2: install with conda
$ conda create -n env python=3.11 pytorch torchvision coiled
$ conda activate env

Next, we’ll define our PyTorch model:

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 1024, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(1024, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
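
The `16 * 5 * 5` input size of `fc1` follows from how each layer shrinks a 32×32 CIFAR10 image. As a rough check, the sketch below traces the spatial size through the network in plain Python. It assumes the PyTorch defaults the model relies on (stride 1 and no padding for `nn.Conv2d`). The helper names `conv_out` and `pool_out` are made up for this illustration and are not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of max pooling without padding."""
    return (size - kernel) // stride + 1


size = 32                      # CIFAR10 images are 32x32 pixels
size = conv_out(size, 5)       # conv1: 32 -> 28
size = pool_out(size, 2, 2)    # pool:  28 -> 14
size = conv_out(size, 5)       # conv2: 14 -> 10
size = pool_out(size, 2, 2)    # pool:  10 -> 5

flat_features = 16 * size * size  # conv2 produces 16 output channels
print(flat_features)  # 400, matching nn.Linear(16 * 5 * 5, 120)
```

If you change a kernel size or the input resolution, rerunning this arithmetic gives the new value to use in `fc1` and in the `x.view(...)` call.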

Train Model#

Now that we have our model, let’s train it on our dataset on the hardware of our choosing. Fortunately, PyTorch makes it straightforward to utilize different types of hardware that are available locally.

Below is a train function that takes a PyTorch device as input (defaults to "cpu"), loads the CIFAR10 dataset, trains our model on the dataset on the specified hardware, and returns the trained model.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms


def train(device="cpu"):
    # Select hardware to run on
    device = torch.device(device)

    model = Net()
    model = model.to(device)

    # Load CIFAR10 dataset
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform,
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=256, shuffle=True,
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(10):
        print(f"Epoch {epoch + 1}")
        model.train()
        for batch in trainloader:
            # Move training data to device
            inputs, labels = batch[0].to(device), batch[1].to(device)

            optimizer.zero_grad()

            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    return model.to("cpu")
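
Because `train` takes the device as a plain string, choosing hardware for a local run comes down to picking that string. The sketch below is one possible way to do it. The `pick_device` helper is hypothetical (not part of PyTorch or Coiled) and takes the availability flags as arguments, so it does not depend on which hardware or packages are present.

```python
def pick_device(cuda_available, mps_available=False):
    """Return the name of the preferred PyTorch device.

    Prefers an NVIDIA GPU ("cuda"), then Apple silicon ("mps"),
    and falls back to "cpu" when neither is available.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Typical usage with PyTorch installed (shown as comments so this
# sketch runs without it):
#   import torch
#   device = pick_device(
#       torch.cuda.is_available(),
#       torch.backends.mps.is_available(),
#   )
#   model = train(device=device)

print(pick_device(cuda_available=False))  # cpu
```

This only matters for local runs. When the function runs on a GPU-enabled cloud VM, the device can be passed explicitly as `"cuda"`.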

Run on a Cloud GPU#

Our train function can now be run on any hardware that’s available locally. To train our model on a GPU, we’ll use Coiled Functions to lightly annotate our existing train function to run on a GPU-enabled cloud VM.

import coiled

@coiled.function(
    vm_type="g5.xlarge",   # AWS instance type with an NVIDIA A10G GPU
    region="us-west-2",
)
def train(device="cpu"):
    # Same training code as before
    ...

Now when the train function is run, Coiled automatically provisions a cloud VM (a g5.xlarge instance on AWS in this case), installs the same software on the VM that’s installed locally, runs our train function on the VM, and returns the result to the local machine.

model = train(device="cuda")  # Train model on cloud GPU

Putting it All Together#

Putting all the pieces together, here is the full workflow:

import coiled
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 1024, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(1024, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


@coiled.function(
    vm_type="g5.xlarge",
    region="us-west-2",
)
def train(device="cpu"):
    # Select hardware to run on
    device = torch.device(device)

    model = Net()
    model = model.to(device)

    # Load dataset
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform,
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=256, shuffle=True,
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(10):
        print(f"Epoch {epoch + 1}")
        model.train()
        for batch in trainloader:
            # Move training data to device
            inputs, labels = batch[0].to(device), batch[1].to(device)

            optimizer.zero_grad()

            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    return model.to("cpu")


model = train(device="cuda")

This takes ~4.5 minutes to run (including VM spinup time) and costs ~$0.08 (running locally would have taken ~1.5 hours). We also see that GPU utilization is high, so we’re effectively using the available cloud hardware.

GPU utilization throughout model training. Utilization is typically high throughout, meaning we’re utilizing the available hardware well.#
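
For a rough sense of the timing and cost figures quoted above, the back-of-the-envelope sketch below turns them into a speedup and an implied hourly rate. All inputs are the approximate values reported in this example, so the results are estimates only. The implied rate is derived from those figures and is not a quoted AWS or Coiled price.

```python
# Approximate figures reported for this example
cloud_minutes = 4.5        # cloud GPU run, including VM spinup
local_minutes = 1.5 * 60   # estimated local run (~1.5 hours)
cloud_cost_usd = 0.08      # reported cost of the cloud run

speedup = local_minutes / cloud_minutes
implied_hourly_rate = cloud_cost_usd / (cloud_minutes / 60)

print(f"Speedup: ~{speedup:.0f}x")                         # Speedup: ~20x
print(f"Implied rate: ~${implied_hourly_rate:.2f}/hour")   # Implied rate: ~$1.07/hour
```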

Summary#

We showed how to accelerate PyTorch model training on a cloud GPU with minimal code changes using Coiled. PyTorch made it straightforward to take advantage of advanced hardware for model training, and Coiled made it easy to run our code on the cloud hardware of our choosing.

Here are additional machine learning examples you might also be interested in: