PyTorch with Coiled Functions

PyTorch is a library for deep learning on GPUs and CPUs. In this guide, you’ll learn how to use Coiled serverless functions to train a PyTorch model on a cloud-hosted VM with a GPU attached.

Before you start

We will use the CUDA support in PyTorch to train our model on a GPU. Normally, Coiled can analyze your local Python environment and replicate it on cloud-hosted VMs. That would require a local machine that lets you install the necessary libraries, which is not always possible. This example instead uses a Coiled Software Environment to specify the necessary dependencies.

import coiled

coiled.create_software_environment(
    name="pytorch",
    conda={
        "channels": ["pytorch", "nvidia", "conda-forge", "defaults"],
        "dependencies": [
            "python=3.11",
            "coiled",
            "pytorch",
            "torchvision",
            "torchaudio",
            "cudatoolkit",
            "pynvml",
        ],
    },
    gpu_enabled=True,
)

The key packages are cudatoolkit and pytorch, plus a few other dependencies for our model. The environment only needs to be created once. It is cached and can be reused later on.

About the data

In this example we will use the CIFAR10 dataset that is provided by PyTorch.
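
The training script later in this guide converts each image with ToTensor and then applies Normalize with per-channel means and standard deviations. As a rough illustration of what those two steps do to a single pixel, here is a plain-Python sketch. The helper name normalize_pixel is made up for this example and is not part of torchvision.

```python
# Per-channel statistics copied from the transform used in this guide.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2470, 0.2435, 0.2616)


def normalize_pixel(rgb):
    """Mimic ToTensor followed by Normalize for one (R, G, B) pixel.

    Hypothetical helper, for illustration only.
    """
    # ToTensor rescales 8-bit values from [0, 255] to [0.0, 1.0].
    scaled = [value / 255 for value in rgb]
    # Normalize computes (value - mean) / std for each channel.
    return [(s - m) / sd for s, m, sd in zip(scaled, MEAN, STD)]


print(normalize_pixel((0, 0, 0)))        # black pixel: every channel is negative
print(normalize_pixel((255, 255, 255)))  # white pixel: every channel is positive
```

After normalization the inputs are roughly centered around zero, which tends to make training with SGD better behaved.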

About the model

We will use the Net model that is given in the PyTorch Tutorials.
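
The first fully connected layer of Net, defined in the training section of this guide, expects 16 * 5 * 5 inputs. The sketch below retraces that number, assuming the standard 32x32 CIFAR10 image size and layers without padding (the PyTorch defaults used by the model). The helper names conv_out and pool_out are illustrative only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution (square input and kernel)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output width/height of max pooling without padding."""
    return (size - kernel) // stride + 1


size = 32                      # CIFAR10 images are 32x32 pixels
size = conv_out(size, 5)       # conv1: 5x5 kernel -> 28
size = pool_out(size, 2, 2)    # pool: 2x2 window, stride 2 -> 14
size = conv_out(size, 5)       # conv2: 5x5 kernel -> 10
size = pool_out(size, 2, 2)    # pool: 2x2 window, stride 2 -> 5

channels = 16                  # output channels of conv2
flat_features = channels * size * size
print(flat_features)           # 400, matching nn.Linear(16 * 5 * 5, 120)
```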

Dispatch the computation to a cloud-hosted VM with a GPU attached

We have to decorate the function that does the work with a Coiled-specific decorator that offloads the computation.

import coiled

@coiled.function(
    vm_type="g5.xlarge",  # GPU instance type
    region="us-west-2",
    software="pytorch",  # Software environment that we created earlier
)
def train(transform):
    ...  # Function body is shown in the training section of this guide

This will offload our workload to an EC2 instance that has a GPU attached to it.
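
If decorators that take arguments are unfamiliar, the following toy example shows the general Python pattern. It is not how Coiled is implemented: the offload decorator below is hypothetical. It only records the requested options and runs the function locally, whereas coiled.function provisions a VM and runs the function there.

```python
import functools


def offload(**vm_options):
    """Hypothetical stand-in for a remote-execution decorator."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real implementation would ship func and its arguments to a
            # remote machine here. This sketch simply calls it locally.
            return func(*args, **kwargs)

        # Keep the requested options around so they can be inspected.
        wrapper.vm_options = vm_options
        return wrapper

    return decorator


@offload(vm_type="g5.xlarge", region="us-west-2")
def add(a, b):
    return a + b


print(add(1, 2))       # the caller uses the function as usual
print(add.vm_options)  # options passed to the decorator
```

The point of the pattern is that the call site does not change: the caller still writes add(1, 2), and the decorator decides where and how the body runs.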

Training the model

We will perform the training step on the VM that’s hosted in the cloud before sending the model back to our local machine.

import coiled
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))])


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


@coiled.function(
    vm_type="g5.xlarge",
    region="us-west-2",
    software="pytorch",
)
def train(transform):
    import torch
    import torchvision
    import torch.nn as nn
    import torch.optim as optim

    device = torch.device("cuda:0")

    net = Net()
    net = net.to(device)

    trainset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform,
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=4, shuffle=True,
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # One pass (a single epoch) over the training set
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)

        optimizer.zero_grad()

        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    return net.to(torch.device("cpu"))
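
To make the optimizer.step() call less opaque, here is a minimal plain-Python sketch of SGD with momentum on a one-dimensional toy problem. It follows the update rule documented for torch.optim.SGD with default dampening (velocity = momentum * velocity + gradient, then parameter -= lr * velocity). The toy loss (w - 3) ** 2 and the function name are assumptions chosen for illustration and are not part of the training code.

```python
def sgd_momentum_minimize(w, lr=0.001, momentum=0.9, steps=2000):
    """Minimize the toy loss (w - 3) ** 2 with SGD plus momentum."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)                     # derivative of (w - 3) ** 2
        velocity = momentum * velocity + grad  # accumulate past gradients
        w = w - lr * velocity                  # parameter update
    return w


w_final = sgd_momentum_minimize(w=0.0)
print(w_final)  # close to the minimizer at 3.0
```

The learning rate and momentum match the values passed to optim.SGD in the train function. In the real training loop the gradient comes from loss.backward() instead of a hand-written derivative.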

Let’s call the training function from our local machine. The training itself runs on the cloud-hosted VM:

if __name__ == "__main__":
    model = train(transform)
    print(model)

Files already downloaded and verified
Net(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

This is where the decorator comes in. Coiled Functions offloads the computation to AWS and sends back the trained model, which is very small. The provisioned instance is shut down after the computation finishes. Inside train, we moved the model back to the CPU before returning it so that it can be used locally without a GPU.
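
To put a number on "very small", the sketch below counts the parameters of Net from the layer shapes printed above. The byte estimate assumes 32-bit floats, PyTorch's default dtype, and ignores serialization overhead.

```python
# (weights, biases) per layer, derived from the layer shapes of Net.
layers = {
    "conv1": (3 * 6 * 5 * 5, 6),
    "conv2": (6 * 16 * 5 * 5, 16),
    "fc1": (400 * 120, 120),
    "fc2": (120 * 84, 84),
    "fc3": (84 * 10, 10),
}

total_params = sum(weights + biases for weights, biases in layers.values())
size_bytes = total_params * 4  # assuming float32, 4 bytes per parameter

print(total_params)            # 62006
print(size_bytes / 1024)       # roughly 242 KiB
```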