PyTorch on Cloud GPUs with Coiled#
Train PyTorch models on cloud GPUs from anywhere.
Model training can often see significant performance boosts when run on advanced hardware like GPUs. However, GPUs can also be difficult to access and set up properly.
Here we show how to easily train a PyTorch model on a cloud GPU using Coiled. This example can be run from anywhere, including machines that don’t have an NVIDIA GPU (like a MacBook).
Define Model#
Let’s train a GPU-accelerated PyTorch model on the CIFAR10 dataset. To start, we’ll install the packages we need, using either pip or conda:
$ pip install torch torchvision coiled
$ conda create -n env python=3.11 pytorch torchvision coiled
$ conda activate env
Next, we’ll define our PyTorch model:
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 1024, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(1024, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
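Before training, it can help to sanity-check the model’s shapes. The following snippet is our addition (not part of the original example): it passes a random batch of CIFAR10-sized images through the network and confirms we get one score per class.

import torch

# Sanity check (our addition): CIFAR10 images are 3x32x32, and the model
# should produce 10 class scores per image.
x = torch.randn(4, 3, 32, 32)
print(Net()(x).shape)  # expected: torch.Size([4, 10])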
Train Model#
Now that we have our model, let’s train it on our dataset on the hardware of our choosing. Fortunately, PyTorch makes it straightforward to utilize different types of hardware that are available locally.
Below is a train function that takes a PyTorch device as input (defaulting to "cpu"), loads the CIFAR10 dataset, trains our model on the specified hardware, and returns the trained model.
import torch
import torchvision
import torch.optim as optim
import torchvision.transforms as transforms


def train(device="cpu"):
    # Select hardware to run on
    device = torch.device(device)
    model = Net()
    model = model.to(device)

    # Load CIFAR10 dataset
    trainset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))]),
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=256, shuffle=True,
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(10):
        print(f"Epoch {epoch + 1}")
        model.train()
        for batch in trainloader:
            # Move training data to device
            inputs, labels = batch[0].to(device), batch[1].to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    return model.to("cpu")
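For example (our addition, not part of the original example), we could already train locally on whichever device is available:

# Our addition: train locally on a GPU if one is available, otherwise on the CPU
local_device = "cuda" if torch.cuda.is_available() else "cpu"
model = train(device=local_device)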
Run on a Cloud GPU#
Our train function can now be run on any hardware that’s available locally. To train our model on a GPU, we’ll use Coiled Functions to lightly annotate our existing train function so it runs on a GPU-enabled cloud VM.
import coiled


@coiled.function(
    vm_type="g5.xlarge",  # VM with an NVIDIA A10G GPU
    region="us-west-2",
)
def train(device="cpu"):
    # Same training code as before
    ...
Now when the train function is run, Coiled will automatically provision a cloud VM (a g5.xlarge instance on AWS in this case), install the same software we have locally on that VM, run our train function there, and return the result back to our local machine.
model = train(device="cuda") # Train model on cloud GPU
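Because the trained model is returned back to the local machine, we can use it locally like any other PyTorch model. For instance (our addition, not part of the original example), we might save its weights to disk and reload them later:

# Our addition: persist the trained weights locally for later reuse
torch.save(model.state_dict(), "cifar10_net.pt")

# Later, reload them into a fresh model instance
model = Net()
model.load_state_dict(torch.load("cifar10_net.pt"))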
Putting it All Together#
Putting all the pieces together, here is the full workflow:
import coiled
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 1024, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(1024, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


@coiled.function(
    vm_type="g5.xlarge",
    region="us-west-2",
)
def train(device="cpu"):
    # Select hardware to run on
    device = torch.device(device)
    model = Net()
    model = model.to(device)

    # Load dataset
    trainset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))]),
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=256, shuffle=True,
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(10):
        print(f"Epoch {epoch + 1}")
        model.train()
        for batch in trainloader:
            # Move training data to device
            inputs, labels = batch[0].to(device), batch[1].to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    return model.to("cpu")


model = train(device="cuda")
This takes ~4.5 minutes to run (including VM spinup time) and costs ~$0.08 (running locally would have taken ~1.5 hours). We also see that GPU utilization is high, so we’re effectively using the available cloud hardware.
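As a final check, we can evaluate the returned model locally on the CIFAR10 test split. This snippet is our addition and isn’t part of the original example:

# Our addition: measure accuracy on the CIFAR10 test set
testset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))]),
)
testloader = torch.utils.data.DataLoader(testset, batch_size=256)

model.eval()
correct = total = 0
with torch.no_grad():
    for inputs, labels in testloader:
        predictions = model(inputs).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

print(f"Test accuracy: {correct / total:.1%}")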
Summary#
We showed how to accelerate PyTorch model training on a cloud GPU with minimal code changes using Coiled. PyTorch made it straightforward to take advantage of advanced hardware for model training, and Coiled made it easy to run our code on the cloud hardware of our choosing.