Distributed Model Training on Cloud GPUs with Hugging Face Accelerate and Coiled#
Accelerate is a popular library from Hugging Face for simplifying PyTorch model training and inference on distributed hardware. This enables working with large datasets and models that don’t fit into a single machine’s memory.
This post shows how to run distributed training on cloud GPUs using Accelerate and Coiled, an easy-to-use, UX-focused cloud platform that provides a similarly straightforward approach for running jobs on cloud hardware.
Start locally#
There are two components to using Accelerate for model training: (1) slightly modifying your existing PyTorch code with accelerator.prepare so it uses distributed versions of your models, dataloaders, etc., and (2) configuring Accelerate to launch your script on the hardware of your choosing (CPU, GPU, multi-GPU, etc.). Here we focus on the second component: configuring Accelerate to run distributed model training on multiple GPU-enabled cloud VMs.
For this example, we’ll use this nlp_example.py script from the Accelerate team. This particular example trains a BERT base model on GLUE MRPC data, but the same pattern can be applied to any model and dataset.
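For context on the first component, here is a condensed sketch of the pattern that script follows, not its actual contents: wrap the model, optimizer, and dataloader with accelerator.prepare and swap loss.backward() for accelerator.backward(loss). The model name, hyperparameters, and simplified data prep below are illustrative assumptions.
from accelerate import Accelerator
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer, default_data_collator

accelerator = Accelerator()

# Simplified GLUE MRPC data prep (illustrative; nlp_example.py does this more carefully).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
raw = load_dataset("glue", "mrpc", split="train")

def tokenize(batch):
    out = tokenizer(
        batch["sentence1"], batch["sentence2"],
        truncation=True, padding="max_length", max_length=128,
    )
    out["labels"] = batch["label"]
    return out

dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
dataset.set_format("torch")
train_dataloader = DataLoader(
    dataset, batch_size=16, shuffle=True, collate_fn=default_data_collator
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
optimizer = AdamW(model.parameters(), lr=2e-5)

# accelerator.prepare returns distributed-aware versions of each object,
# placed on the right device(s) for whatever launch configuration is used.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
With those changes in place, the rest of this post is about how the script gets launched, not what is inside it.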
First, we’ll install the dependencies we need:
$ pip install accelerate coiled datasets evaluate scipy scikit-learn torch transformers
To run the training script locally, use the accelerate launch CLI:
$ accelerate launch nlp_example.py
This works, but is slow when run on a machine without a GPU, and can be prohibitively slow, or not possible, when using larger models or datasets.
Instead, let’s run the same training script in parallel on multiple cloud GPU instances.
Distributed training on cloud GPUs#
To run this training script on multiple cloud GPUs, we’ll need to:
Configure Accelerate for distributed training and
Deploy our script on multiple cloud machines with Coiled
We’ll handle these two steps separately. Let’s start with configuring Accelerate for distributed training.
Configuring Accelerate for distributed training#
Accelerate has several command-line options to customize training. The minimal setup needed for distributed GPU training looks like this (we’ll fill in the XXX placeholders in the next section):
$ accelerate launch \
--multi_gpu \
--machine_rank XXX \
--main_process_ip XXX \
--main_process_port XXX \
--num_machines XXX \
--num_processes XXX \
nlp_example.py
These options specify:
--multi_gpu: Launch distributed GPU training. This, for example, sets the PyTorch devices used in training.
--machine_rank: The rank of the machine on which this script is launched. This is an integer 0, 1, 2, 3, … assigned to each machine being used in the training process; the rank 0 machine is referred to as the “main” machine. Every machine runs the same launch command with a different rank (see the example after this list).
--main_process_ip: The IP address of the rank 0 machine.
--main_process_port: The port to use to communicate with the rank 0 machine.
--num_machines: The total number of machines being used in the training job.
--num_processes: The total number of processes being used in the training job. Multiple processes can be launched on the same machine.
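As a concrete illustration of how these options fit together, here is what a hypothetical two-machine run, one GPU per machine, might look like (the IP address and port are made up for illustration); each machine runs the same command and only --machine_rank changes:
# On the rank 0 (“main”) machine, assumed to have IP 10.0.0.5:
$ accelerate launch \
--multi_gpu \
--machine_rank 0 \
--main_process_ip 10.0.0.5 \
--main_process_port 12345 \
--num_machines 2 \
--num_processes 2 \
nlp_example.py

# On the rank 1 machine, only the rank changes:
$ accelerate launch \
--multi_gpu \
--machine_rank 1 \
--main_process_ip 10.0.0.5 \
--main_process_port 12345 \
--num_machines 2 \
--num_processes 2 \
nlp_example.py
Typing these out by hand for every machine doesn’t scale, which is where Coiled comes in.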
Deploying on the cloud with Coiled#
To run the training job on the cloud, we’ll use Coiled Batch, which makes it straightforward to run scripts (bash, Python, etc.) on cloud VMs.
We’ll put our distributed accelerate launch command into a train.sh bash script and add these three Coiled Batch comments:
train.sh
#!/usr/bin/env bash
#COILED n-tasks 10
#COILED gpu True
#COILED task-on-scheduler True
accelerate launch \
--multi_gpu \
--machine_rank XXX \
--main_process_ip XXX \
--main_process_port XXX \
--num_machines XXX \
--num_processes XXX \
nlp_example.py
They tell Coiled:
#COILED n-tasks 10: The number of cloud VMs to launch as part of this job.
#COILED gpu True: Use an instance type with a GPU. By default this uses an NVIDIA T4 GPU, though you can choose any instance with the vm-type option. For example, #COILED vm-type g6e.xlarge would use instances with an NVIDIA L40S GPU with 48 GB of memory on AWS (a header using this option is sketched after this list).
#COILED task-on-scheduler True: Ensure a dedicated machine is used for the main process.
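For example, if we wanted the L40S instances mentioned above instead of the default T4s, the script header might look like the following sketch, assuming the vm-type line takes the place of gpu True since g6e.xlarge already comes with a GPU:
#!/usr/bin/env bash
#COILED n-tasks 10
#COILED vm-type g6e.xlarge
#COILED task-on-scheduler True
For the rest of this post we’ll stick with the default T4 setup shown earlier.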
We also fill in values for the accelerate launch CLI options. All the VMs Coiled launches can communicate with each other over the same secure network, so we set main_process_port to an arbitrary available port (12345 here). For num_machines, num_processes, machine_rank, and main_process_ip, we’ll use environment variables that Coiled automatically sets:
$COILED_BATCH_TASK_COUNT: The total number of tasks in the job. Here each task runs on its own cloud instance, so with 10 cloud VMs, each with a single GPU, we set both num_machines and num_processes to 10.
$COILED_BATCH_TASK_ID: An integer 0, 1, 2, 3, … assigned to each of the 10 tasks being run. When task-on-scheduler is set to True, task 0 always runs on a dedicated machine, which corresponds to the main process for Accelerate.
$COILED_BATCH_SCHEDULER_ADDRESS: The IP address of the machine running task 0 (the main process).
train.sh
#!/usr/bin/env bash
#COILED n-tasks 10
#COILED gpu True
#COILED task-on-scheduler True
accelerate launch \
--multi_gpu \
--machine_rank $COILED_BATCH_TASK_ID \
--main_process_ip $COILED_BATCH_SCHEDULER_ADDRESS \
--main_process_port 12345 \
--num_machines $COILED_BATCH_TASK_COUNT \
--num_processes $COILED_BATCH_TASK_COUNT \
nlp_example.py
Finally, we use the coiled batch run CLI to run our training job on the cloud:
$ coiled batch run train.sh
This spins up 10 GPU cloud instances, uses Coiled’s package synchronization to automatically replicate our local software on those instances, and runs the train.sh script on each of them. Accelerate ensures PyTorch uses the available GPU hardware and handles the coordination and communication that needs to happen between machines during distributed training.
View of the Coiled UI with our training script running on cloud GPUs.
Conclusion#
As datasets and AI models grow in size, utilizing the accelerated hardware and scale that the cloud offers for distributed GPU training becomes increasingly important.
Accelerate is a popular library that makes it easy to take existing PyTorch code and run it distributed across multiple machines. Coiled makes it straightforward to deploy and run Accelerate on cloud VMs with GPUs.