MPI#

Multi-node GPU clusters with MPI and faster inter-node networking

The coiled mpi CLI lets you run MPI workloads on Coiled, using a fast interconnect between nodes.

This feature is under active development, and we’d love to help you get started and to hear which improvements would help you most.

Currently this feature only works on AWS GPU instance types that support Elastic Fabric Adapter (EFA). This includes g6.8xlarge instances with NVIDIA L4 GPUs, as well as p4d (A100) and p5 (H100) instances.

For MPI clusters, you first create a cluster with MPI configured by running coiled mpi setup, then submit commands by running coiled mpi run. Here’s a “hello world” example you can run in your terminal:

coiled mpi setup  # create the cluster (this takes 4 to 5 minutes)

coiled mpi run -- echo hello  # this will print many lines of "hello"

The cluster will automatically shut down after an idle timeout (default is 10 minutes), or can be shut down manually using coiled cluster stop.
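
For example, to stop the cluster manually you could run (the cluster name here is a placeholder; use whatever name coiled cluster list shows for your MPI cluster):

coiled cluster stop mpi-cluster-example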

Software#

Usually you’ll want to install software that will be run via MPI. Docker and package sync are not currently supported. You can specify packages to install on all the nodes with apt (for Ubuntu packages) or pip (for Python packages), or you can install anything else you need by specifying a host setup script.

For example, you can start a cluster, install the mpi4py Python package and the curl and tree system packages, and run a custom setup script that uses curl to download something, like so:

coiled mpi setup \
  --pip mpi4py \
  --apt curl --apt tree \
  --setup-script download_and_do_stuff.sh
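
Here, download_and_do_stuff.sh is a shell script that runs on each node. Its contents are entirely up to you; the sketch below is purely hypothetical (the URL and paths are illustrative), just to show the idea:

#!/bin/bash
# Hypothetical host setup script: fetch and unpack a data file on every node.
set -euo pipefail
curl -fsSL https://example.com/data.tar.gz -o /tmp/data.tar.gz
tar -xzf /tmp/data.tar.gz -C /tmp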

Scripts and code#

Often you’ll want to run a script or code that you have locally. Because MPI assumes that required code is available on every node in the cluster, Coiled provides a way to specify files that need to be copied to every node.

Any file that’s invoked as part of the command is automatically copied. For example, if you run:

coiled mpi run -- python my_script.py

then my_script.py will be copied to each node so that it can be run on every node via MPI.
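
For example, a minimal my_script.py might look like the hypothetical sketch below, which assumes mpi4py was installed at setup time (for example with coiled mpi setup --pip mpi4py); each MPI rank prints its own rank:

from mpi4py import MPI

comm = MPI.COMM_WORLD   # communicator spanning all MPI processes
rank = comm.Get_rank()  # this process's rank
size = comm.Get_size()  # total number of processes across all nodes

print(f"hello from rank {rank} of {size}")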

If you have additional scripts or code that are required, you can specify these using --upload. For example, you might have a local my_module directory that looks like this:

my_module
├── __init__.py
└── foo.py

and inside my_script.py you import from my_module.foo. To run this with MPI on Coiled so that both my_script.py and my_module are copied to the cluster nodes, you’d run:

coiled mpi run --upload my_module -- python my_script.py

Note that to upload multiple paths or individual files, you can specify --upload path1 --upload path2 and so on.
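
To make this concrete, here is a hypothetical sketch of the two files (the function and its contents are illustrative, not part of Coiled):

# my_module/foo.py
def greeting(source: str) -> str:
    return f"hello from {source}"

# my_script.py
from my_module.foo import greeting

print(greeting("my_script"))  # runs once per MPI rank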

cuPyNumeric with Legate#

You can install cuPyNumeric on your cluster like so:

coiled mpi setup --pip nvidia-cupynumeric --worker-nodes 2

There’s a --legate flag, which uses Legate and MPI together to distribute your cuPyNumeric work across the nodes. For example, here’s a simple matrix multiplication using cuPyNumeric:

import cupynumeric as np
from legate.timing import time

start_time = time()  # legate.timing.time() reports elapsed time in microseconds
size = 20000
A = np.random.randn(size, size)
B = np.random.randn(size, size)
C = np.matmul(A, B)
end_time = time()
elapsed_time = (end_time - start_time) / 1000  # microseconds -> milliseconds
print(f"Problem size: {size}")
print(f"Matrix multiplication took {elapsed_time:.4f} ms")

Save this as simple_mm.py, then execute it on your MPI cluster by running:

coiled mpi run --legate -- simple_mm.py

Note that any files referenced in the coiled mpi run command—in this case simple_mm.py—will automatically be copied to all the nodes.

Drivers and network configuration#

The coiled mpi setup command creates a cluster with all the nodes in a cluster placement group. This places them physically closer together, which reduces inter-node latency.

AWS Elastic Fabric Adapters are attached, which support UCX between nodes. NCCL will automatically make use of this.

The VMs use the “Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04)” AMI, which has relevant drivers installed.
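
As a quick sanity check, if the AMI’s EFA software stack has libfabric’s fi_info utility on the PATH (an assumption worth verifying on your cluster), you can list the EFA providers visible on each node with:

coiled mpi run -- fi_info -p efa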

You can benchmark the UCX interconnect using the CUDA + EFA tests that are included in this AMI. For example, here’s the “all reduce perf” test:

coiled mpi run -- \
  -x NCCL_BUFFSIZE=33554432 \
  --map-by ppr:1:node --rank-by slot \
  --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
  --bind-to none \
  /usr/local/cuda/efa/test-cuda-12.9/all_reduce_perf -b 32M -e 512M -f 2 -g 1 -c 1 -n 10