MPI#

Multi-node GPU clusters with MPI and faster inter-node networking

The coiled mpi CLI allows you to run MPI workloads on Coiled, using a fast interconnect between nodes.

This feature is under active development. We'd love to help you get started, and to hear how we could develop it further to better support your workloads.

Currently this feature only works on AWS GPU instance types that support Elastic Fabric Adapter (EFA). This includes g6.8xlarge instances with NVIDIA L4 GPUs, as well as p4d (A100) and p5 (H100) instances.

For MPI clusters, you first create a cluster, then submit commands to run via MPI:

coiled mpi setup  # this takes 4 to 5 minutes

coiled mpi run -- echo hello  # this will print many lines of "hello"

The cluster will automatically shut down after an idle timeout (default is 10 minutes), or you can shut it down manually using coiled cluster stop.
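For example, once you're done you can shut the cluster down yourself. This is a minimal sketch; it assumes your version of the CLI includes the coiled cluster list command for looking up the cluster name, and <cluster-name> is a placeholder:

coiled cluster list                 # find the name of your MPI cluster
coiled cluster stop <cluster-name>  # shut it down right away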

Software#

Usually you'll want to install the software that will be run via MPI. Docker and package sync are not currently supported. You can specify packages to install on all the nodes using apt (for Ubuntu system packages) or pip (for Python packages), or you can install anything else you need by specifying a host setup script.

For example, you can start a cluster, install the mpi4py Python package and the curl and tree system packages, and run a custom script that downloads something using curl, like so:

coiled mpi setup \
  --pip mpi4py \
  --apt curl --apt tree \
  --setup-script download_and_do_stuff.sh
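
Here's an illustrative sketch of what download_and_do_stuff.sh might contain, assuming the setup script is an ordinary shell script run on each node; the script contents, URL, and paths are placeholders, not part of the coiled CLI:

#!/usr/bin/env bash
# download_and_do_stuff.sh -- hypothetical host setup script, run on every node
set -euo pipefail

# Fetch an input file with curl (installed via --apt above); the URL is a placeholder
curl -L -o /tmp/input.dat https://example.com/input.dat

# Inspect what landed, using the tree package also installed via --apt
tree /tmp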

cuPyNumeric with Legate#

You can install cuPyNumeric on your cluster like so:

coiled mpi setup --pip nvidia-cupynumeric --worker-nodes 2
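
As a quick sanity check, you can confirm the package imports on every node. This is a sketch and assumes python3 on the nodes is the interpreter that --pip installs into:

coiled mpi run -- python3 -c "import cupynumeric; print(cupynumeric.__version__)"  # prints the version once per MPI rank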

There's a --legate flag that uses Legate and MPI together to distribute your cuPyNumeric work across the nodes. For example, here's a simple matrix multiplication using cuPyNumeric:

# simple_mm.py
from legate.timing import time

import cupynumeric as np

start_time = time()  # legate.timing.time() returns microseconds
size = 20000
A = np.random.randn(size, size)
B = np.random.randn(size, size)
C = np.matmul(A, B)
end_time = time()
elapsed_time = (end_time - start_time) / 1000  # microseconds -> milliseconds
print(f"Problem size: {size}")
print(f"Matrix multiplication took {elapsed_time:.4f} ms")

You can then execute this on your MPI cluster by running:

coiled mpi run --legate -- simple_mm.py

Note that any files referenced in the coiled mpi run command—in this case simple_mm.py—will automatically be copied to all the nodes.

Drivers and network configuration#

The coiled mpi setup command creates a cluster with all the nodes in a cluster placement group. This means they're physically closer together, which reduces inter-node latency.

AWS Elastic Fabric Adapters (EFA) are attached, which support UCX between nodes. NCCL will automatically make use of this.

The VMs use the “Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04)” AMI, which has the relevant drivers preinstalled.
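
If you want to confirm that each node sees its GPU and an EFA device before running a larger job, you can run standard diagnostics across the cluster. This is a sketch; it assumes nvidia-smi and the libfabric fi_info utility are on the PATH in this AMI:

coiled mpi run -- nvidia-smi -L   # list the GPUs visible on each node
coiled mpi run -- fi_info -p efa  # check that the EFA libfabric provider is available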

You can benchmark the UCX interconnect using the CUDA + EFA tests that are included in this AMI. For example, here’s the “all reduce perf” test:

coiled mpi run -- \
  -x NCCL_BUFFSIZE=33554432 \
  --map-by ppr:1:node --rank-by slot \
  --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
  --bind-to none \
  /usr/local/cuda/efa/test-cuda-12.9/all_reduce_perf -b 32M -e 512M -f 2 -g 1 -c 1 -n 10