MLflow on Coiled with Sidecars and Filestores#

Coiled lets you run ephemeral clusters in the cloud. We introduce sidecars to run arbitrary containers alongside your workload, and filestores to sync local data to the cloud and persist data across runs. We use them to run an MLflow server concurrently with model training.

MLflow is a widely used open source tool for ML experiment tracking, model packaging, registry management, and deployment. Some people host their own server, but many use a hosted solution. Our users asked us about running MLflow on Coiled, and, using some new features, we finally found a way that aligns with Coiled’s principle of only paying for what you use. It’s idiosyncratic, but it works quite well.
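
If you haven't used MLflow before, the experiment-tracking part boils down to a handful of calls like these (a minimal sketch; the tracking URI, experiment name, and logged values are just for illustration):

import mlflow

# Point MLflow at a tracking server; without this it writes to a local ./mlruns directory.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("fashion-mnist")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.1)   # hyperparameters
    mlflow.log_metric("loss", 0.42, step=1)  # metrics over time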

Running MLflow as a Sidecar#

It’s always been possible to run MLflow on Coiled using coiled run, but this required starting the server before training, obtaining the server’s URL, and passing it to the training code—too many moving parts.[1]

Instead, we introduce sidecars, a feature that lets you specify one or more Docker containers to run on the scheduler and/or workers during jobs. With them, you can run the MLflow server alongside the training code. We have no intention of going full Docker Compose on you, but you may recognize the format if you squint at a sidecar spec YAML file:

mlflow:
  image: astral/uv:debian-slim
  command: uvx mlflow server --host 0.0.0.0 --port 5000 
  ports:
    - 5000

There’s a name for the sidecar, the image to use, the command to run, and any ports to expose. To launch MLflow, we use a small uv container and uvx, as we did in the marimo post, rather than the official MLflow image.

You can also define sidecars directly in your Python code. Here’s an example of adding a sidecar to a Dask cluster:

import coiled

cluster = coiled.Cluster(
    ...,  # your usual cluster options
    scheduler_sidecars=[
        {
            "name": "mlflow",
            "image": "astral/uv:debian-slim",
            "command": "uvx mlflow server --host 0.0.0.0 --port 5000",
            "ports": [5000],
        }
    ],
)

Persisting Data with Filestores#

To persist data across jobs, we use filestores, a new feature that enables:

  • Copying data between your local machine and Coiled-managed VMs in the cloud,

  • Persisting it between multiple runs, and

  • Transferring data between steps in a processing pipeline.

Here’s a quick example of using a filestore with coiled batch to copy local files from local-directory to the cloud VM where my_script.py runs:

coiled batch run --sync local-directory --wait my_script.py

The files will be available in the workers’ working directory, and any outputs that the code writes to the working directory on the cloud VM will be stored in a filestore, available for download to the local machine or to other Coiled-managed cloud VMs. All tasks in a batch job share the filestore, and you can sync back outputs from all the tasks together to your local machine with:

coiled file download --into output-data

In the example above, we add the --wait flag so the command blocks until the work completes and then syncs new or modified files back to the local machine; by default, coiled batch run returns immediately without waiting.
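
To make the output side concrete, here's a sketch of what my_script.py might do; anything it writes to its working directory on the cloud VM ends up in the filestore (the results.json name and contents are placeholders, not part of the real example):

# my_script.py
import json
from pathlib import Path

# ... do some work ...
results = {"accuracy": 0.93}  # placeholder output

# Files written to the working directory on the cloud VM are captured by the
# filestore and can be pulled down later with `coiled file download`.
Path("results.json").write_text(json.dumps(results))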

The sidecar spec lets you specify filestores to attach. Data in named filestores is downloaded to the /scratch/filestores directory on the VMs. Here, we attach a filestore named mlflow-data to persist the MLflow backend data:

mlflow:
  image: astral/uv:debian-slim
  command: uvx --python 3.13 mlflow server --host 0.0.0.0 --port 5000 
    --backend-store-uri file:///scratch/filestores/mlflow-data
  ports:
    - 5000
  filestores:
    - name: mlflow-data
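
You can also attach the filestore when defining the sidecar in Python. This is a sketch under the assumption that the dict form mirrors the YAML keys, as it does for name, image, command, and ports above; check the sidecar docs for the exact spelling:

import coiled

cluster = coiled.Cluster(
    ...,  # your usual cluster options
    scheduler_sidecars=[
        {
            "name": "mlflow",
            "image": "astral/uv:debian-slim",
            "command": (
                "uvx --python 3.13 mlflow server --host 0.0.0.0 --port 5000 "
                "--backend-store-uri file:///scratch/filestores/mlflow-data"
            ),
            "ports": [5000],
            # Assumed to mirror the YAML `filestores` key above.
            "filestores": [{"name": "mlflow-data"}],
        }
    ],
)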

Training Code and Launching the Cluster#

With Coiled Batch, the training code doesn't need to know anything about Coiled, so it can also run locally. Still, we find it convenient to set a few # COILED comments in the header that Coiled will pick up, like:

# COILED env MLFLOW_TRACKING_URI=http://coiled-scheduler:5000
# COILED env MLFLOW_TRACKING_USERNAME=alex
# COILED gpu true
# COILED map-over-values 0.1, 0.05, 0.01

These comments:

  • Point the MLflow tracking URI at the scheduler, where the sidecar runs, now reachable as http://coiled-scheduler from the workers (the scheduler address is also still available in the COILED_BATCH_SCHEDULER_ADDRESS environment variable),

  • Set the user name so it’s logged with the experiment results,

  • Request VMs with a GPU attached,

  • And specify the list of learning rates to try. Coiled will launch a VM for each value.

The model we train in the example is a straightforward PyTorch classifier on the FashionMNIST dataset and doesn't contain any Coiled-specific code. The only notable thing is that it accepts the learning rate as an argument, so we can train the model locally and in the cloud without modifying the code. You can run it locally with:

python train.py 0.1

Or run it in the cloud, with the MLflow sidecar alongside:

coiled batch run --scheduler-sidecar-spec mlflow-sidecar.yaml \
   -- train.py \$COILED_BATCH_TASK_INPUT
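
For reference, here's a trimmed-down sketch of what such a training script might look like. The real example trains a PyTorch classifier on FashionMNIST; the experiment name and loop below are just illustrative, but the shape is the same: read the learning rate from the command line, let MLflow pick up its settings from the environment, and log parameters and metrics as training progresses.

# train.py -- minimal sketch; the real example trains a PyTorch classifier.
import sys
import mlflow

learning_rate = float(sys.argv[1])  # e.g. 0.1 locally, $COILED_BATCH_TASK_INPUT in the cloud

# MLflow picks up MLFLOW_TRACKING_URI and MLFLOW_TRACKING_USERNAME from the
# environment, so nothing here is Coiled-specific.
mlflow.set_experiment("fashion-mnist")

with mlflow.start_run():
    mlflow.log_param("learning_rate", learning_rate)
    for epoch in range(5):
        loss = 1.0 / (1 + epoch * learning_rate)  # stand-in for the real training loop
        mlflow.log_metric("loss", loss, step=epoch)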

While the cluster is running, you can access the MLflow server using the button on the cluster’s page:

The cluster dashboard shows a button to open the sidecar app when ports are exposed in the sidecar spec.

Once the cluster has stopped, you can download the filestore data to your computer with:

coiled file download --into mlflow-data

The filestore data is downloaded to the local machine.

And start the MLflow server to browse the results:

mlflow server \
        --backend-store-uri mlflow-data \
        --artifacts-destination mlflow-data
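
If you'd rather query the downloaded store from Python than click through the UI, mlflow.search_runs works against the same directory. A quick sketch (the param and metric column names depend on what your training script logged):

import os
import mlflow

# Point MLflow at the downloaded backend store.
mlflow.set_tracking_uri("file://" + os.path.abspath("mlflow-data"))

# Returns a pandas DataFrame with one row per run across all experiments.
runs = mlflow.search_runs(search_all_experiments=True)
print(runs[["run_id", "params.learning_rate", "metrics.loss"]])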

Check out the complete MLflow example to learn how to browse MLflow data remotely and split the metadata and model data into separate filestores.

MLflow Is for Sharing#

To share MLflow data with other users, simply use the same filestore name. Because all filestores in a given Coiled workspace and region share the same bucket, the data is automatically available to every member of the workspace. MLflow's unique IDs ensure there are no collisions between experiments and runs.

What Else Can You Do With Sidecars and Filestores?#

Other than running MLflow, we know some users plan on using sidecars to run a proxy on each worker to make authorized, encrypted connections to databases.

As for filestores, they address a common request to make it easy to upload local files to clusters when using Coiled Batch. We think you’ll like them. Check out the filestore docs to learn more, and let us know what you think about them.