Package Sync#

Package sync scans your local Python environment and replicates it on the cluster, including local packages and Git dependencies. It's easier and faster than building a Docker image. Here's an example showing how the locally installed httpx and my_local_package packages are automatically installed on your remote cluster.

import dask
import httpx

import coiled

import my_local_package


@dask.delayed
def func():
    # This function requires `httpx` and `my_local_package` to be installed.
    data = httpx.get("https://my-api.io/foo")
    return my_local_package.process(data)


# Notice we don't tell Coiled what packages we need.
# The local environment is automatically replicated on the cluster.
cluster = coiled.Cluster()
client = cluster.get_client()

# When `func` runs on the cluster, `httpx` and `my_local_package` are already there.
result = func().compute()

Using package sync#

By default, Coiled inspects the packages installed in your local Python environment and emulates that environment on the remote cluster.

import coiled

cluster = coiled.Cluster(
    n_workers=15,  # Coiled automatically syncs packages by default
)
client = cluster.get_client()

Package sync works with many kinds of locally installed packages:

  • Conda packages

  • Pip-installed packages from PyPI

  • Locally installed editable packages (example below)

  • .py files in your working directory

See the full compatibility table for more details.
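
For instance, an editable install like the one below is picked up and synced just like a normal dependency. The directory name reuses my_local_package from the example above and is hypothetical:

# Install a local package in editable (develop) mode; package sync detects it.
pip install -e ./my_local_package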

Why do I need this?#

Your code imports many Python packages, like pandas and NumPy. You have those installed on your computer, but for your code to run on many machines in the cloud, pandas, NumPy, and everything else must also be installed on those machines. Not only do they need to be installed, but they need to be the same versions as you have locally. Otherwise, you could get errors, or even incorrect results.

Package sync ensures all these versions match without any extra work on your part. It means you can just call coiled.Cluster from anywhere you run Python and get a matching environment in the cloud.
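
As a quick sanity check, you can compare a package's version locally and on the cluster. Here is a minimal sketch using pandas as the example package, assuming it is installed locally:

import pandas

import coiled

cluster = coiled.Cluster()
client = cluster.get_client()


def remote_pandas_version():
    # Runs on a cluster worker, where package sync already installed pandas.
    import pandas
    return pandas.__version__


# With package sync, the two versions match.
print("local: ", pandas.__version__)
print("remote:", client.submit(remote_pandas_version).result())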

Achieving this is usually a major pain point with most distributed computing systems, Dask included. Often, to solve it, you'd build and maintain a Docker image, or provide a list of dependencies to install on the cluster. Not only is this extra work, but it easily gets out of date. Package sync eliminates the extra work and ensures your cluster has the right packages every time.

How does it work?#

Package sync scans all the packages installed in your current Python environment (including non-Python Conda packages, such as binary dependencies). The package metadata is then used to create an equivalent environment for the remote machines. Our cluster of high-end machines then builds an artifact that the machines in your cluster download and install. See our blog post for more details.
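
For a rough intuition of the scanning step, here is a minimal sketch of how installed packages can be enumerated locally with the standard library. This is illustrative only, not Coiled's actual implementation, and it only sees Python packages, not conda-level binary dependencies:

from importlib.metadata import distributions

# Enumerate every installed distribution with its pinned version,
# roughly the kind of metadata a scan like this starts from.
env = sorted(
    (dist.metadata["Name"], dist.version)
    for dist in distributions()
    if dist.metadata["Name"]
)
for name, version in env:
    print(f"{name}=={version}")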

Performance#

Package sync clusters typically launch as fast as, and in some cases much faster than, clusters using an equivalent Docker image (even faster if you include image build time). For the fastest environment build, conda/mamba-based environments are currently recommended. However, your choice of package manager does not affect install time on the cluster; it's lightning fast either way.

If you don't change any packages locally, re-launching a cluster with the same environment is faster for the following 24 hours, because we cache your environment.

Note that your internet connection speed generally does not matter for package sync performance: the cluster downloads packages directly from their sources (conda channels, PyPI, etc.) rather than uploading them from your computer. Local dependencies are the one exception.

Debugging#

Package sync does not always work, especially with messy environments. In these cases, we often recommend creating a fresh conda or virtualenv environment and running from there:

conda create -n myenv -c conda-forge coiled dask ipykernel
conda activate myenv
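
If you prefer venv and pip, the equivalent setup looks like this (assuming a POSIX shell):

python -m venv myenv
source myenv/bin/activate
pip install coiled dask ipykernel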

Package sync works reliably in these fresh environments.

Learn more#