Reference

Package sync scans your local Python environment and replicates it on the cluster—even local packages and Git dependencies. It’s easier than building a Docker image, plus it launches significantly faster.

Here’s an example showing how your cluster can (magically!) import packages you have in your local environment, in this case httpx and my_local_package:

import dask
import httpx

import coiled

from . import my_local_package


@dask.delayed
def func():
    # This function requires `httpx` and `my_local_package` to be installed.
    data = httpx.get("https://my-api.io/foo")
    return my_local_package.process(data)


# Notice we don't tell Coiled what packages we need.
# The local environment is automatically replicated on the cluster.
cluster = coiled.Cluster()
client = cluster.get_client()

# When `func` runs on the cluster, `httpx` and `my_local_package` are already there.
result = func.compute()

Why do I need this?

Your code imports many Python packages, like pandas and NumPy. You have those installed on your computer, but for your code to run on many machines in the cloud, pandas, NumPy, and everything else must also be installed on those machines. Not only do they need to be installed, but they need to be the same versions as you have locally. Otherwise, you could get errors, or even incorrect results.

Package sync ensures all these versions match without any extra work on your part. It means you can just call coiled.Cluster from anywhere you run Python and get a matching environment in the cloud.

Achieving this is usually a major pain point with most distributed computing systems, Dask included. Often, to solve it, you’d build and maintain a Docker image, or provide a list of dependencies to install on the cluster. Not only is this extra work, but it easily gets out of date. Package sync both eliminates the extra work, and ensures your cluster has the right packages every time.

Using package sync

There’s nothing you really need to do to use package sync—it just works.

Package sync is compatible with most ways you can use Python, such as conda/mamba environments, Python virtual environments, or even packages installed in the system Python environment (see the full compatibility table). When you create a Cluster, package sync will scan all the packages installed in your current Python environment (including non-Python Conda packages, such as binary dependencies).

The package metadata is then used to create an equivalent environment suitable for the cluster. Our cluster of high end machines then creates an artifact that your cluster then downloads and installs. You can read some more details about the system behind this on our blog.

Performance

Package sync clusters typically launch as fast, and in some cases much faster, than clusters using an equivalent Docker image. (Even faster if you include image build time.)

For the fastest build time, conda/mamba based environments are currently recommended. However your choice of package manager does not affect install time on clusters, it’s lighting fast any way.

If you don’t change any packages locally, re-launching a cluster using the same environment is faster for the following 24 hours as we cache your environment.

Note that your internet connection speed is generally not important for package sync performance: the cluster downloads packages directly from their sources (conda channels, PyPI, etc.); they’re not uploaded from your computer (besides local dependencies).

Local packages and Git dependencies

If you have local packages installed, such as with pip install -e <some-directory>, package sync will try to install them on the cluster.

If you have local Python files in your working directory, those files will also be synced with the cluster. Please note that other local files are not synced.

If you’ve installed a package from Git, such as with pip install git+ssh://git@github.com/dask/distributed.git@d74f5006, the same process will occur. The reason we upload the package from your machine, instead of pulling it from Git on the cluster, is to avoid issues with private Git repos: your local credentials can stay local, instead of needing to get them onto the cluster!

Note

When installing packages from Git, you may need to add the --use-pep517 flag to pip, like:

pip install git+https://github.com/dask/distributed.git --use-pep517

Without this flag, for some packages pip may not record sufficient metadata to tell that the package was installed from Git, versus from a normal pip install distributed.

Warning

Your compiled local packages are currently uploaded to storage controlled by Coiled, then downloaded by your cluster. Only your cluster can access these files, and they are deleted after 24h. While this will change in the future, if having a copy of your source code under Coiled’s control violates your organization’s security policies, we recommend not using package sync with local code currently.

Custom PyPI URLs

If you have a custom PyPI index_url or extra-index-url, you will need to have it configured via pip config set 'global.extra-index-url' <URL> locally. If you usually just pass the --extra-index-url or --index-url argument when you run pip install, no record of where the package came from is stored in the environment, so package sync will fail to install it on the cluster.

If you do not want to include the username and password in the URL locally, you can include it in a netrc file.

We do not currently support keyring credential storage.

Ignoring packages

If you have packages installed locally that you don’t want synced to the cluster, you can list them in the package_sync_ignore argument to coiled.Cluster. This is generally not needed, though, because package sync installation on the cluster is so fast that installing extra, unused packages has a negligible effect on cluster startup time.

Note that only these exact packages are ignored—their dependencies may still be installed. Additionally, if another package depends on them, they will still be installed.

Cross-platform fuzzing

When using a macOS or Windows machine to launch clusters (which always run Linux), you may not get exactly the same versions of all packages on the cluster as you have locally. This occurs because packages sometimes require slightly different dependencies on different platforms. If package sync tried to install the exact same versions of everything you have on macOS, for example, onto Linux, some versions might not exist for Linux, or might have incompatible requirements, and installation would fail.

Therefore, package sync uses a looser version match with cross-platform clusters. Specifically, we only match major.minor in the version number, not the exact version (typically major.minor.patch).

When launching a cluster from Linux, though, the exact versions of all packages will be matched. This is the recommended way to run workloads that require the highest degree of reproducibility. See Using Package Sync in Production for details.

Mandatory packages

Package sync will refuse to start a cluster if you don’t have these basic packages for running Dask installed locally:

dask
distributed
tornado
cloudpickle
msgpack

Additionally, package sync ensures that the versions of these packages match exactly with what you have locally, even cross-platform.