Package Synchronization#

Important

Package sync is still in beta (see known limitations below).

Enabling package synchronization (package sync) is as simple as:

from coiled import Cluster

with Cluster(package_sync=True):
    # dask work!
    pass

Package sync will then scan your local Python environment and replicate it on the cluster.
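A fuller session might look like this (a sketch: n_workers and the computation are illustrative):

import coiled
from dask.distributed import Client

# package sync replicates the local environment, so there is no
# software environment to build or name
with coiled.Cluster(package_sync=True, n_workers=4) as cluster:
    client = Client(cluster)
    print(client.submit(lambda x: x + 1, 41).result())  # runs on the cluster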

The Problem#

By far the most common issue users encounter when running Dask in a truly distributed fashion in the cloud is environment desynchronization. When this happens, if you’re lucky, the error is obvious. If you’re unlucky, you could be debugging strange error messages for hours, or worse, get no errors at all but results that are inconsistent!

Sometimes the errors only happen after hours of processing, leading to an incredibly frustrating experience.

So how does your environment get out of sync? Sometimes it’s pretty straightforward: you pip installed something and forgot to run create_software_environment with your new package.
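That manual workflow looks something like this (a sketch; the environment name and pins are illustrative), and every local pip install quietly invalidates it:

import coiled

# build a named software environment by hand...
coiled.create_software_environment(
    name="my-pipeline",
    pip=["dask==2022.7.1", "distributed==2022.7.1"],
)

# ...and point the cluster at that environment, not your local one
cluster = coiled.Cluster(software="my-pipeline")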

Another example: you specify a conda environment.yml file like this one:

channels:
- conda-forge
dependencies:
- dask==2022.07.01
- distributed==2022.07.01

On paper this looks pretty good. After all, we’ve pinned the exact versions of the packages we need. Let’s look at what this actually produces.
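You can inspect the fully solved environment yourself; for example (the environment name is arbitrary, and the exact package list will vary by platform and solve date):

conda env create -n pinning-demo -f environment.yml
conda env export -n pinning-demo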

Over 80 packages are installed by conda, and only two of them are pinned, which means any of the others could change at any time. We forgot to include python too, so even the Python version could change! We really only pinned the very tip of our environment iceberg.

So if you installed this environment locally and created a Coiled software environment from it, you’d probably only have a synchronized environment for a week or two, until one of those packages updated.

The Solution#

For production, most people build a Docker image and then use it both on the cluster and in their pipeline, which bypasses this issue. However, very few people enjoy developing inside a Docker image locally, especially on platforms where there’s no native Docker.

This is where package sync comes in. Instead of just looking at the tip of the iceberg, package sync works with your whole environment as-is when you create a cluster!

Iterating on a feature and need to grab a new requirement to try something out? Great! Just pip/conda install it and start up a cluster; package sync has your back.

Package Sync Features#

Package Levels#

Critical Packages#

We maintain an internal list of packages we consider critical for a cluster; if you don’t have these installed, your cluster will never work:

  • dask

  • distributed

  • tornado

  • cloudpickle

  • msgpack

We also ensure these packages match exactly. Even small mismatches here are likely to cause issues.
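You can verify that your client and a running cluster agree on these versions using distributed’s built-in check (get_versions is part of dask.distributed, not package sync; cluster here is an existing coiled.Cluster):

from dask.distributed import Client

client = Client(cluster)
# compares package versions across the client, scheduler, and workers,
# and raises if the important ones don't match
client.get_versions(check=True)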

Unimportant Packages#

Both macOS and Windows have some packages that are only installed for them. For example, Windows conda environments will often have Windows API-related packages. Trying to install these on the Linux-based cluster simply would not work, so by default we ignore them.

Everything else#

By default, we take the local version of your package and install it with <yourpackage>~=<version>. We allow some wiggle room here because being too strict across platforms often causes trouble: packages frequently have slightly different dependencies on different platforms.
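Concretely, ~= is the “compatible release” operator: a local pandas 1.5.2 becomes pandas~=1.5.2 on the cluster, which permits patch releases but not a minor version bump. A quick illustration using the packaging library (the version numbers are hypothetical):

from packaging.specifiers import SpecifierSet

# pandas~=1.5.2 is equivalent to pandas>=1.5.2,<1.6
spec = SpecifierSet("~=1.5.2")
print("1.5.3" in spec)  # True: patch updates are allowed
print("1.6.0" in spec)  # False: minor bumps are not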

Path or Git dependencies#

Often you’ll be working with packages installed locally via pip install -e <some-directory>. Package sync will attempt to create a wheel of that package and sync it to the cluster, ensuring you’re always running your latest changes in the cloud.

Warning

This currently has the limitation that your package must work with pip wheel <package>. If you have compiled dependencies, you must be running on the same platform as the cluster (64-bit Linux); we do not try to cross-compile your package!
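You can test this ahead of time by building the wheel locally (the output directory here is arbitrary):

pip wheel <some-directory> --wheel-dir ./wheels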

If you’ve installed a package from git, for example with pip install git+ssh://git@github.com/dask/distributed, the same process occurs. The reason we build a wheel of git packages is to smooth over issues with private git repos: building a wheel means we can keep your local credentials local, instead of trying to get them onto the cluster!

Warning

The compiled wheels are currently uploaded to a secure S3 bucket under the control of Coiled so they can be downloaded by your cluster. While this will change in the future, if this is undesirable we recommend not using package sync for now.

Package Sync for production#

Package sync allows some fuzz between environments. This smooths over replicating environments between totally different operating systems, like Linux and Windows. For a stricter experience, where package sync tries to exactly match your local environment, you can pass package_sync_strict:

Cluster(package_sync=True, package_sync_strict=True)

This is almost guaranteed to fail unless your client OS/platform matches the cluster’s, which is currently Ubuntu on x86.

Package Sync Limitations#

Unsolvable environments#

Your environment needs to be consistent with what your package manager would accept as a valid environment. For example:

dask==2022.1.10
distributed==2022.05.1

Pip will error out trying to install this environment, as will conda, because distributed pins the matching dask version.
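Running pip check against your local environment is a quick way to catch inconsistencies like this before starting a cluster; it reports any installed packages whose declared requirements are not satisfied:

pip check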

Virtual environments#

Using venv with package sync is much less tested than using conda. Here are some tips that may help while we continue to develop this feature:

  • Install recent versions of pip and setuptools with pip install -U pip setuptools

  • Install the wheel package: pip install wheel

  • When pip installing packages that are not on PyPI, you may need the --use-pep517 flag, for example:

pip install git+https://github.com/dask/distributed.git --use-pep517

Packages you can’t install locally#

Sometimes you might be working with a package that can only be installed in the cluster environment, perhaps a GPU package. Currently, package sync does not allow you to add an extra package just for the cluster.

Packages that have special build requirements#

If you have packages installed that don’t have pre-built wheels available, and that have build requirements beyond what is included in the standard build-essential and python-dev Ubuntu packages, you’ll find package sync fails. If your package is available in a conda repo, we suggest using that instead.

Packages that do not list their Python build time requirements#

This is a slight variant of compiled packages missing build-time system dependencies. If a package uses the deprecated setup.py and imports something that is not listed as a PEP 517 build requirement or in setup_requires, the build will also fail.

An example of this is crick 0.0.3, which imports numpy and Cython in setup.py but does not list them as build dependencies.
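If you maintain a package like this, declaring the build requirements explicitly in pyproject.toml fixes the problem (a minimal sketch for a setuptools-based build):

[build-system]
requires = ["setuptools", "wheel", "Cython", "numpy"]
build-backend = "setuptools.build_meta"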