Using Package Sync in Production
When running production workloads, it’s best practice to have a deterministic software environment, meaning the same set of packages is installed every time. You don’t want a critical pipeline to go down because a buggy new version of a dependency was just released.
Package sync is fully deterministic for same-platform clusters: the conda and pip packages installed on the cluster are exactly what is installed on the client. Therefore, as long as you lock down the client environment, the cluster’s Python environment will be locked down as well.
The best practices for using package sync in production are:
Ensure client environment is consistent
Package sync works with nearly any method of reproducing a Python environment, from deterministic tools like Poetry to a requirements.txt file, because it just scans the currently-installed packages. Examples of these methods include:
requirements.txt for pip packages
Docker (though in some cases, you may want to just run the Docker image on the cluster; see notes here)
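To make "scans the currently-installed packages" concrete, here is a minimal sketch of what such a scan could look like using only the Python standard library. It is an illustration, not package sync’s actual implementation; the function name `installed_packages` is hypothetical, and the sketch ignores conda metadata entirely.

```python
from importlib import metadata


def installed_packages() -> dict[str, str]:
    """Return {normalized name: version} for distributions visible to this interpreter."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if not name or not version:
            continue  # skip distributions with broken or missing metadata
        # Normalize loosely so "Foo_Bar" and "foo-bar" map to the same key.
        packages[name.lower().replace("_", "-")] = version
    return packages


if __name__ == "__main__":
    # Print the environment in a pip-freeze-like form.
    for name, version in sorted(installed_packages().items()):
        print(f"{name}=={version}")
```

Because the result depends only on what is installed, any tool that reproduces the same installed set on the client yields the same scan.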
Note that clusters using package sync are only as deterministic as your environment. If you use a requirements.txt that doesn’t pin every package, for example, the packages installed on your client (and therefore on the cluster) could change without any change to that file, just because a new version of a package was released.
Tools like Poetry, conda-lock, or pip-tools are made to address this problem. They create a lockfile that specifies every package, including transitive dependencies, so you’re guaranteed to always get the same environment, even when packages you don’t realize you depend on release new versions. Committing the lockfile to version control also makes it easy to roll back to past versions of the environment.
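As a rough illustration of the problem lockfiles solve, the sketch below flags requirement lines that are not pinned to an exact version. The helper `unpinned_requirements` is hypothetical and deliberately simplified: it treats any specifier containing `==` as pinned, skips pip option lines such as `-r` or `-e`, and does not parse the full requirements file format.

```python
import re


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement specifiers that lack an exact '==' pin."""
    unpinned = []
    for raw in text.splitlines():
        # Drop comments: a '#' at line start or preceded by whitespace.
        line = re.sub(r"(^|\s+)#.*$", "", raw).strip()
        if not line or line.startswith("-"):
            continue  # blank line or a pip option such as -r / -e
        # Ignore environment markers (after ';'), which may contain '=='.
        spec = line.split(";", 1)[0].strip()
        if "==" not in spec:
            unpinned.append(spec)
    return unpinned


sample = """\
numpy==1.26.4
pandas>=2.0  # floating upper bound
requests
"""
print(unpinned_requirements(sample))
```

Even a file that passes this check only pins direct dependencies; transitive dependencies can still change, which is why a full lockfile is the more robust option.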
Launch clusters from Linux
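Because package sync is fully deterministic only when the client and the cluster share a platform, production clusters should be launched from a Linux client. Luckily, if you’re running an automated system in production, the client is almost certainly already running Linux. Examples of places you might create a cluster from in production: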
CI systems (GitHub Actions, CircleCI, etc.)
AWS ECS, AWS Fargate, GCP Cloud Run
Anything else running a Docker image
In all of these, the client creating the cluster would be running on Linux.
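A production job could guard against being launched from a non-Linux machine by accident with a check like the one below. This is a minimal sketch using the standard library; the function names are hypothetical and are not part of package sync.

```python
import sys


def client_is_linux(platform: str = sys.platform) -> bool:
    """Return True when the given platform string identifies Linux."""
    return platform.startswith("linux")


def require_linux_client(platform: str = sys.platform) -> None:
    """Raise before any cluster is created if the client is not on Linux."""
    if not client_is_linux(platform):
        raise RuntimeError(
            f"Production clusters should be launched from Linux, "
            f"but this client reports platform {platform!r}."
        )
```

Calling `require_linux_client()` at the top of a deployment script fails fast on a developer laptop running macOS or Windows, while passing silently in CI or in a Linux container.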
When not to use package sync in production
You need system dependencies that can’t be installed with package sync.
You’re using hardware that requires dependencies that you can’t install locally (e.g. GPUs).
You already have a Docker image you’d like to use, and infrastructure to build and maintain that image.
If you’re already using Docker, you could still use package sync to get faster cluster startup than you would by running the image on the cluster (assuming the image contains only conda and pip packages, not system dependencies).
However, it might make sense to keep using your Docker image on the cluster if it already works for you. If you already have a system you’re happy with, no need to change it!
Your organization has security scanning pipelines for Docker images or other such restrictions, and expects only those images to be run.