Repeated cluster timeout errors

Creating a cluster can sometimes fail with a connection timeout error, for example:

OSError: Timed out trying to connect to tls://54.212.201.147:8786 after 5 s

This could be due to a port being blocked or due to using a very old version of Dask. If you’re using your own VPC (rather than one created by Coiled) or are restricting public network access to your cluster, it’s also possible something in your network is misconfigured.

First, check whether the default port 8786 is blocked by trying to load http://portquiz.net:8786 on your local machine. If the page fails to load, the port is likely blocked on your local machine or network; see Working around blocked ports for instructions.
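The same check can be scripted instead of using a browser. The following is a minimal sketch using only Python's standard `socket` module; the helper name `can_connect` is hypothetical and not part of Coiled or Dask.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures
        return False
```

Calling `can_connect("portquiz.net", 8786)` from your local machine and getting `False` suggests that outbound traffic on port 8786 is being blocked, although a general network outage would produce the same result.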

If you’re using your own network functionality (see Bring your own network), have configured a cluster firewall to only allow connections from certain CIDR blocks (see Opening ports for a specific CIDR block), or are trying to connect to the scheduler on a private IP address (see Connecting on a private IP address), double-check your network configuration to make sure that your local client is connecting from the correct network/CIDR block and using the correct IP address.
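One way to confirm that a client address falls inside an allowed CIDR block is Python's standard `ipaddress` module. This is a sketch only: the helper name `ip_in_cidr` is hypothetical, and the addresses are placeholders from reserved documentation ranges that you would replace with your own client IP and firewall block.

```python
import ipaddress


def ip_in_cidr(ip: str, cidr: str) -> bool:
    """Return True if the address ip falls inside the CIDR block cidr."""
    # strict=False tolerates blocks written with host bits set, e.g. 10.0.1.5/16
    network = ipaddress.ip_network(cidr, strict=False)
    return ipaddress.ip_address(ip) in network


# Placeholder values; substitute your client's IP and your allowed block
print(ip_in_cidr("203.0.113.25", "203.0.113.0/24"))  # True
print(ip_in_cidr("198.51.100.7", "203.0.113.0/24"))  # False
```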

If none of these apply and you’re using a version of Dask that’s over one year old, see Upgrading Dask, below.

Upgrading Dask

If you are using Dask version 2021.10.0, you may see this error repeated.

Note

The repeated error messages occurred when a periodic callback hit an intermittent network connectivity issue, which put it into a frequently repeating error condition, as described in the following Dask issue and resolution.

You can resolve the issue by upgrading to Dask 2021.11.0 or later.
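To see whether your local installation is affected, you can compare the installed versions against 2021.11.0. The sketch below is illustrative only: `needs_upgrade` and `MIN_VERSION` are hypothetical names, and it assumes the version string starts with three dot-separated numbers, as Dask's calendar versions do.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (2021, 11, 0)  # first fixed release, per the text above


def needs_upgrade(version_text: str, minimum: tuple = MIN_VERSION) -> bool:
    """Return True if version_text is older than minimum."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_text!r}")
    # Tuples of ints compare element by element: (2021, 10, 0) < (2021, 11, 0)
    return tuple(int(part) for part in match.groups()) < minimum


# Report the status of whatever is installed in the current environment
for package in ("dask", "distributed"):
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed")
        continue
    status = "upgrade needed" if needs_upgrade(installed) else "ok"
    print(f"{package} {installed}: {status}")
```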

You’ll want to update your local version of Dask, for example:

pip install dask distributed --upgrade

And also update your Coiled software environment:

import coiled

# Pin both packages to the first release that contains the fix
coiled.create_software_environment(
    name="my-pip-env",
    pip=["dask>=2021.11.0", "distributed>=2021.11.0"],
)