Repeated cluster timeout errors

Problem

Affected versions: Dask 2021.10.0

When creating or working with a Dask cluster in Coiled, you might see repeated messages with asyncio.exceptions.TimeoutError or asyncio.exceptions.CancelledError in your Jupyter Notebooks or Python shell, which will appear similar to the following errors:

tornado.application - ERROR - Exception in callback functools.partial(<bound
method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object
at 0x1334d13d0>>, <Task finished name='Task-337' coro=<Cluster._sync_cluster_info()
done, defined at /Users/username/dev/distributed/distributed/deploy/cluster.py:104>
exception=OSError('Timed out trying to connect to tls://54.212.201.147:8786 after 5 s')>)
Traceback (most recent call last):
  File "/Users/username/dev/distributed/distributed/comm/tcp.py", line 398, in connect
    stream = await self.client.connect(
  File "/Users/username/dev/dask-playground/env/lib/python3.9/site-packages/tornado/tcpclient.py", line 288, in connect
    stream = await stream.start_tls(
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/username/.pyenv/versions/3.9.1/lib/python3.9/asyncio/tasks.py", line 489, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/username/dev/distributed/distributed/comm/core.py", line 284, in connect
    comm = await asyncio.wait_for(
  File "/Users/username/.pyenv/versions/3.9.1/lib/python3.9/asyncio/tasks.py", line 491, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/username/dev/dask-playground/env/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/Users/username/dev/dask-playground/env/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/Users/username/dev/distributed/distributed/deploy/cluster.py", line 105, in _sync_cluster_info
    await self.scheduler_comm.set_metadata(
  File "/Users/username/dev/distributed/distributed/core.py", line 785, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/Users/username/dev/distributed/distributed/core.py", line 742, in live_comm
    comm = await connect(
  File "/Users/username/dev/distributed/distributed/comm/core.py", line 308, in connect
    raise OSError(
OSError: Timed out trying to connect to tls://54.212.201.147:8786 after 5 s

The repeated error messages were caused when a periodic callback encountered an intermittent network connectivity issue and resulted in a frequently repeating error condition, as described in the following Dask issue and resolution.

Solution

Upgrading to Dask 2021.11.0 or a newer version will resolve this issue and stop the repeated error messages. You can upgrade to the latest version of Dask on your local machine by running the following command in a terminal:

pip install dask distributed --upgrade

or

conda update dask distributed -c conda-forge

If you are using any custom software environments in Coiled, you’ll need to update the version of Dask in those environments and rebuild them by running the following command with the desired version of Dask (or, you can omit the version specifier to use the latest version of Dask):

coiled.create_software_environment(
    name="my-pip-env",
    pip=["dask==2021.11.0", "distributed==2021.11.0"],
)