KilledWorker Exception

This article aims to help you diagnose and understand a bit better what the KilledWorker exception means. To get us started, let’s first show you how you can diagnose this error, then I will explain briefly why this error can happen and finally, how you might mitigate against it.

Diagnose the error

The best way to understand what went wrong is to check the worker logs. You might find a traceback with some exception or some other information about what happened with that worker.

You can head over to your dashboard and click the ‘View logs’ button on Coiled’s affected cluster. You can see both the scheduler and workers’ logs, open a worker log, and scroll down to see the last thing logged in that worker.

On Dask, logs will use standard error by default, which means that if you are using Dask, you need to check your terminal for any information as to what happened.

Example

Once I forgot to add the s3fs dependency when creating a software environment. I opened a notebook and started working through an example. All worked fine until I tried to run df.head() and got the following error:

KilledWorker: ("('read-csv-c075100a8373a0cab23cd3671871aeb7', 470)", <Worker 'tls://10.2.1.52:44429', name: fabiorosado-3880-worker-5-afdae8, memory: 0, processing: 179>)

Heading over to my Coiled dashboard to check the cluster logs, I have noticed that I was missing the s3fs dependency from the exception there.

ModuleNotFoundError: No module named 's3fs'

Understanding the error

We have already seen a possible cause of this error, but there are many other reasons why we might get a KilledWorker exception:

  • Caused by a segmentation fault

  • Too much memory allocated to the worker

  • Using a library that is not threadsafe

  • The worker died unexpectedly

When a Task dies on a worker, it might be sent to a different worker. If the same tasks kill that worker, then Dask decides to blame the task itself. This behaviour is a way to protect the cluster against tasks that kill workers (like throwing a memory error, for example.)

Because there are various reasons you might get this KilledWorker exception, it’s important to reiterate that checking the logs is the best way to gather information and figure out why something went wrong.

Mitigate the error

When it comes to mitigating an exception such as this, it’s hard to point to specific things, because there are many reasons why you might get this error. A few things that I can suggest is to:

  • Increase the number of allowed failures before a worker is killed

  • Make sure you allocate the right memory to your workers

  • Watch the diagnostic dashboard for memory spikes (only possible while the worker is active)

  • Make sure you are running the same package version on your client and worker.

Increase the number of allowed failures

When creating a cluster with coiled, you can use the scheduler_options keyword to increase the number of allowed failures. For example:

cluster = coiled.Cluster(scheduler_options={"allowed_failures": 4})

Reference: coiled.Cluster()

On dask, you can use the config key distributed.scheduler.allowed-failures to change the number of allowed failures. You can also edit the file ~/.config/dask/distributed.yaml and find the line allowed-failures.

Allocate the correct memory to your workers

With coiled, you can choose how much memory your workers should have when you create your cluster. If you don’t want to decide, we will allocate 8 GiB by default per worker.

You can allocate more memory to your workers with the keyword worker_memory that you can pass to both coiled.Cluster or coiled.create_cluster_configuration. For example:

cluster = coiled.Cluster(worker_memory="16GiB")

Use the same dependencies versions

Using the coiled CLI, you can run the command coiled install <software environment name> to create a new conda environment with the same dependencies and versions we used when creating your software environment.

Doing this will also help fix the version mismatch between the client and the scheduler/workers running in the cluster.

Note that the above command creates a new conda environment, so you will need to activate it before starting your work on your cluster.