Scaling XGBoost with Dask and Coiled#

XGBoost is a library used for training gradient boosted supervised machine learning models. In this guide, you’ll learn how to train an XGBoost model in parallel in your own cloud account using Dask and Coiled. Download this jupyter notebook to follow along.

Before you start#

You’ll first need to create consistent local and remote software environments with dask, coiled, and the necessary dependencies installed. You can use coiled-runtime, a conda metapackage, which already includes xgboost and dask-ml.

You can install coiled-runtime locally in a conda environment:

conda create -n xgboost-example -c conda-forge python=3.9 coiled-runtime

And activate the conda environment you just created:

conda activate xgboost-example

Launch your Coiled cluster#

Create a Dask cluster in your cloud account with Coiled:

import coiled

cluster = coiled.Cluster(

And connect Dask to your remote Coiled cluster:

import dask.distributed

client = dask.distributed.Client(cluster)



Connection method: Cluster object Cluster type: coiled.ClusterBeta

Cluster Info

Train your model#

You’ll use the Higgs dataset available on Amazon S3. This dataset is composed of 11 million simulated particle collisions, each of which is described by 28 real-valued features and a binary label indicating which class the sample belongs to (i.e. whether the sample represents a signal or background event).

You’ll use Dask’s read_csv function makes to read in all the CSV files in the dataset:

import dask.dataframe as dd

# Load the entire dataset lazily using Dask
ddf = dd.read_csv("s3://coiled-data/higgs/higgs-*.csv", storage_options={"anon": True})

You can separate the classification label and training features and then partition the dataset into training and testing samples. Dask’s machine learning library, Dask-ML, mimics Scikit-learn’s API, providing scalable versions of sklearn.datasets.make_classification and sklearn.model_selection.train_test_split that are designed to work with Dask Arrays and DataFrames larger than available RAM.

from dask_ml.model_selection import train_test_split

X, y = ddf.iloc[:, 1:], ddf["labels"]
# use Dask-ML to generate test and train datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=2)

Next you’ll persist your training and testing datasets into memory to avoid re-computations (see the Dask documentation for best practices using persist):

import dask

X_train, X_test, y_train, y_test = dask.persist(X_train, X_test, y_train, y_test)

To do distributed training of an XGBoost model, you’ll use XGBoost with Dask (see the XGBoost tutorial on using XGBoost with Dask). You’ll need to first construct the xgboost.DMatrix object for both your training and testing datasets – these are the internal data structures XGBoost uses to manage dataset features and targets. Since you’re using XGBoost with Dask, you can pass your training and testing datasets directly to xgboost.dask.DMatrix().

import xgboost

dtrain = xgboost.dask.DaskDMatrix(client=client, data=X_train, label=y_train)

Next you’ll define the set of hyperparameters to use for the model and train the model (see the XGBoost documentation on parameters):

params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'min_child_weight': 0.5,
    'eval_metric': 'logloss'

bst = xgboost.dask.train(client,  params, dtrain, num_boost_round=3)

Generate model predictions#

Now that your model has been trained, you can use it to make predictions on the testing dataset which was not used to train the model:

y_pred = xgboost.dask.predict(client, bst, X_test)
y_test, y_pred = dask.compute(y_test, y_pred)

Voilà! Congratulations on training a boosted decision tree in the cloud.

Once you’re done, you can shutdown the cluster (it will shutdown automatically after 20 minutes of inactivity):


Next steps#

For a more in-depth look at what you can do with XGBoost, Dask, and Coiled, check out this Coiled blogpost.