Access Remote Data#

You’ll probably need your Dask workers to process private data (e.g. in S3), so you will need a way for those workers to authenticate.

In most cases, Coiled handles this for you and you can run the same code on your cluster that you do locally.

AWS#

Suppose you have an object in an S3 bucket s3://david-auth-example/hello.txt that’s only accessible from your AWS account.

You might have code like this that you run locally to read some data from S3:

import boto3

def read_object():
    s3 = boto3.client("s3")
    data = s3.get_object(Bucket="david-auth-example", Key="hello.txt")
    return data["Body"].read()

text = read_object()

When this code runs on your local machine, it’s using the AWS credentials you have in your local environment. What would you need to do to make this code work on a Coiled cluster?

Personal STS tokens#

Our goal is to make the transition from running locally to running in a cluster as seamless as possible, so by default Coiled uses your local credentials to generate temporary STS tokens (encrypted in transit, refreshed as needed) and makes them available to your cluster.

Anything that relies on standard tools like boto3 will use these credentials automatically. (In more detail, the credentials are picked up by the “Container credential provider” in the AWS Credential provider chain.)

The identity we use for this is whichever one Boto on your local machine uses by default. For example, it could come from environment variables or a ~/.aws/credentials file.

To check which identity Coiled will use by default, run aws sts get-caller-identity (if you have the AWS CLI installed), or in Python:

import boto3

# show the account and IAM identity that your local credentials resolve to
sts = boto3.client("sts")
print(sts.get_caller_identity())

If the read_object function above worked locally, it will work on the cluster too.
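For example, a minimal sketch of running the same function on a Coiled cluster (the cluster options here are just an example) might look like this:

import boto3
import coiled
from dask.distributed import Client

def read_object():
    # on the cluster, boto3 automatically picks up the temporary STS token
    # that Coiled forwards, so no credential handling is needed in this code
    s3 = boto3.client("s3")
    data = s3.get_object(Bucket="david-auth-example", Key="hello.txt")
    return data["Body"].read()

cluster = coiled.Cluster(n_workers=2)
client = Client(cluster)

# run the same function on a worker
text = client.submit(read_object).result()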

You can turn off this default by specifying coiled.Cluster(..., credentials=None) when you start your cluster.
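For example (a minimal sketch; n_workers is just an example option):

import coiled

# start a cluster without forwarding your local AWS credentials;
# code on the cluster will then have no AWS credentials unless you
# provide them some other way
cluster = coiled.Cluster(n_workers=2, credentials=None)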

GCP#

Suppose you have an object in a Google Storage bucket that’s only accessible using your Google Cloud credentials.

You might have code like this that you run locally to read a file from Google Cloud Storage:

from google.cloud import storage

def read_object():
    storage_client = storage.Client(project="my-gcp-project")
    bucket = storage_client.bucket("my-private-bucket")
    return storage.Blob("hello.txt", bucket).download_as_string()

text = read_object()

Or perhaps you explicitly pass credentials in your code. What would you need to do to make this code work on a Coiled cluster?

Personal OAuth2 tokens#

Our goal is to make the transition from running locally to running in a cluster as seamless as possible, so by default Coiled uses your local Google Application Default Credentials (if set) to create a temporary OAuth2 token (encrypted in transit) that we send to the cluster.

This allows you to use the same Google data-access permissions locally and in the cloud.

There are three required steps:

  1. Configure your local Application Default Credentials

  2. Install the google-cloud-iam library so Coiled can make an OAuth2 token from your local Application Default Credentials

  3. Use the OAuth2 token in code running on your cluster

The easiest way to get Application Default Credentials is by installing gcloud and running gcloud auth application-default login:

conda install -c conda-forge google-cloud-sdk
gcloud auth application-default login

(See the Google doc on User credentials provided by using the gcloud CLI for more details).

For Coiled to automatically create temporary OAuth2 tokens from your Application Default Credentials and ship those to your cluster, you’ll need to install google-cloud-iam in the local software environment from which you’re using the coiled client to start your cluster.

conda install -c conda-forge google-cloud-iam  # or pip install google-cloud-iam

The OAuth2 token is shipped (securely) to your cluster and set as an environment variable, but you’ll have to explicitly modify your code on the cluster to make use of it.

The coiled package provides a CoiledShippedCredentials class that subclasses the Google Credentials class and automatically handles using and refreshing the shipped OAuth2 token.

The Dask DataFrame API accepts a storage_options keyword argument, which you can use to pass the credentials explicitly:

from coiled.credentials.google import CoiledShippedCredentials

# the dataframe you want to write to Google Cloud Storage
df = ...

# write the dataframe
df.to_parquet(
    "gs://gcs-bucket-name-goes-here/path/goes/here",
    # explicitly pass credentials using storage_options
    storage_options={"token": CoiledShippedCredentials()},
)

If you’re using gcsfs, you can directly pass the token like this:

from coiled.credentials.google import CoiledShippedCredentials
import gcsfs

fs = gcsfs.GCSFileSystem(
    "my-google-cloud-project",
    # explicitly pass credentials to GCSFS
    token=CoiledShippedCredentials(),
)

# do something with the filesystem
fs.ls("my-bucket")

CoiledShippedCredentials subclasses the google.oauth2.credentials.Credentials class, so it can be used directly with Google Python libraries:

from coiled.credentials.google import CoiledShippedCredentials
from google.cloud import storage

def read_object():
    storage_client = storage.Client(
        project="my-gcp-project",
        # explicitly pass credentials to the storage client
        credentials=CoiledShippedCredentials(),
    )
    bucket = storage_client.bucket("my-private-bucket")
    return storage.Blob("hello.txt", bucket).download_as_string()

text = read_object()

Long-lived (revocable) Application Default Credentials#

If you’re comfortable putting longer-lived credentials on your cloud VMs, Coiled can forward your local Application Default Credentials to the cluster.

The advantage of this is that they’ll automatically be used by libraries like gcsfs and GDAL, without you needing to explicitly pass the credentials to those libraries.

The disadvantage is that these credentials are longer-lived than the temporary OAuth2 tokens, so the exposure is greater if they’re compromised. The credentials are sent securely over an encrypted connection directly to your cluster (they do not pass through the Coiled control plane), and you can manually revoke them if you have any concern that they’ve been exposed.

There are three required steps:

  1. Configure your local Application Default Credentials

  2. Install the google-auth library so Coiled can retrieve your local Application Default Credentials

  3. Send the Application Default Credentials to your Coiled cluster

The easiest way to get Application Default Credentials is by installing gcloud and running gcloud auth application-default login:

conda install -c conda-forge google-cloud-sdk
gcloud auth application-default login

(See the Google doc on User credentials provided by using the gcloud CLI for more details).

To send the Application Default Credentials to your Coiled cluster, run this:

import coiled
from coiled.credentials.google import send_application_default_credentials

# create a Coiled cluster, for example...
cluster = coiled.Cluster(...)

# send Application Default Credentials
send_application_default_credentials(cluster)

# Now gcsfs will automatically pick up these credentials.
# For example, `pandas.read_csv("gs://...")` on cluster should work,
# assuming that it works locally with your local Application Default Credentials.
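As a rough sketch (the bucket path is a placeholder), once the credentials have been sent, code running on the cluster can read private data without passing a token explicitly:

from dask.distributed import Client
import dask.dataframe as dd

client = Client(cluster)  # connect to the cluster from the example above

# gcsfs (used by Dask under the hood) picks up the forwarded Application
# Default Credentials on the workers, so no token argument is needed
df = dd.read_csv("gs://my-private-bucket/path/to/*.csv")
print(df.head())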

Should you wish to revoke the credentials, you can do this locally (from the same machine where you ran send_application_default_credentials(cluster)):

gcloud auth application-default revoke

Service Account#

If you’d rather not use locally configured Application Default Credentials, you can also grant data access using a Google Cloud Service Account.

When you configure Coiled to use your Google Cloud account (coiled setup gcp), we ask for two service accounts: one that Coiled uses to run instances on your behalf, and a data access service account that controls what permissions those instances have.

You can give your workers permissions within Google Cloud by assigning permissions to that data access service account.
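For example, here is one way to do that with the google-cloud-storage library (a sketch; the bucket name and service account email are placeholders, and you could equally grant the role with the gcloud CLI or the Cloud Console):

from google.cloud import storage

client = storage.Client(project="my-gcp-project")
bucket = client.bucket("my-private-bucket")

# grant the data access service account read access to the bucket
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"serviceAccount:data-access@my-gcp-project.iam.gserviceaccount.com"},
    }
)
bucket.set_iam_policy(policy)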

Service-agnostic authentication#

If your code needs to use passwords or secret tokens to access data, we recommend sending these to the cluster as environment variables that you can read and use in your code.

import coiled

cluster = coiled.Cluster(...)
cluster.send_private_envs({"AUTH_TOKEN_FOR_CUSTOM_DATABASE": "some-token"})

This method sends the private environment variables directly from your local machine to the cluster over an encrypted connection; they are never stored by Coiled except in memory on your cluster.

In code submitted to the cluster, you could then use os.environ.get("AUTH_TOKEN_FOR_CUSTOM_DATABASE") to retrieve "some-token".
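For example, a minimal sketch (the function and how you use the token are illustrative; that part depends on your database client):

import os
from dask.distributed import Client

def check_token():
    # read the token that was sent with send_private_envs
    token = os.environ.get("AUTH_TOKEN_FOR_CUSTOM_DATABASE")
    # ... authenticate with your database using the token here ...
    return token is not None

client = Client(cluster)
assert client.submit(check_token).result()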