Access Remote Data#

You’ll probably need your Dask workers to process private data (e.g. in S3), so you will need a way for those workers to authenticate.

In most cases, Coiled handles this for you and you can run the same code on your cluster that you do locally.

AWS#

Suppose you have an object in an S3 bucket s3://david-auth-example/hello.txt that’s only accessible from your AWS account.

You might have code like this that you run locally to read some data from S3:

import boto3

def read_object():
    # uses whatever AWS credentials boto3 finds in the current environment
    s3 = boto3.client("s3")
    data = s3.get_object(Bucket="david-auth-example", Key="hello.txt")
    return data["Body"].read()

text = read_object()

When this code runs on your local machine, it’s using the AWS credentials you have in your local environment. What would you need to do to make this code work on a Coiled cluster?

Personal STS tokens#

Our goal is to make the transition from running locally to running in a cluster as seamless as possible, so by default Coiled uses your local credentials to create a temporary STS token (encrypted in transit) that we send to the cluster.

These tokens are refreshed as needed while your cluster is running. Anything that relies on standard tools like boto3 will use these credentials automatically. (In more detail, the credentials are picked up by the “Container credential provider” in the AWS Credential provider chain.)
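If credentials on the cluster don't behave as expected, it can help to see which credential sources are visible to boto3. The sketch below is a debugging aid, not part of Coiled's API: describe_credential_env is a hypothetical helper, and it assumes the standard environment variable names that botocore reads. In boto3's lookup order, static keys in environment variables are consulted before the container credential provider, so a stray AWS_ACCESS_KEY_ID on a worker would likely take precedence over the forwarded token.

```python
import os

# Environment variables that botocore's container credential provider reads.
CONTAINER_VARS = (
    "AWS_CONTAINER_CREDENTIALS_FULL_URI",
    "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI",
)
# Static credentials in the environment are checked earlier in the chain.
STATIC_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def describe_credential_env(env=None):
    """Report which credential-related variables are set (names only, never values)."""
    env = os.environ if env is None else env
    return {
        "container_provider": [name for name in CONTAINER_VARS if env.get(name)],
        "static_keys": [name for name in STATIC_VARS if env.get(name)],
    }
```

Assuming client is a dask.distributed.Client connected to your cluster, client.run(describe_credential_env) would return one such report per worker.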

The identity we use for this is whichever one boto3 uses by default on your local machine. For example, it could come from environment variables or a ~/.aws/credentials file.

To check which identity Coiled will use by default, run aws sts get-caller-identity (if you have the AWS CLI installed), or in Python:

import boto3
sts = boto3.client("sts")
sts.get_caller_identity()

If the read_object function above worked locally, it will work on the cluster too.
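One way to confirm this is to ask STS for the caller identity from inside a task and compare it with what you see locally. This is a minimal sketch: summarize_identity and whoami are hypothetical helper names, and client in the usage note is assumed to be a dask.distributed.Client connected to your cluster.

```python
def summarize_identity(response):
    """Condense an STS GetCallerIdentity response into a short string."""
    return f"account={response['Account']} arn={response['Arn']}"


def whoami():
    # Imported inside the function so the import happens on whichever machine
    # runs it (your laptop or a Dask worker).
    import boto3

    return summarize_identity(boto3.client("sts").get_caller_identity())
```

Locally you would call whoami(); on the cluster, client.submit(whoami).result(). The account should match, though the ARN may take a different form when temporary credentials are in use.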

You can turn off this default by specifying coiled.Cluster(..., credentials=None) when you start your cluster.

GCP#

Suppose you have an object in a Google Storage bucket that’s only accessible using your Google Cloud credentials.

You might have code like this that locally reads the file from Google Storage:

from google.cloud import storage

def read_object():
  storage_client = storage.Client(project="my-gcp-project")
  bucket = storage_client.bucket("my-private-bucket")
  return storage.Blob("hello.txt", bucket).download_as_string()

text = read_object()

Or maybe you explicitly set credentials to use in your code. What would you need to do to make this code work on a Coiled cluster?

Personal OAuth2 tokens#

Our goal is to make the transition from running locally to running in a cluster as seamless as possible, so by default Coiled uses your local Google Application Default Credentials (if set) to create a temporary OAuth2 token (encrypted in transit) that we send to the cluster.

This allows you to use the same Google data-access permissions locally and in the cloud.

There are three required steps:

  1. Configure your local Application Default Credentials

  2. Add the google-cloud-iam library so Coiled can make an OAuth2 token from your local Application Default Credentials

  3. Use the OAuth2 token in code running on your cluster

The easiest way to get Application Default Credentials is by installing gcloud and running gcloud auth application-default login:

conda install -c conda-forge google-cloud-sdk
gcloud auth application-default login

(See Google's documentation on user credentials provided by using the gcloud CLI for more details.)

For Coiled to automatically create temporary OAuth2 tokens from your Application Default Credentials and ship those to your cluster, you’ll need to install google-cloud-iam in the local software environment from which you’re using the coiled client to start your cluster.

conda install google-cloud-iam  # or pip install google-cloud-iam

The OAuth2 token is shipped (securely) to your cluster and set as an environment variable. You’ll have to explicitly modify your code to make use of this token.

Here’s the modified version of our example using google-cloud-storage:

from google.oauth2.credentials import Credentials
from google.cloud import storage
import os

def read_object():
  # get the OAuth2 token set on cluster and make a credentials object
  token = os.environ.get("CLOUDSDK_AUTH_ACCESS_TOKEN")
  token_creds = Credentials(token) if token else None

  storage_client = storage.Client(
    project="my-gcp-project",
    credentials=token_creds,  # explicitly pass credentials when making client
  )
  bucket = storage_client.bucket("my-private-bucket")
  return storage.Blob("hello.txt", bucket).download_as_string()

text = read_object()

If you’re using gcsfs, you can directly pass the token like this:

import gcsfs
import os

# this is how the OAuth2 token is available on your cluster
token = os.environ.get("CLOUDSDK_AUTH_ACCESS_TOKEN")
fs = gcsfs.GCSFileSystem("my-google-cloud-project", token=token)

# do something with the filesystem
fs.ls("my-bucket")
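Libraries that open gs:// paths through fsspec generally forward a storage_options dictionary to gcsfs, so the same token can be passed that way. The helper below is a hypothetical convenience, not a Coiled API. It returns an empty dictionary when the variable is absent, so that gcsfs falls back to its own default credential lookup.

```python
import os


def gcs_storage_options(env=None):
    """Build storage_options for gcsfs-backed readers from the cluster token, if any."""
    env = os.environ if env is None else env
    token = env.get("CLOUDSDK_AUTH_ACCESS_TOKEN")
    # Only pass a token when one was shipped; otherwise let gcsfs decide.
    return {"token": token} if token else {}
```

Because the variable is set on the cluster rather than on your local machine, call the helper inside code that runs on the cluster, for example inside a submitted function that calls pandas.read_csv("gs://my-bucket/data.csv", storage_options=gcs_storage_options()). The bucket path here is a placeholder.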

Service Account#

If you’d rather not use locally configured Application Default Credentials, you can also grant data access using a Google Cloud Service Account.

When you configure Coiled to use your Google Cloud account (coiled setup gcp), we ask for two service accounts: one that Coiled uses to run instances on your behalf, and a data access service account that controls what permissions those instances have.

You can give your workers permissions within Google Cloud by assigning permissions to that data access service account.

Service-agnostic authentication#

If your code needs to use passwords or secret tokens to access data, we recommend sending these to the cluster as environment variables that you can read and use in your code.

import coiled

cluster = coiled.Cluster(...)
cluster.send_private_envs({"AUTH_TOKEN_FOR_CUSTOM_DATABASE": "some-token"})

This method sends the private environment variables directly from your local machine to the cluster over an encrypted connection. They are never stored by Coiled except in memory on your cluster.

In code submitted to the cluster, you could then use os.environ.get("AUTH_TOKEN_FOR_CUSTOM_DATABASE") to retrieve "some-token".
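Since os.environ.get returns None when a variable is missing, a task can fail far from the real cause if the secret was never sent. A small wrapper like the one below (require_secret is a hypothetical name, not part of Coiled) is one way to surface that early. It is a sketch that assumes an empty value should also count as missing.

```python
import os


def require_secret(name, env=None):
    """Return the named secret, raising a clear error if it was never sent."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; was it sent to the cluster?")
    return value
```

In a task you would then write require_secret("AUTH_TOKEN_FOR_CUSTOM_DATABASE") in place of the bare os.environ.get call.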