Access Remote Data#
You’ll probably need your Dask workers to process private data (e.g. in S3), so you will need a way for those workers to authenticate.
In most cases, Coiled handles this for you and you can run the same code on your cluster that you do locally.
AWS#
Suppose you have an object in an S3 bucket s3://david-auth-example/hello.txt
that’s only accessible from your AWS account.
You might have code like this that you run locally to read some data from S3:
import boto3

def read_object():
    s3 = boto3.client("s3")
    data = s3.get_object(Bucket="david-auth-example", Key="hello.txt")
    return data["Body"].read()

text = read_object()
When this code runs on your local machine, it’s using the AWS credentials you have in your local environment. What would you need to do to make this code work on a Coiled cluster?
Personal STS tokens#
Our goal is to make the transition from running locally to running on a cluster as seamless as possible, so by default Coiled uses your local credentials to generate temporary STS tokens (encrypted in transit, refreshed as needed) which we make available to your cluster. Anything that relies on standard tools like boto3
will use these credentials automatically. (In more detail, the credentials are picked up by the “Container credential provider” in the AWS credential provider chain.)
The identity we use for this is whichever one boto3 on your local machine uses by default. For example, it could come from environment variables or a ~/.aws/credentials file.
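For example, a ~/.aws/credentials file with a default profile might look like this (the values below are placeholders):

[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>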
If you want to check which identity Coiled will use by default, you can run aws sts get-caller-identity
(if you have the AWS CLI installed), or in Python:
import boto3
sts = boto3.client("sts")
sts.get_caller_identity()
If the read_object
function above worked locally, it will work on the cluster too.
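For example, here is a minimal sketch of running the same function on a Coiled cluster (the cluster options below are placeholders; any cluster will do):

import coiled

cluster = coiled.Cluster(n_workers=2)
client = cluster.get_client()

# The workers pick up the temporary STS credentials automatically,
# so the unmodified function reads from S3 just as it does locally.
future = client.submit(read_object)
print(future.result())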
You can turn off this default by specifying coiled.Cluster(..., credentials=None)
when you start your cluster.
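For example:

import coiled

# Start a cluster without forwarding your local AWS credentials
cluster = coiled.Cluster(credentials=None)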
GCP#
Suppose you have an object in a Google Storage bucket that’s only accessible using your Google Cloud credentials.
You might have code like this that locally reads the file from Google Storage:
from google.cloud import storage

def read_object():
    storage_client = storage.Client(project="my-gcp-project")
    bucket = storage_client.bucket("my-private-bucket")
    return storage.Blob("hello.txt", bucket).download_as_string()

text = read_object()
Or maybe you explicitly set credentials to use in your code. What would you need to do to make this code work on a Coiled cluster?
Personal OAuth2 tokens#
Our goal is to make the transition from running locally to running in a cluster as seamless as possible, so by default Coiled uses your local Google Application Default Credentials (if set) to create a temporary OAuth2 token (encrypted in transit) that we send to the cluster.
This allows you to use the same Google data-access permissions locally and in the cloud.
There are three required steps:

1. Configure your local Application Default Credentials
2. Install the google-cloud-iam library so Coiled can make an OAuth2 token from your local Application Default Credentials
3. Use the OAuth2 token in code running on your cluster
The easiest way to get Application Default Credentials is by installing gcloud
and running gcloud auth application-default login
:
conda install -c conda-forge google-cloud-sdk
gcloud auth application-default login
(See the Google doc on User credentials provided by using the gcloud CLI for more details).
For Coiled to automatically create temporary OAuth2 tokens from your Application Default Credentials and ship those to your cluster,
you’ll need to install google-cloud-iam
in the local software environment from which you’re using the coiled
client to start your cluster.
conda install google-cloud-iam # or pip install google-cloud-iam
The OAuth2 token is shipped (securely) to your cluster and set as an environment variable. You’ll need to explicitly modify your code to use this forwarded token on the cluster.
There’s a CoiledShippedCredentials
class in the coiled
package that subclasses the Google Credentials
class
and automatically handles using and refreshing the shipped OAuth2 token.
The Dask DataFrame API accepts a storage_options
kwarg, which you can use to specify the credentials:
from coiled.credentials.google import CoiledShippedCredentials

# the dataframe you want to write to Google Cloud Storage
df = ...

# write dataframe
df.to_parquet(
    "gs://gcs-bucket-name-goes-here/path/goes/here",
    # explicitly pass credentials using storage_options
    storage_options={"token": CoiledShippedCredentials()},
)
If you’re using gcsfs
, you can directly pass the token like this:
from coiled.credentials.google import CoiledShippedCredentials
import gcsfs

fs = gcsfs.GCSFileSystem(
    "my-google-cloud-project",
    # explicitly pass credentials to GCSFS
    token=CoiledShippedCredentials(),
)

# do something with the filesystem
fs.ls("my-bucket")
CoiledShippedCredentials
subclasses the google.oauth2.credentials.Credentials
class, so it can be used directly with Google Python libraries:
from coiled.credentials.google import CoiledShippedCredentials
from google.cloud import storage

def read_object():
    storage_client = storage.Client(
        project="my-gcp-project",
        # explicitly pass credentials to storage client
        credentials=CoiledShippedCredentials(),
    )
    bucket = storage_client.bucket("my-private-bucket")
    return storage.Blob("hello.txt", bucket).download_as_string()

text = read_object()
Long-lived (revocable) Application Default Credentials#
If you’re comfortable putting longer-lived credentials on your cloud VMs, Coiled can forward your local Application Default Credentials to the cluster.
The advantage of this is that they’ll automatically be used by libraries like gcsfs
and GDAL
, without you needing to explicitly pass the credentials to those libraries.
The disadvantage is that, if these credentials were compromised, they’re longer-lived than the temporary OAuth2 tokens. The credentials are sent securely over an encrypted connection directly to your cluster (they do not pass through the Coiled control plane), and you can manually revoke them if you have any concern that they’ve been exposed.
There are three required steps:

1. Configure your local Application Default Credentials
2. Install the google-auth library so Coiled can retrieve your local Application Default Credentials
3. Send the Application Default Credentials to your Coiled cluster
The easiest way to get Application Default Credentials is by installing gcloud
and running gcloud auth application-default login
:
conda install -c conda-forge google-cloud-sdk
gcloud auth application-default login
(See the Google doc on User credentials provided by using the gcloud CLI for more details).
To send the Application Default Credentials to your Coiled cluster, run this:
import coiled
from coiled.credentials.google import send_application_default_credentials
# create a Coiled cluster, for example...
cluster = coiled.Cluster(...)
# send Application Default Credentials
send_application_default_credentials(cluster)
# Now gcsfs will automatically pick up these credentials.
# For example, `pandas.read_csv("gs://...")` on cluster should work,
# assuming that it works locally with your local Application Default Credentials.
Should you wish to revoke the credentials, you can do this locally
(from the same machine where you ran send_application_default_credentials(cluster)
):
gcloud auth application-default revoke
Service Account#
If you’d rather not use locally configured Application Default Credentials, you can also grant data access using a Google Cloud Service Account.
When you configure Coiled to use your Google Cloud account (coiled setup gcp
), we ask for two service accounts: one which Coiled uses to run instances on your behalf, and a data access service account which controls what permissions those instances have.
You can give your workers permissions within Google Cloud by assigning permissions to that data access service account.
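For example, to let cluster instances read objects from a private bucket, you might grant that data access service account a read role on the bucket. This is just a sketch; the service account email and bucket name below are placeholders:

gcloud storage buckets add-iam-policy-binding gs://my-private-bucket \
    --member="serviceAccount:my-data-access-sa@my-gcp-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"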
Service-agnostic authentication#
If your code needs to use passwords or secret tokens to access data, we recommend sending these to the cluster as environment variables that you can read and use in your code.
cluster = coiled.Cluster(...)
cluster.send_private_envs({"AUTH_TOKEN_FOR_CUSTOM_DATABASE": "some-token"})
This method sends the private environment variables directly from your local machine to the cluster over an encrypted connection; they are never stored by Coiled except in memory on your cluster.
In code submitted to the cluster, you could then use os.environ.get("AUTH_TOKEN_FOR_CUSTOM_DATABASE")
to retrieve "some-token"
.
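For example, a function submitted to the cluster could read the variable like this (a minimal sketch, reusing the cluster and token from the snippet above):

import os

def connect_to_database():
    # Read the secret that was sent with send_private_envs
    token = os.environ.get("AUTH_TOKEN_FOR_CUSTOM_DATABASE")
    # ... use the token to authenticate against your database ...
    return token is not None

client = cluster.get_client()
print(client.submit(connect_to_database).result())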