Easy Heavyweight Serverless Functions

tl;dr

What is the easiest way to run Python code in the cloud, especially for compute jobs?

We briefly compare common options (Lambda, EC2, Fargate, Modal) and then pitch a new contender: Coiled Run.

Motivation to Run Python Code in the Cloud

We want to run jobs in the cloud for a few common reasons:

  1. Data proximity: Our data lives in the cloud, and we want to process it where it lives to avoid costly egress charges

  2. Speed and Scale: We have many things to process and don’t want to wait

  3. Hardware access: We want GPUs or big-memory machines

  4. Always-on Coordination: We want to respond to actions in the real world, like in a web server or in response to data arriving

This post focuses on the first three use cases, which require more heavyweight computing. As an example, maybe we want to load, manipulate, and store many files living in cloud storage:

import pandas as pd
import s3fs

s3 = s3fs.S3FileSystem()

def process(filename):                         # Expensive function
    df = pd.read_csv("s3://" + filename)       # glob returns paths without the scheme
    df = df[df.value > 0]
    df.to_parquet("s3://" + filename[:-4] + ".parquet")

filenames = s3.glob("s3://my-bucket/*.csv")    # Lots of files
for filename in filenames:                     # Run many times
    process(filename)

Let’s see what kinds of options we have.

Cloud Infrastructure to Run Single Jobs

In general we have two classes of options:

  1. Big VMs (EC2, ECS), which give you a whole machine to yourself

    • Pros:

      • Hardware flexibility: you can use any hardware available in the cloud

      • Cheap: typically $0.05 per CPU-hour

    • Cons:

      • Annoying to set up / inaccessible

      • Takes at least a minute to turn on (but more likely an hour of human time)

  2. Serverless technologies (AWS Lambda), which run a function for you, hiding the VM

    • Pros:

      • Much easier to use

      • Turns on quickly (seconds)

    • Cons:

      • Easier, but not yet easy. These are still kinda annoying to set up.

      • Hardware limitations

      • Expensive: typically $0.20 per CPU-hour, and you can’t use Spot instances

Most people I see using the cloud for ad-hoc computing choose VMs. They spin up a big instance, use it for a while, and then (hopefully) spin it down. People who start with serverless tend to move over to VMs, usually due to some flexibility limitation. AWS Lambda is often great, unless any of the following are true:

  • You need a big machine

  • You want accelerated hardware like GPUs

  • You have a large software environment

  • Your functions last tens of minutes

  • Your functions last more than a minute, and you’re cost-conscious

Third party orchestrators

Products like Argo, Prefect, Airflow, Coiled+Dask (us!), Kubeflow, Anyscale+Ray, Dagster, and more will also happily run jobs for you, but only as part of a broader abstraction. If you just want to run a single job, they’re all overkill.

The best product experience I’ve seen in this space is Modal Labs. It’s pretty slick. Like the others, though, it still has non-trivial ceremony. It also has some other limitations: you have to run in their cloud accounts (so data privacy is a concern), it’s only in one region (so data egress costs can be non-trivial), and it costs $0.20 per CPU-hour (a 4x markup). Generally a great product experience though, and we hear only positive things. I encourage people to check them out.

Coiled Run

So now we get to our new solution. There are two APIs, which look like this:

  • CLI:

    $ coiled run echo "Hello, world"
    
  • Python:

    import coiled
    
    @coiled.run(memory="512 GiB", region="us-west-2")
    def f():
       return "Hello, world!"
    
    f()
    

Coiled was originally built to deploy distributed Dask clusters in the cloud, a harder problem than serverless execution. However, the infrastructure we built for Dask transfers over surprisingly well. We started this effort because we saw Coiled users spinning up single-node Dask clusters (rather than hundred-node clusters) to run a single task (rather than millions). We took the hint and built these new APIs on top of the existing infrastructure. Here you go:

Run a script

The easiest thing is to run a standalone script in the cloud. Here we run a script in Frankfurt on a machine with 128 cores.

$ coiled run python myscript.py --region eu-central-1 --vm-type m6i.32xlarge

Optionally, specify other useful flags:

$ coiled run python myscript.py \
    --region us-east-1 \
    --vm-type m6i.32xlarge

$ coiled run python pytorch-train.py \
    --region us-west-1 \
    --vm-type g5.2xlarge \
    --container ...    # g5.2xlarge is a GPU instance type

It doesn’t even need to be a Python script:

$ coiled run echo "Hello, world"

Coiled creates a VM, synchronizes all of your local software packages, cloud credentials, files, etc., runs your script, and then shuts down the VM.

Here is a more fully worked example processing parquet data on S3.
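
As a minimal sketch of what such a script might look like (hypothetical: the script name process_parquet.py, the my-bucket bucket, and the value column are placeholders, and it assumes pandas and s3fs are installed):

import pandas as pd
import s3fs

s3 = s3fs.S3FileSystem()

for path in s3.glob("s3://my-bucket/raw/*.parquet"):
    df = pd.read_parquet("s3://" + path)        # read each file from S3
    df = df[df.value > 0]                       # some expensive transformation
    df.to_parquet("s3://my-bucket/clean/" + path.split("/")[-1])

We might then launch it close to the data with something like:

$ coiled run python process_parquet.py --region us-east-1 --vm-type m6i.32xlarge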

Run a single function

Alternatively, you can run your Python code locally, but decorate individual functions to run remotely, as with my_function below.

import coiled

@coiled.run(
    memory="256 GiB",
    region="us-east-1",
)
def my_function(...):
    return ...

result = my_function(*args, **kwargs)
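
As a hedged sketch, here is the CSV-processing loop from the start of this post with the expensive function dispatched to a cloud VM. The memory value and the my-bucket bucket are illustrative, not prescriptive:

import coiled
import pandas as pd
import s3fs

@coiled.run(memory="64 GiB", region="us-east-1")
def process(filename):
    # Runs on a VM in the cloud, close to the data
    df = pd.read_csv("s3://" + filename)
    df = df[df.value > 0]
    df.to_parquet("s3://" + filename[:-4] + ".parquet")

s3 = s3fs.S3FileSystem()
for filename in s3.glob("s3://my-bucket/*.csv"):
    process(filename)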

Here are some more fully worked examples.

Pros and Cons of Coiled Run

  • Coiled Run: Accidental, but easy and efficient cloud functions

    • Pros:

      • Easiest API we can find

      • Any hardware / architecture (uses EC2 instances)

      • Runs securely in your account

      • Runs in any region, close to your data

      • Base cloud cost (uses EC2 instances)

    • Cons:

      • Minute-long start times for your first function call (but subsequent calls have only millisecond delays)

      • Third-party service (requires separate sign-up/accounting)

This essentially offers the flexibility and cost efficiency of EC2 instances, but without the setup pain.

We think the UX is smooth, mostly because it inherits features we built for Dask clusters, like the following:

  1. Automatic package synchronization: we transmit the same package versions you have locally to your cloud workers

  2. Credential forwarding: we securely forward your AWS permissions to remote workers (safely using AWS STS); see the sketch after this list

  3. Cost/user management controls: we help you constrain costs on a per-user basis

  4. General progress / UX polish: things just look nice.
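
As a minimal illustration of point 2 (the function name is hypothetical, and boto3 is assumed installed), a decorated function can call AWS APIs remotely with no extra credential setup:

import boto3
import coiled

@coiled.run(region="us-east-1")
def whoami():
    # Forwarded STS credentials mean boto3 just works on the remote VM
    return boto3.client("sts").get_caller_identity()["Arn"]

print(whoami())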

Opportunity

We’re excited about Coiled functions because they are …

  • Easy to use for Python developers unfamiliar with cloud computing

  • Flexible, with the ability to use any hardware in any region

  • Cheap, being backed by foundational technologies like EC2 Fleets.

Dask is super-powerful (far more powerful than what’s described in this post), but there are lots of folks for whom Dask is overkill. We hope that these simpler interfaces are a good on-ramp to data-proximate and parallel cloud computing.

Interested in playing? The following should work if you have cloud credentials:

$ pip install coiled
$ coiled setup    # this will connect Coiled to your cloud account
$ coiled run echo "Hello, world"

For more on how to get started, see the Coiled documentation.