Long-running Lambda workloads: Challenges and alternatives#
2025-07-16
6 min read
AWS Lambda excels at handling short-lived, event-driven tasks, but it quickly runs into limits when your workload needs more than 15 minutes, more than 10 GB of memory, or specialized hardware like GPUs. For data practitioners working with large datasets or long-running processes, these constraints can turn a powerful tool into a frustrating bottleneck. In this article we’ll explore:
Common AWS tools for running long-duration tasks, such as Step Functions, AWS Batch, and containerized workloads on Fargate.
How alternatives like Coiled offer a simpler, more scalable way to run Python in the cloud without the overhead of maintaining cloud infrastructure or managing K8s.
The challenge of long-running Lambda tasks#
AWS Lambda is designed for short-lived, event-driven functions, and for these small tasks it works really well. This includes things like responding to API Gateway requests, parsing log files, and other short-running Python scripts. Because it’s so lightweight and easier to use than other tools in the AWS ecosystem, it’s not uncommon for data practitioners to find themselves in an anti-pattern where they begin to use AWS Lambda for jobs it wasn’t really designed for. This includes things like:
Training long-running machine learning models
ETL pipelines that churn through large volumes of data
Simulations and heavy computations that demand memory, CPU, and time
Any other jobs that exceed the 15-minute timeout, at which point you’ve likely hit the AWS Lambda `Task timed out after 900 seconds` error message
At this point, it’s probably time to use something else.
Better alternatives to AWS Lambda for long-running tasks#
In this post, we’ll go through a number of other options for long-running tasks on AWS:
AWS Fargate for running containerized workloads
AWS Step Functions for chaining together steps that involve a mix of short- and long-running processes (including retry logic, branching, waiting).
AWS Batch for long-running, compute-heavy batch jobs.
Coiled as an alternative to any of the above AWS services for easily running Python (or any language, really) on the cloud
Fargate for serverless containers#
Fargate lets you run containers on demand. Similar to AWS Lambda, Fargate abstracts away infrastructure management. You define the CPU, memory, and runtime, and AWS spins up the infrastructure. It’s great for when you’ve already containerized your workloads and they need more than 15 minutes to complete.
Pros:
No servers to manage
Supports tasks of arbitrary length
Integrates well with ECS, EKS, and AWS Batch
Option to scale up/down
Cons:
Slightly more setup than Lambda
Cold starts can still be an issue for short tasks
If you’re not already using Docker, this can add some additional complexity
No GPU support (you’d need to use EC2)
For example, let’s say a data team needs to process a large hourly batch of raw JSON logs, apply transformation logic, and write to a data lake. The entire workflow runs in ~40 minutes. Instead of trying to chunk the job into multiple Lambdas, they could use Fargate to run a single container.
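A job like this can be launched programmatically with boto3’s `ecs.run_task`. The sketch below builds the request parameters for a one-off Fargate task; the cluster, task definition, and subnet names are hypothetical placeholders for your own resources.

```python
# Sketch of launching a one-off Fargate task via boto3's ecs.run_task().
# All resource names here are hypothetical -- substitute your own.

def build_fargate_run_task_params(cluster, task_definition, subnets):
    """Parameters for ecs.run_task() that launch a single Fargate container."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "ENABLED",
            }
        },
    }

params = build_fargate_run_task_params(
    cluster="etl-cluster",                   # hypothetical cluster name
    task_definition="hourly-log-transform",  # hypothetical task definition
    subnets=["subnet-0abc123"],              # hypothetical subnet ID
)

# With AWS credentials configured, you would then run:
# import boto3
# ecs = boto3.client("ecs")
# response = ecs.run_task(**params)
```

The single container runs until the job finishes, with no 15-minute ceiling.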
Step Functions for workflow orchestration#
Step Functions let you coordinate multiple AWS tools together when logic, retries, and branching are needed.
Pros:
Visual state machine interface
Built-in retries, waits, error handling
Works with both short and long steps
Cons:
Limited to AWS ecosystem
Costs can climb for high-frequency workflows (pricing is based on state transitions)
For example, let’s say you have a data pipeline that extracts data from an API, cleans and joins multiple sources, runs a forecasting model, and stores the results. You could use Step Functions to orchestrate this across Lambdas for light steps and Fargate tasks for heavy lifting.
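A pipeline like this is described in Amazon States Language. Below is a minimal sketch of such a state machine, built as a Python dict: a Lambda task for the light extract step and a synchronous Fargate (ECS) task for the heavy modeling step. The ARNs and resource names are placeholders, not real resources.

```python
import json

# Minimal Amazon States Language sketch of the pipeline described above.
# ARNs, cluster, and task definition names are placeholders.
definition = {
    "StartAt": "ExtractFromAPI",
    "States": {
        "ExtractFromAPI": {
            "Type": "Task",
            # Placeholder Lambda ARN for the light extract step
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
            "Next": "RunForecastModel",
        },
        "RunForecastModel": {
            "Type": "Task",
            # The .sync integration waits for the Fargate task to finish
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "forecast-cluster",       # placeholder
                "TaskDefinition": "forecast-model",  # placeholder
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

The JSON output is what you would pass as the `definition` when creating the state machine; Step Functions then handles the retries and sequencing.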
AWS Batch for large-scale compute jobs#
AWS Batch is designed for submitting, scheduling, and executing a large volume of jobs. It’s often used for compute-heavy or long-running workloads like scientific simulations or model training. It’s especially useful for array jobs, or when you need to run a bunch of independent jobs in parallel with the same definition. It runs on top of Fargate or EC2. Unlike most other AWS services, Batch itself adds no extra charge; you pay only for the underlying compute resources.
Pros:
Fully managed job scheduling and provisioning
Use Spot to reduce cloud costs (though you’ll need to handle interruptions with retries)
Powerful tool for embarrassingly parallel workloads
Cons:
More complex to set up than Lambda or Fargate
Inefficient for low-latency tasks (partially due to scheduler overhead)
Long cold-starts for large container images
A research group might use AWS Batch to run daily simulations that each take 3–5 hours. They can submit hundreds of jobs at once, automatically scaled across hundreds of VMs, to run these in parallel.
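Submitting that kind of fan-out is typically done as an array job with boto3’s `batch.submit_job`. The sketch below builds the request parameters; the queue and job definition names are hypothetical. Each child job receives its index in the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable, which the job’s code can use to pick its input.

```python
# Sketch of submitting an array job via boto3's batch.submit_job().
# Queue and job definition names are hypothetical placeholders.

def build_array_job_params(name, queue, definition, size):
    """Parameters for batch.submit_job() that fan out `size` child jobs."""
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        # Spawns `size` child jobs, each with AWS_BATCH_JOB_ARRAY_INDEX set
        "arrayProperties": {"size": size},
    }

params = build_array_job_params(
    name="daily-simulations",
    queue="simulation-queue",    # hypothetical job queue
    definition="simulation:1",   # hypothetical job definition
    size=100,
)

# With AWS credentials configured:
# import boto3
# batch = boto3.client("batch")
# response = batch.submit_job(**params)
```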
Using Coiled instead of AWS services for long-running workloads#
While AWS tools like Fargate and Batch are powerful, they often introduce friction for data practitioners. You typically need to write Dockerfiles, manage IAM roles, configure queues, and set up infrastructure. For teams used to working in Jupyter notebooks or Python scripts, this overhead can be a major blocker.
Coiled is an easier alternative for running Python on the cloud. It automates the provisioning of EC2 instances in your AWS account, automatically synchronizes your Python packages, and turns things off when you’re done. It’s easy to go from testing interactively in a notebook to scaling out to a cluster of VMs, without needing to build Docker containers or manage Kubernetes.
Feature | AWS Lambda | AWS Fargate | AWS Batch | Coiled
---|---|---|---|---
Max execution time | 15 minutes | No limit | No limit | No limit
Setup complexity | Low | Medium | High | Low
Container image size limit | 10GB | 20GB (200GB w/ extra config) | No limit (w/ EC2) | No limit
Containerization required | No | Yes | Yes | No (package sync or Docker)
Built-in parallelism | ❌ No | ❌ No | ✅ Yes | ✅ Yes (Dask, Coiled Batch, Coiled Functions)
Interactive development | ❌ No | ❌ No | ❌ No | ✅ Yes
Spot instances | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes
GPU support | ❌ No | ❌ No | ✅ Yes (w/ EC2) | ✅ Yes
Scale-to-zero | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes
In the following sections we’ll go through some ways you can use Coiled as an alternative to other AWS services.
Coiled serverless Python functions#
If you have a Python function you’d like to run on the cloud, you can use the `@coiled.function` decorator. Behind the scenes, Coiled will spin up an EC2 instance in your AWS account, run your function, and then turn things off when you’re done. More details in the Coiled Functions docs.
```python
import coiled
import pandas as pd

@coiled.function(
    region="us-west-2",  # Run close to data
)
def process(filename):
    df = pd.read_parquet(filename)  # Read S3 data from EC2
    df = df[df.name == "Alice"]     # Filter data on cloud
    return df                       # Return subset

result = process("s3://my-bucket/data.parquet")  # Runs remotely
print(result)                                    # Runs locally
```
For GPU workloads, request a GPU instance type:

```python
import coiled

@coiled.function(
    vm_type="g6.2xlarge",  # Run on a GPU instance
)
def train():
    import torch
    device = torch.device("cuda")
    ...
    return model

model = train()  # Runs remotely
```
You can also map a function over many inputs to run them in parallel:

```python
import coiled
import pandas as pd

@coiled.function(region="us-west-2")  # Run close to the data
def process(filename):
    output_filename = filename[:-4] + ".parquet"
    df = pd.read_csv(filename)
    df.to_parquet(output_filename)
    return output_filename

# result = process("s3://my-bucket/data.csv")  # one file
results = process.map(filenames)  # many files in parallel
for filename in results:
    print("Finished", filename)
```
Coiled Batch for single jobs to large-scale parallelism#
Coiled Batch is an easy way to run any script on the cloud, and is especially useful for scaling large, independent tasks (i.e., embarrassingly parallel workloads) across hundreds of VMs. You can use Coiled Batch to, for example, run an arbitrary bash script 100 times in parallel.

To run a batch job, add `#COILED` comments to your script to specify the cloud resources you want. The script below spins up ten cloud VMs with 32 GB of memory, each running its own `echo` command.
my_script.sh

```bash
#!/bin/bash

#COILED n-tasks 10
#COILED memory 32GB
#COILED container ubuntu:latest

echo Hello from $COILED_BATCH_TASK_ID
```
Then launch your script with `coiled batch run`:

```bash
$ coiled batch run my_script.sh
```
`COILED_BATCH_TASK_ID` is an identifier unique to each task, which runs from “0”, “1”, “2”, …, “9” in this case.
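A common pattern is to use that task ID to split work across tasks. The sketch below assumes a hypothetical list of input files and ten tasks: each task takes every tenth file, so together they cover the whole list exactly once.

```python
import os

# Hypothetical sketch: split a list of input files across batch tasks
# using the task ID. With ten tasks, each task processes every 10th
# file, so the tasks jointly cover the whole list with no overlap.

filenames = [f"data-{i}.csv" for i in range(95)]  # stand-in for real inputs

task_id = int(os.environ.get("COILED_BATCH_TASK_ID", "0"))
n_tasks = 10

my_files = filenames[task_id::n_tasks]  # this task's slice of the work
for filename in my_files:
    print("Processing", filename)
```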
You can also use `#COILED` comments directly in Python scripts. Drop the container directive to have package sync automatically replicate your local Python environment on the remote VMs.
my_script.py

```python
#COILED n-tasks 10
#COILED memory 8 GiB
#COILED region us-east-2

import os

print(f"Hello from {os.environ['COILED_BATCH_TASK_ID']}")
```
and then launch your script with `coiled batch run`:

```bash
$ coiled batch run my_script.py
```
Under the hood Coiled will:
Inspect your script
Spin up appropriate machines as defined by the `#COILED` comments
Download your software onto them, or use the container you specified
Run your script
Shut down the machines
Next steps#
It’s easy to get started with Coiled:
```bash
$ pip install coiled
$ coiled quickstart
```
Learn more: