Long-running Lambda workloads: Challenges and alternatives#
2025-07-16
6 min read
AWS Lambda excels at handling short-lived, event-driven tasks, but it quickly runs into limits when your workload needs more than 15 minutes, more than 10 GB of memory, or specialized hardware like GPUs. For data practitioners working with large datasets or long-running processes, these constraints can turn a powerful tool into a frustrating bottleneck. In this article we’ll explore:
Common AWS tools for running long-duration tasks, such as Step Functions, AWS Batch, and containerized workloads on Fargate.
How alternatives like Coiled offer a simpler, more scalable way to run Python in the cloud without the overhead of maintaining cloud infrastructure or managing K8s.
The challenge of long-running Lambda tasks#
AWS Lambda is designed for short-lived, event-driven functions, and for these small tasks it works really well. This includes things like responding to API Gateway requests, parsing log files, and other short-running Python scripts. Because it’s so lightweight and easier to use than other tools in the AWS ecosystem, it’s not uncommon for data practitioners to find themselves in an anti-pattern where they begin to use AWS Lambda for jobs it wasn’t really designed for. This includes things like:
Training long-running machine learning models
ETL pipelines that churn through large volumes of data
Simulations and heavy computations that demand memory, CPU, and time
Any other jobs that exceed the 15-minute timeout, at which point you’ve likely hit the AWS Lambda `Task timed out after 900 seconds` error message
At this point, it’s probably time to use something else.
Better alternatives to AWS Lambda for long-running tasks#
In this post, we’ll go through a number of other options for long-running tasks on AWS:
AWS Fargate for running containerized workloads
AWS Step Functions for chaining together steps that involve a mix of short- and long-running processes (including retry logic, branching, waiting).
AWS Batch for long-running, compute-heavy batch jobs.
Coiled as an alternative to any of the above AWS services for easily running Python (or any language, really) on the cloud
Fargate for serverless containers#
Fargate lets you run containers on demand. Similar to AWS Lambda, Fargate abstracts away infrastructure management. You define the CPU, memory, and runtime, and AWS spins up the infrastructure. It’s great for when you’ve already containerized your workloads and they need more than 15 minutes to complete.
Pros:
No servers to manage
Supports tasks of arbitrary length
Integrates well with ECS, EKS, and AWS Batch
Option to scale up/down
Cons:
Slightly more setup than Lambda
Cold starts can still be an issue for short tasks
If you’re not already using Docker, this can add some additional complexity
No GPU support (you’d need to use EC2)
For example, let’s say a data team needs to process a large hourly batch of raw JSON logs, apply transformation logic, and write to a data lake. The entire workflow runs in ~40 minutes. Instead of trying to chunk the job into multiple Lambdas, they could use Fargate to run a single container.
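A job like this can be launched programmatically with boto3’s `ecs.run_task`. The sketch below builds the request parameters for a one-off Fargate task; the cluster, task definition, and subnet names are hypothetical placeholders for your own resources.

```python
# Sketch of launching a one-off Fargate task via boto3's ecs.run_task().
# All resource names here are hypothetical -- substitute your own.

def build_fargate_run_task_params(cluster, task_definition, subnets):
    """Parameters for ecs.run_task() that launch a single Fargate container."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "assignPublicIp": "ENABLED",
            }
        },
    }

params = build_fargate_run_task_params(
    cluster="etl-cluster",                   # hypothetical cluster name
    task_definition="hourly-log-transform",  # hypothetical task definition
    subnets=["subnet-0abc123"],              # hypothetical subnet ID
)

# With AWS credentials configured, you would then run:
# import boto3
# ecs = boto3.client("ecs")
# response = ecs.run_task(**params)
```

The single container runs until the job finishes, with no 15-minute ceiling.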
Step Functions for workflow orchestration#
Step Functions let you coordinate multiple AWS tools together when logic, retries, and branching are needed.
Pros:
Visual state machine interface
Built-in retries, waits, error handling
Works with both short and long steps
Cons:
Limited to AWS ecosystem
Costs can climb for high-frequency workflows (pricing is based on state transitions)
For example, let’s say you have a data pipeline that extracts data from an API, cleans and joins multiple sources, runs a forecasting model, and stores the results. You could use Step Functions to orchestrate this across Lambdas for light steps and Fargate tasks for heavy lifting.
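A pipeline like this is described in Amazon States Language. Below is a minimal sketch of such a state machine, built as a Python dict: a Lambda task for the light extract step and a synchronous Fargate (ECS) task for the heavy modeling step. The ARNs and resource names are placeholders, not real resources.

```python
import json

# Minimal Amazon States Language sketch of the pipeline described above.
# ARNs, cluster, and task definition names are placeholders.
definition = {
    "StartAt": "ExtractFromAPI",
    "States": {
        "ExtractFromAPI": {
            "Type": "Task",
            # Placeholder Lambda ARN for the light extract step
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}],
            "Next": "RunForecastModel",
        },
        "RunForecastModel": {
            "Type": "Task",
            # The .sync integration waits for the Fargate task to finish
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "forecast-cluster",       # placeholder
                "TaskDefinition": "forecast-model",  # placeholder
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

The JSON output is what you would pass as the `definition` when creating the state machine; Step Functions then handles the retries and sequencing.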
AWS Batch for large-scale compute jobs#
AWS Batch is designed for submitting, scheduling, and executing a large volume of jobs. It’s often used for compute-heavy or long-running workloads like scientific simulations or model training. It’s especially useful for array jobs, or when you need to run a bunch of independent jobs in parallel with the same definition. It runs on top of Fargate or EC2. Unlike most other AWS services, Batch itself adds no extra charge; you pay only for the underlying compute resources.
Pros:
Fully managed job scheduling and provisioning
Use Spot to reduce cloud costs (though you’ll need to handle interruptions with retries)
Powerful tool for embarrassingly parallel workloads
Cons:
More complex to set up than Lambda or Fargate
Inefficient for low-latency tasks (partially due to scheduler overhead)
Long cold-starts for large container images
A research group might use AWS Batch to run daily simulations that each take 3–5 hours. They can submit hundreds of jobs at once, automatically scaled across hundreds of VMs, to run these in parallel.
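Submitting that kind of fan-out is typically done as an array job with boto3’s `batch.submit_job`. The sketch below builds the request parameters; the queue and job definition names are hypothetical. Each child job receives its index in the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable, which the job’s code can use to pick its input.

```python
# Sketch of submitting an array job via boto3's batch.submit_job().
# Queue and job definition names are hypothetical placeholders.

def build_array_job_params(name, queue, definition, size):
    """Parameters for batch.submit_job() that fan out `size` child jobs."""
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        # Spawns `size` child jobs, each with AWS_BATCH_JOB_ARRAY_INDEX set
        "arrayProperties": {"size": size},
    }

params = build_array_job_params(
    name="daily-simulations",
    queue="simulation-queue",    # hypothetical job queue
    definition="simulation:1",   # hypothetical job definition
    size=100,
)

# With AWS credentials configured:
# import boto3
# batch = boto3.client("batch")
# response = batch.submit_job(**params)
```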
Using Coiled instead of AWS services for long-running workloads#
While AWS tools like Fargate and Batch are powerful, they often introduce friction for data practitioners. You typically need to write Dockerfiles, manage IAM roles, configure queues, and set up infrastructure. For teams used to working in Jupyter notebooks or Python scripts, this overhead can be a major blocker.
Coiled is an easier alternative for running Python on the cloud. It automates the provisioning of EC2 instances in your AWS account, automatically synchronizes your Python packages, and turns things off when you’re done. It’s easy to go from testing interactively in a notebook to scaling out to a cluster of VMs, without needing to build Docker containers or manage Kubernetes.
Feature | AWS Lambda | AWS Fargate | AWS Batch | Coiled
---|---|---|---|---
Max execution time | 15 minutes | No limit | No limit | No limit
Setup complexity | Low | Medium | High | Low
Container image size limit | 10GB | 20GB (200GB w/ extra config) | No limit (w/ EC2) | No limit
Containerization required | No | Yes | Yes | No (package sync or Docker)
Built-in parallelism | ❌ No | ❌ No | ✅ Yes | ✅ Yes (Dask, Coiled Batch, Coiled Functions)
Interactive development | ❌ No | ❌ No | ❌ No | ✅ Yes
Spot instances | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes
GPU support | ❌ No | ❌ No | ✅ Yes (w/ EC2) | ✅ Yes
Scale-to-zero | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes
In the following sections we’ll go through some ways you can use Coiled as an alternative to other AWS services.
Coiled serverless Python functions#
If you have a Python function you’d like to run on the cloud, you can use the `@coiled.function` decorator. Behind the scenes, Coiled will spin up an EC2 instance in your AWS account, run your function, and then turn things off when you’re done. More details in the Coiled Functions docs.
```python
import coiled
import pandas as pd

@coiled.function(
    region="us-west-2",  # Run close to data
)
def process(filename):
    df = pd.read_parquet(filename)  # Read S3 data from EC2
    df = df[df.name == "Alice"]     # Filter data on cloud
    return df                       # Return subset

result = process("s3://my-bucket/data.parquet")  # Runs remotely
print(result)                                    # Runs locally
```
For GPU workloads, request a GPU instance type:

```python
import coiled

@coiled.function(
    vm_type="g6.2xlarge",  # Run on a GPU instance
)
def train():
    import torch
    device = torch.device("cuda")
    ...
    return model

model = train()  # Runs remotely
```
You can also map a function over many inputs to run them in parallel:

```python
import coiled
import pandas as pd

@coiled.function(region="us-west-2")  # Run close to the data
def process(filename):
    output_filename = filename[:-4] + ".parquet"
    df = pd.read_csv(filename)
    df.to_parquet(output_filename)
    return output_filename

# result = process("s3://my-bucket/data.csv")  # one file
results = process.map(filenames)  # many files in parallel
for filename in results:
    print("Finished", filename)
```
Coiled Batch for single jobs to large-scale parallelism#
Coiled Batch is an easy way to run any script on the cloud, and is especially useful for scaling large, independent tasks (i.e., embarrassingly parallel workloads) across hundreds of VMs. You can use Coiled Batch to, for example, run an arbitrary bash script 100 times in parallel.

To run a batch job, add `#COILED` comments to your script to specify the cloud resources you want. The script below spins up ten cloud VMs with 32 GB of memory, each running its own `echo` command.
my_script.sh

```bash
#!/bin/bash

#COILED n-tasks 10
#COILED memory 32GB
#COILED container ubuntu:latest

echo Hello from $COILED_BATCH_TASK_ID
```
Then launch your script with `coiled batch run`:

```bash
$ coiled batch run my_script.sh
```
`COILED_BATCH_TASK_ID` is an identifier unique to each task, which runs from “0”, “1”, “2”, …, “9” in this case.
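A common pattern is to use that task ID to split work across tasks. The sketch below assumes a hypothetical list of input files and ten tasks: each task takes every tenth file, so together they cover the whole list exactly once.

```python
import os

# Hypothetical sketch: split a list of input files across batch tasks
# using the task ID. With ten tasks, each task processes every 10th
# file, so the tasks jointly cover the whole list with no overlap.

filenames = [f"data-{i}.csv" for i in range(95)]  # stand-in for real inputs

task_id = int(os.environ.get("COILED_BATCH_TASK_ID", "0"))
n_tasks = 10

my_files = filenames[task_id::n_tasks]  # this task's slice of the work
for filename in my_files:
    print("Processing", filename)
```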
You can also use `#COILED` comments directly in Python scripts. Drop the container directive to have package sync automatically replicate your local Python environment on the remote VMs.
my_script.py

```python
#COILED n-tasks 10
#COILED memory 8 GiB
#COILED region us-east-2

import os

print(f"Hello from {os.environ['COILED_BATCH_TASK_ID']}")
```
and then launch your script with `coiled batch run`:

```bash
$ coiled batch run my_script.py
```
Under the hood Coiled will:
Inspect your script
Spin up appropriate machines as defined by the `#COILED` comments
Download your software onto them, or use the container you specified
Run your script
Shut down the machines
Next steps#
It’s easy to get started with Coiled:
```bash
$ pip install coiled
$ coiled quickstart
```
Learn more: