Spark#
You can use Coiled to launch Spark clusters.
Note
This is experimental and may be changed or removed without notice.
Example#
Deploy a Spark cluster with a remote Spark connection:
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",
    worker_memory="16 GiB",
)
spark = cluster.get_spark()
Then use Spark as normal:
from pyspark.sql.functions import count, hour
# Read cloud parquet data
df = spark.read.parquet("s3a://coiled-data/uber")
# Query data
df = df.withColumn("pickup_hour", hour("pickup_datetime"))
pickup_hour = df.groupBy("pickup_hour").agg(count("*").alias("num_trips")).toPandas()
# Plot results
pickup_hour.sort_values("pickup_hour").plot.bar(x="pickup_hour", y="num_trips")
This uses Spark Connect to connect you to a remote Spark cluster. Your driver and executors run in the cloud while you run code from wherever you run Python. Spark Connect supports the Spark DataFrame and Spark SQL APIs.
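For example, because Spark Connect supports the Spark SQL API, you can register the DataFrame from the example above as a temporary view and run the same hourly aggregation with SQL. This is a minimal sketch; the view name trips is illustrative, and the column names assume the Uber dataset used earlier.

# Register the DataFrame as a temporary view (view name is illustrative)
df.createOrReplaceTempView("trips")

# Same hourly trip count, expressed in Spark SQL
pickup_hour_sql = spark.sql(
    """
    SELECT pickup_hour, COUNT(*) AS num_trips
    FROM trips
    GROUP BY pickup_hour
    ORDER BY pickup_hour
    """
).toPandas()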
Install#
Deploying Spark on Coiled requires installing a few additional packages:
conda create -n spark -c conda-forge \
    coiled "dask>=2023.12.0" "pyspark==3.4.1" \
    pyarrow pandas grpcio grpcio-status \
    "openjdk~=11.0" protobuf
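After activating the environment, a quick way to sanity-check the setup is to confirm that the key packages import and report the expected versions. This is an optional sketch; it only checks Python package versions and does not verify the Java installation.

# Sanity check for the Spark Connect client dependencies
import coiled
import dask
import pyspark

print("dask:", dask.__version__)        # expect >= 2023.12.0
print("pyspark:", pyspark.__version__)  # expect 3.4.1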
Limitations#
Spark Connect supports the DataFrame and Spark SQL APIs, but not the RDD, Streaming, or Dataset APIs.
This is experimental and may be changed or removed without warning.