Spark

You can use Coiled to launch Spark clusters.

Note

This is experimental and may be changed or removed without notice.

Example

Deploy a Spark cluster with a remote Spark connection:

import coiled

# Start a Coiled cluster in your cloud account
cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",
    worker_memory="16 GiB",
)

# Get a Spark session connected to that cluster
spark = cluster.get_spark()

Then use Spark as normal:

# Read Parquet data from S3
df = spark.read.parquet("s3a://coiled-data/uber")
df.show()

This uses Spark Connect to connect you to a remote Spark cluster. Your driver and executors run in the cloud, while your code runs from wherever you run Python. Spark Connect supports the Spark DataFrame and Spark SQL APIs.
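For example, because the Spark SQL API is available over Spark Connect, you can register the DataFrame above as a temporary view and query it with SQL (a minimal sketch; the "uber" view name is illustrative):

# Register the DataFrame as a temporary view, then query it with Spark SQL
df.createOrReplaceTempView("uber")
spark.sql("SELECT COUNT(*) AS n_rows FROM uber").show()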

Install

Deploying Spark on Coiled requires installing a few additional packages:

conda create -n spark -c conda-forge \
    coiled pyspark==3.4.1 pyarrow pandas \
    grpcio grpcio-status openjdk~=11.0 protobuf
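Then activate the new environment before running the example above:

conda activate spark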

Limitations

  • Spark Connect supports the DataFrame and Spark SQL APIs, but not the RDD, Streaming, or Dataset APIs.

  • This is experimental and can be changed or removed without warning.