Spark
You can use Coiled to launch Spark clusters.
Note
This is experimental and may be changed or removed without notice.
Example
Deploy a Spark cluster with a remote Spark connection:
import coiled
cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",
    worker_memory="16 GiB",
)
spark = cluster.get_spark()
Then use Spark as normal:
# Read cloud parquet data
df = spark.read.parquet("s3a://coiled-data/uber")
df.show()
This uses Spark Connect to connect you to a remote Spark cluster. The driver and executors run in the cloud, while your code runs wherever you run Python. Spark Connect supports the Spark DataFrame and Spark SQL APIs.
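Because both the DataFrame and Spark SQL APIs are available over Spark Connect, the two can be combined in one workflow. The following is a minimal sketch, not part of the Coiled API: the helper name `query_parquet_with_sql`, the view name `trips`, and the `limit` parameter are hypothetical and chosen only for illustration. It assumes `spark` is a session object like the one returned by `cluster.get_spark()` above.

```python
def query_parquet_with_sql(spark, path, view_name="trips", limit=5):
    """Load Parquet data with the DataFrame API, then query it with Spark SQL.

    `spark` is assumed to be a Spark session (for example, the object
    returned by `cluster.get_spark()`). The helper name and the default
    view name are illustrative, not part of any library.
    """
    # DataFrame API: read the Parquet dataset
    df = spark.read.parquet(path)
    # Register a temporary view so Spark SQL can refer to the data by name
    df.createOrReplaceTempView(view_name)
    # Spark SQL API: return a DataFrame with the first `limit` rows
    return spark.sql(f"SELECT * FROM {view_name} LIMIT {limit}")
```

With a running cluster, calling `query_parquet_with_sql(spark, "s3a://coiled-data/uber").show()` would print the selected rows. The work itself happens on the remote executors, and only the rows being displayed are sent back to your local Python process.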
Install
Deploying Spark on Coiled requires installing a few additional packages:
conda create -n spark -c conda-forge \
    coiled pyspark==3.4.1 pyarrow pandas \
    grpcio grpcio-status openjdk~=11.0 protobuf
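After creating the environment, it can be useful to confirm that the Python packages are visible to the interpreter you plan to use. The sketch below relies only on the standard library's `importlib.metadata`. The helper name `installed_versions` and the `REQUIRED` tuple are hypothetical. `openjdk` is left out because it is not a Python distribution, and the check assumes the conda packages register under the same distribution names.

```python
from importlib.metadata import PackageNotFoundError, version

# Python distributions from the install command above (openjdk is not
# a Python package, so it cannot be checked this way).
REQUIRED = (
    "coiled", "pyspark", "pyarrow", "pandas",
    "grpcio", "grpcio-status", "protobuf",
)


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found
```

Any entry that maps to `None` points to a package missing from the active environment. A `pyspark` version other than 3.4.1 suggests that the wrong environment is active.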
Limitations
Spark Connect supports the DataFrame and Spark SQL APIs, but not the RDD, Streaming, or Dataset APIs.
This is experimental and can be changed or removed without warning.
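Since the RDD API is not available over Spark Connect, row-wise transformations that would otherwise go through an RDD need to be expressed with DataFrame or SQL expressions instead. The following is a rough sketch under that assumption. The helper name `add_doubled_column` and the column name in the usage note are hypothetical and do not come from the example dataset.

```python
def add_doubled_column(df, column):
    """Return `df` with an extra column holding twice the value of `column`.

    Uses a SQL expression through the DataFrame API rather than an RDD
    `map`, so the transformation stays within the supported APIs.
    """
    # "*" keeps every existing column; the second expression adds the new one
    return df.selectExpr("*", f"{column} * 2 AS {column}_doubled")
```

For a DataFrame with a numeric column named `fare` (an assumed name), `add_doubled_column(df, "fare")` would return a new DataFrame with an additional `fare_doubled` column.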