Spark
You can use Coiled to launch Spark clusters.
Note
Spark support is experimental and may be changed or removed without notice.
Install
Spark requires installing a few additional packages:
conda create -n spark -c conda-forge \
    coiled pyspark==3.4.1 pyarrow pandas \
    grpcio grpcio-status openjdk~=11.0 protobuf
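After creating and activating the environment, a short script along these lines can confirm that the packages are visible to Python. This is a minimal sketch, not part of Coiled: the helper name `check_spark_env` is hypothetical, and it assumes each package registers its metadata under the same name used in the install command.

```python
import shutil
from importlib import metadata

# Python distributions from the install command (names assumed to match
# what each package registers in its metadata).
REQUIRED = ("coiled", "pyspark", "pyarrow", "pandas",
            "grpcio", "grpcio-status", "protobuf")


def check_spark_env(packages=REQUIRED):
    """Return {name: version or None} for each package, plus a 'java' entry."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    # openjdk is not a Python package, so look for the `java` binary instead.
    report["java"] = shutil.which("java")
    return report


if __name__ == "__main__":
    for name, found in check_spark_env().items():
        print(f"{name:15} {found or 'MISSING'}")
```

Any entry reported as `MISSING` suggests the environment was not activated or the install did not complete.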
Example
Then create a cluster and request a remote Spark connection from it:
import coiled
cluster = coiled.Cluster(
    n_workers=20,
    region="us-east-2",
)
spark = cluster.get_spark()
# Read cloud parquet data
df = spark.read.parquet("s3a://coiled-data/uber")
df.show()
This uses Spark Connect to connect you to a remote Spark cluster. The driver and executors run in the cloud, while your code runs wherever you run Python. Spark Connect supports the Spark DataFrame and Spark SQL APIs.
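To illustrate the two supported APIs side by side, here is a minimal sketch that counts rows once through the DataFrame API and once through Spark SQL. The function name `explore` and the view name `uber` are hypothetical, chosen for illustration. The queries avoid column names on purpose, since the dataset's schema is not shown here.

```python
def explore(spark, path="s3a://coiled-data/uber", view="uber"):
    """Run one DataFrame query and one SQL query against a Parquet dataset."""
    df = spark.read.parquet(path)

    # DataFrame API: print the schema on the client, then count rows remotely.
    df.printSchema()
    n_rows = df.count()

    # Spark SQL: register the DataFrame as a temporary view, then query it.
    df.createOrReplaceTempView(view)
    n_rows_sql = spark.sql(f"SELECT COUNT(*) AS n FROM {view}").collect()[0]["n"]

    return n_rows, n_rows_sql
```

With the `spark` session returned by `cluster.get_spark()`, calling `explore(spark)` should return the same count twice, since both queries are executed by the remote cluster and only the results travel back to your Python process.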
Limitations
Spark Connect supports the DataFrame and Spark SQL APIs, but not the RDD, Streaming, or Dataset APIs.
This is experimental and can be changed or removed without warning.