You can use Coiled to launch Spark clusters.
This is experimental and may be changed or removed without notice.
Spark requires installing a few additional packages:
    conda create -n spark -c conda-forge \
        coiled pyspark==3.4.1 pyarrow pandas \
        grpcio grpcio-status openjdk~=11.0 protobuf
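To confirm that the Python packages from the command above are visible in the active environment, a small check along these lines may help. This is only a sketch: the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of Coiled. `openjdk` is left out because it is not a Python package, and the check assumes each package registers metadata under the same name used in the command.

```python
from importlib.metadata import PackageNotFoundError, version

# Python packages from the install command (openjdk is not a Python package).
# Assumption: each package registers metadata under the same name as above.
REQUIRED = [
    "coiled",
    "pyspark",
    "pyarrow",
    "pandas",
    "grpcio",
    "grpcio-status",
    "protobuf",
]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Run it inside the `spark` environment; any package reported as not installed points to a step that needs repeating.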
Then ask for a cluster and get a remote Spark connection to it:
    import coiled

    cluster = coiled.Cluster(
        n_workers=20,
        region="us-east-2",
    )
    spark = cluster.get_spark()

    # Read cloud parquet data
    df = spark.read.parquet("s3a://coiled-data/uber")
    df.show()
This uses Spark Connect to connect you to a remote Spark cluster. Your driver and executors run in the cloud while you run code from wherever you run Python.
Spark Connect supports the DataFrame and Spark SQL APIs, but not the RDD, Streaming, or Dataset APIs.
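As a rough illustration of the two supported APIs, the sketch below computes the same row count once through the DataFrame API and once through Spark SQL. It assumes `spark` is the session returned by `cluster.get_spark()`. The function name `count_rows_two_ways` and the view name `trips` are hypothetical and chosen only for this example.

```python
def count_rows_two_ways(spark, path, view_name="trips"):
    """Count rows via the DataFrame API and via Spark SQL.

    `spark` is assumed to be the session returned by cluster.get_spark().
    The function name and view name are illustrative, not part of Coiled
    or Spark.
    """
    df = spark.read.parquet(path)

    # DataFrame API: the count runs on the remote cluster.
    n_dataframe = df.count()

    # Spark SQL: register the DataFrame as a temporary view, then query it.
    df.createOrReplaceTempView(view_name)
    rows = spark.sql(f"SELECT COUNT(*) AS n FROM {view_name}").collect()
    n_sql = rows[0]["n"]

    return n_dataframe, n_sql
```

For example, `count_rows_two_ways(spark, "s3a://coiled-data/uber")` should return the same number twice. Both calls go through the Spark Connect session, so only the small results travel back to your local Python process.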