Spark on gcloud with Jupyter
In Google Kubernetes Engine Set Up we explained how to install
google-cloud-sdk
and configure a project, and in Google Storage how to use Google Cloud Storage. Here we show
how to run a Spark cluster with gcloud.
For that we use Dataproc.
Create cluster
See https://cloud.google.com/dataproc/docs/quickstarts/quickstart-gcloud
First enable the Dataproc API on https://console.cloud.google.com/flows/enableapi?apiid=dataproc
Set default region
gcloud config set dataproc/region us-east1
Then create a cluster called spark-cluster
with Python 3.6:
gcloud dataproc clusters create spark-cluster \
--image-version=1.4
See https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions for information about the available image versions and the software installed with them.
You can see your cluster at:
https://console.cloud.google.com/dataproc/clusters
or by running
gcloud dataproc clusters list
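Before deleting it you can check that the cluster really runs Spark by submitting a small PySpark job with gcloud. This is only a sketch: hello.py is a hypothetical script created here just for the test, and the region is taken from the dataproc/region setting above.
cat > hello.py <<'EOF'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("hello").getOrCreate()
print(spark.range(1000).count())
EOF
gcloud dataproc jobs submit pyspark hello.py --cluster=spark-cluster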
Delete cluster
gcloud dataproc clusters delete spark-cluster
Create cluster with jupyter notebook
In Google Storage we explained how to configure Google Cloud Storage. Here we create a bucket for storing notebooks.
One can do this with:
gsutil mb gs://bartek-notebooks/
You can list buckets with
gsutil ls
Now we can create the cluster:
gcloud beta dataproc clusters create jupyter-cluster \
--optional-components=ANACONDA,JUPYTER \
--image-version=1.4 \
--enable-component-gateway \
--bucket bartek-notebooks
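You can verify the cluster and its optional components with:
gcloud beta dataproc clusters describe jupyter-cluster
Depending on your gcloud version, the output should also list the component gateway endpoints, including the Jupyter URL.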
Run notebook
Go to
https://console.cloud.google.com/dataproc/clusters
and select your cluster. Then select Web Interfaces and then Jupyter.
Then choose New > PySpark to start a notebook.
On the cluster you have direct access to Google Cloud Storage. You can read, for example, a CSV file
like this:
sdf = spark\
.read.option("header", "true")\
.csv("gs://bucket-name-data/test.csv")
Now you can easily reuse the code from How to access S3 from pyspark; reading from Google Cloud Storage works the same way, you only replace the s3 paths with gs:// paths.
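The path gs://bucket-name-data/test.csv above is only a placeholder. Assuming you have a local test.csv and pick a globally unique bucket name, you can create such a data bucket and upload the file with gsutil before running the cell:
gsutil mb gs://bucket-name-data/
gsutil cp test.csv gs://bucket-name-data/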
When you are done, do not forget to delete the cluster; otherwise you will keep being charged for it.
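For the cluster created above that means, analogously to before:
gcloud dataproc clusters delete jupyter-cluster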
Links
- https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
Updated: 2020-01-02