In Google Kubernetes Engine Set Up we explained how to install google-cloud-sdk and configure a project, and in Google Storage we explained how to use Google Storage. Here we show how to run a Spark cluster with gcloud. For that we have Dataproc.

Create cluster

See https://cloud.google.com/dataproc/docs/quickstarts/quickstart-gcloud

First enable the Dataproc API on https://console.cloud.google.com/flows/enableapi?apiid=dataproc
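If you prefer the command line, the API can also be enabled with gcloud:

gcloud services enable dataproc.googleapis.com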

Set default region

gcloud config set dataproc/region us-east1
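You can verify the setting with:

gcloud config get-value dataproc/region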

Then create a cluster called spark-cluster with Python 3.6:

gcloud dataproc clusters create spark-cluster \
    --image-version=1.4

See https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions for information about available versions and the software installed with each.
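If you need more control over the cluster size, you can also pass worker count and machine type flags; the values below are only an example:

gcloud dataproc clusters create spark-cluster \
    --image-version=1.4 \
    --num-workers=2 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2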

You can see your cluster at:

https://console.cloud.google.com/dataproc/clusters

or by running

gcloud dataproc clusters list
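To inspect a single cluster in more detail (zone, machine types, status), run:

gcloud dataproc clusters describe spark-cluster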

Delete cluster

gcloud dataproc clusters delete spark-cluster
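The command asks for confirmation; in scripts you can skip the prompt with the global --quiet flag:

gcloud dataproc clusters delete spark-cluster --quiet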

Create cluster with jupyter notebook

In Google Storage we have explained how to configure google storage. Here we create a bucket for storing notebooks.

One can do this with:

gsutil mb gs://bartek-notebooks/
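By default the bucket is created in the US multi-region; if you want it next to the cluster, you can pass a location explicitly (us-east1 here matches the region set above):

gsutil mb -l us-east1 gs://bartek-notebooks/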

You can list buckets with

gsutil ls
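If you want some data to play with later, you can copy a local CSV file (test.csv here is just a placeholder name) into a bucket with:

gsutil cp test.csv gs://bartek-notebooks/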

Now we can create the cluster:

gcloud beta dataproc clusters create jupyter-cluster \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=1.4 \
    --enable-component-gateway \
    --bucket bartek-notebooks
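Because --enable-component-gateway is set, the Jupyter UI gets a proxied URL. We open it from the console in the next step, but you should also be able to read it from the cluster description; the exact field path below is an assumption about the describe output:

gcloud dataproc clusters describe jupyter-cluster --format='value(config.endpointConfig.httpPorts)'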

Run notebook

Go to https://console.cloud.google.com/dataproc/clusters and select your cluster. Then select Web Interfaces and then Jupyter. Then choose New and pick PySpark.

On the cluster you have direct access to Google Cloud Storage. You can read, for example, a CSV file like this:

sdf = spark\
    .read.option("header", "true")\
    .csv("gs://bucket-name-data/test.csv")
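A minimal sketch of what you can do next, assuming the bucket name from above and an output path of your choosing:

# check what was read
sdf.printSchema()
print(sdf.count())

# write the result back to the bucket
sdf.write.mode("overwrite").csv("gs://bucket-name-data/output/")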

Now you can easily modify the code from How to access S3 from pyspark.

At the end do not forget to delete the cluster; otherwise you will keep being charged.
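For the cluster created above that is:

gcloud dataproc clusters delete jupyter-cluster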

  • https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook

Updated: 2020-01-02