How to install pyspark
Note: We will assume here that you are using Ubuntu with bash, so the default shell
configuration file is ~/.bashrc. If you are using a Mac with the standard configuration, you would
need to use ~/.bash_profile instead. And finally, if you are using zsh, you most likely know what
to do (use ~/.zshrc or ~/.zshenv).
Download Spark
First you need to download Spark from
https://spark.apache.org/downloads.html
and save it wherever you want. I’ll download it to ~/programs and then unpack it there. You can
do this, for example, by calling:
cd ~/programs
tar zxvf spark-2.4.0-bin-hadoop2.7.tgz
rm spark-2.4.0-bin-hadoop2.7.tgz
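By the way, you can also do the download itself from the command line. A minimal example, assuming the archive is still available under the usual Apache archive layout (adjust the version and URL to the release you actually want):
cd ~/programs
# download the Spark 2.4.0 / Hadoop 2.7 archive (URL is an assumption, check the downloads page)
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz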
Then I will create a symbolic link to it by calling
ln -s ~/programs/spark-2.4.0-bin-hadoop2.7 ~/programs/spark
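The symlink is not required, but it makes upgrades easier: the configuration below always points to ~/programs/spark, so when a new Spark version comes out you only need to re-point the link. For example (the version number here is just an illustration):
# unpack the new version next to the old one, then re-point the link
ln -sfn ~/programs/spark-2.4.3-bin-hadoop2.7 ~/programs/spark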
In ~/.bashrc you need to add the following lines:
export SPARK_HOME="$HOME/programs/spark"
export PATH=$SPARK_HOME/bin:$PATH
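For the changes to take effect in the current terminal, reload the file; then you can check that the Spark binaries are on your PATH:
source ~/.bashrc
# should print the Spark version banner
spark-submit --version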
Pyspark
We assume that you are using pyenv. Instructions to install it are here:
http://bartek-blog.github.io/python/virtualenv/2018/08/18/Pyenv-and-VirtualEnvs.html
So let’s create an environment for pyspark.
pyenv shell 3.6.8
mkvirtualenv py3.6-spark
pip install pyspark jupyter
Now you can test whether you can enter the pyspark shell by simply running pyspark.
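If you just want a quick sanity check without entering the shell, something like this should be enough:
# verify that the pyspark package is importable in this virtualenv
python -c "import pyspark; print(pyspark.__version__)"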
Mac
On Mac we can additionally install numpy, scipy and scikit-learn from Intel.
pip install intel-numpy intel-scipy intel-scikit-learn pandas
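If you want to confirm that numpy is really using the Intel (MKL) build, one way is to print its build configuration:
# show which BLAS/LAPACK libraries numpy was linked against
python -c "import numpy; numpy.show_config()"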
Ubuntu
Unfortunately, they do not work well with Ubuntu and pyspark, so we should install:
pip install pyspark jupyter pandas numpy scipy scikit-learn jupyter_contrib_nbextensions
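A quick way to check that everything installed correctly is to import the main packages:
# all of these should import without errors inside the py3.6-spark environment
python -c "import pyspark, pandas, numpy, scipy, sklearn; print('ok')"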
Jupyter
If you want to launch jupyter directly when running pyspark, you can add the following lines
to ~/.bashrc:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
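After reloading ~/.bashrc, running pyspark will start a Jupyter notebook server instead of the plain shell; inside a notebook the SparkSession should already be available as spark (and the SparkContext as sc):
source ~/.bashrc
# now this opens Jupyter in the browser instead of the interactive pyspark shell
pyspark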