How to install pyspark
Note: We will assume here that you are using Ubuntu with bash, so the default shell
configuration file is ~/.bashrc. If you are using a Mac with the standard configuration, you would
need to use ~/.bash_profile instead. And finally, if you are using zsh, you most likely know what
to do (use ~/.zshrc or ~/.zshenv).
Download Spark
First you need to download Spark from
https://spark.apache.org/downloads.html
and save it wherever you want. I’ll download it to ~/programs and then unpack it there. You can
do this, for example, by calling:
cd ~/programs
tar zxvf spark-2.4.0-bin-hadoop2.7.tgz
rm spark-2.4.0-bin-hadoop2.7.tgz
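By the way, you can also do the download itself from the command line. A minimal example, assuming the archive is still available under the usual Apache archive layout (adjust the version and URL to the release you actually want):
cd ~/programs
# download the Spark 2.4.0 / Hadoop 2.7 archive (URL is an assumption, check the downloads page)
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz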
Then I will create a symbolic link to it by calling
ln -s ~/programs/spark-2.4.0-bin-hadoop2.7 ~/programs/spark
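The symlink is not required, but it makes upgrades easier: the configuration below always points to ~/programs/spark, so when a new Spark version comes out you only need to re-point the link. For example (the version number here is just an illustration):
# unpack the new version next to the old one, then re-point the link
ln -sfn ~/programs/spark-2.4.3-bin-hadoop2.7 ~/programs/spark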
In ~/.bashrc you need to add the following lines:
export SPARK_HOME="$HOME/programs/spark"
export PATH=$SPARK_HOME/bin:$PATH
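For the changes to take effect in the current terminal, reload the file; then you can check that the Spark binaries are on your PATH:
source ~/.bashrc
# should print the Spark version banner
spark-submit --version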
Pyspark
We assume that you are using pyenv. Instructions to install it are here:
http://bartek-blog.github.io/python/virtualenv/2018/08/18/Pyenv-and-VirtualEnvs.html
So let’s create an environment for pyspark.
pyenv shell 3.6.8
mkvirtualenv py3.6-spark
pip install pyspark jupyter
Now you can test whether you can enter the pyspark shell by simply running pyspark.
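If you just want a quick sanity check without entering the shell, something like this should be enough:
# verify that the pyspark package is importable in this virtualenv
python -c "import pyspark; print(pyspark.__version__)"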
Mac
On Mac we can additionally install numpy, scipy and scikit-learn from Intel.
pip install intel-numpy intel-scipy intel-scikit-learn pandas
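If you want to confirm that numpy is really using the Intel (MKL) build, one way is to print its build configuration:
# show which BLAS/LAPACK libraries numpy was linked against
python -c "import numpy; numpy.show_config()"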
Ubuntu
Unfortunately, they do not work well with Ubuntu and pyspark, so we should install:
pip install pyspark jupyter pandas numpy scipy scikit-learn jupyter_contrib_nbextensions
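A quick way to check that everything installed correctly is to import the main packages:
# all of these should import without errors inside the py3.6-spark environment
python -c "import pyspark, pandas, numpy, scipy, sklearn; print('ok')"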
Jupyter
If you want to launch jupyter directly when running pyspark, you can add the following lines
to ~/.bashrc:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
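After reloading ~/.bashrc, running pyspark will start a Jupyter notebook server instead of the plain shell; inside a notebook the SparkSession should already be available as spark (and the SparkContext as sc):
source ~/.bashrc
# now this opens Jupyter in the browser instead of the interactive pyspark shell
pyspark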