How to install pyspark
Note: We will assume here that you are using Ubuntu with bash, so the default shell configuration file is ~/.bashrc. If you are using a Mac with the standard configuration, you would need to use ~/.bash_profile instead. And finally, if you are using zsh, you most likely know what to do (use ~/.zshrc or ~/.zshenv).
Download spark
First you need to download Spark from
https://spark.apache.org/downloads.html
wherever you want. I'll download it to ~/programs. Then unpack it there. You can do this, for example, by calling:
cd ~/programs
tar zxvf spark-2.4.0-bin-hadoop2.7.tgz
rm spark-2.4.0-bin-hadoop2.7.tgz
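By the way, the download itself can also be done from the terminal, for example with wget (the exact mirror URL below is an assumption; the current links are listed on the downloads page):
cd ~/programs
# archive.apache.org keeps old releases; adjust the URL to your version
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz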
Then I will create a symbolic link to it, so that upgrading Spark later only means re-pointing the link:
ln -s ~/programs/spark-2.4.0-bin-hadoop2.7 ~/programs/spark
In ~/.bashrc you need to add the following lines:
export SPARK_HOME="$HOME/programs/spark"
export PATH=$SPARK_HOME/bin:$PATH
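After saving ~/.bashrc, reload it and check that the Spark binaries are visible; a quick sanity check:
source ~/.bashrc
echo $SPARK_HOME        # should print something like /home/<user>/programs/spark
spark-submit --version  # should print the Spark 2.4.0 version banner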
Pyspark
We assume that you are using pyenv. Instructions for installing it are here:
http://bartek-blog.github.io/python/virtualenv/2018/08/18/Pyenv-and-VirtualEnvs.html
So let's create an environment for pyspark:
pyenv shell 3.6.8
mkvirtualenv py3.6-spark
pip install pyspark jupyter
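Once the environment is active, you can quickly confirm that pyspark was installed into it:
python -c "import pyspark; print(pyspark.__version__)"
# should print something like 2.4.0
If the printed version differs from the Spark you downloaded above, you may want to pin it, e.g. pip install pyspark==2.4.0.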
Now you can test whether you can enter the pyspark shell by simply running pyspark.
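If you prefer a non-interactive smoke test, you can also pipe a tiny job into the shell (sc is the SparkContext that the pyspark shell predefines; this sketch assumes the shell reads the script from stdin):
pyspark <<'EOF'
print(sc.parallelize(range(100)).sum())  # should print 4950
EOF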
Mac
On Mac we can additionally install numpy, scipy and sklearn from Intel:
pip install intel-numpy intel-scipy intel-scikit-learn pandas
Ubuntu
Unfortunately these do not work well with Ubuntu and pyspark, so we should install:
pip install pyspark jupyter pandas numpy scipy scikit-learn jupyter_contrib_nbextensions
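Note that jupyter_contrib_nbextensions usually needs one extra step to register the extensions with Jupyter:
jupyter contrib nbextension install --user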
Jupyter
If you want to launch Jupyter directly when running pyspark, you can add the following lines to ~/.bashrc:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
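After reloading the configuration, running pyspark should start a Jupyter notebook server instead of the plain shell; inside a new notebook the spark session should already be defined:
source ~/.bashrc
pyspark  # opens Jupyter Notebook; in a notebook, `spark` and `sc` should be ready to use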