Install and Run SparkR - the easy way
Requirements
First, you must have R and Java installed. This is slightly out of the scope of this note, but let me cover it briefly.
Installing Java on Ubuntu:
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer
You can manage the Java version by calling
sudo update-alternatives --config java
Then set JAVA_HOME by editing
sudo nano /etc/environment
and adding the line
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
Then test it by calling
source /etc/environment
echo $JAVA_HOME
java -version
Installing R on Ubuntu
https://launchpad.net/~marutter/+archive/ubuntu/c2d4u
sudo add-apt-repository ppa:marutter/c2d4u
sudo apt update
sudo apt install r-base r-base-dev
Installing Rstudio on Ubuntu
sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-xenial-1.1.453-amd64.deb
sudo gdebi rstudio-xenial-1.1.453-amd64.deb
rm rstudio-xenial-1.1.453-amd64.deb
Installing rJava
Next, you have to see if you can install the R interface to Java. This can be done by calling
sudo R CMD javareconf
Now you can install rJava in R:
install.packages("rJava")
Depending on your operating system this can produce an error, and some extra action may be required. In that case, search for something like "rJava install error ubuntu".
Test it in R by calling, for example:
library(rJava)
.jinit()                         # start the JVM
Double <- J("java.lang.Double")  # get a reference to the java.lang.Double class
d <- new(Double, "10.2")         # call the Double(String) constructor
d
## [1] "Java-Object{10.2}"
R and rJava on Mac
https://github.com/snowflakedb/dplyr-snowflakedb/wiki/Configuring-R-rJava-RJDBC-on-Mac-OS-X
Other packages
You can also install a few packages we may use in the presentation. Simply call:
install.packages(c("rmarkdown", "ggplot2", "magrittr", "whisker", "data.table"))
rmarkdown: the package that makes it possible to create this document
ggplot2: a popular graphing package
magrittr: nicer function chaining (a short demo follows this list)
whisker: it's Movember
data.table, reshape2: an alternative for "data.frame"
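To illustrate the magrittr chaining mentioned above, a minimal sketch (any data frame would do here):
library(magrittr)
# pipe the data left to right instead of nesting function calls
mtcars %>% subset(mpg > 25) %>% nrow()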
Download Spark
The easiest way to download Spark is to get a pre-built version from http://spark.apache.org/downloads.html.
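If you prefer to stay in R, here is a minimal sketch of fetching and unpacking a pre-built archive (the mirror URL and the target folder are assumptions; adjust them to your setup):
target <- path.expand("~/programs")
dir.create(target, showWarnings = FALSE, recursive = TRUE)
spark_url <- "https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz"  # example mirror
dest <- file.path(target, "spark-2.3.0-bin-hadoop2.7.tgz")
download.file(spark_url, destfile = dest, mode = "wb")  # download the pre-built archive
untar(dest, exdir = target)                             # unpack into ~/programs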
Mac
https://www.r-bloggers.com/six-lines-to-install-and-start-sparkr-on-mac-os-x-yosemite/
SparkR test drive
The way we use SparkR here is far from an example of best practice: not only does it not run on more than one computer, but we also isolate the SparkR package from other packages by hardcoding the library path. Still, this should help you set up Spark(R) quickly for a test drive.
Let's say you have downloaded and uncompressed it to the folder
/Users/bartek/programs/spark-2.3.0-bin-hadoop2.7
Every time you see this path, please change it to the one you have (Windows users will probably also have to change the slashes / to backslashes \ and add something like /C/).
Then try to run the following code in R:
spark_path <- '/Users/bartek/programs/spark-2.3.0-bin-hadoop2.7'
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = spark_path)  # set SPARK_HOME only if it is not set already
}
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
##
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
##
## cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
##
## as.data.frame, colnames, colnames<-, drop, endsWith,
## intersect, rank, rbind, sample, startsWith, subset, summary,
## transform, union
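Note that SparkR masks a number of functions from stats and base; if you need an original in the same session, call it with an explicit prefix such as stats::filter().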
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
## Spark package found in SPARK_HOME: /Users/bartek/programs/spark-2.3.0-bin-hadoop2.7
## Launching java with spark-submit command /Users/bartek/programs/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --driver-memory "2g" sparkr-shell /var/folders/w1/mtb5t1yd28l4kwz656bvhtmw0000gn/T//RtmpecBSLg/backend_port59ea6ee936dd
## Java ref type org.apache.spark.sql.SparkSession id 1
library('magrittr')
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:SparkR':
##
## not
library('ggplot2')
df <- as.DataFrame(mtcars)  # copy the local data.frame into Spark
model <- glm(mpg ~ wt, data = df, family = "gaussian")  # SparkR's glm, not stats::glm
summary(model)
##
## Deviance Residuals:
## (Note: These are approximate quantiles with relative error <= 0.01)
## Min 1Q Median 3Q Max
## -4.5432 -2.6110 -0.2001 1.2973 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 0.000e+00
## wt -5.3445 0.5591 -9.559 1.294e-10
##
## (Dispersion parameter for gaussian family taken to be 9.277398)
##
## Null deviance: 1126.05 on 31 degrees of freedom
## Residual deviance: 278.32 on 30 degrees of freedom
## AIC: 166
##
## Number of Fisher Scoring iterations: 1
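As a sanity check (a minimal sketch, not part of the Spark pipeline), the same model fitted locally with base R gives essentially the same coefficients as the SparkR output above:
local_model <- lm(mpg ~ wt, data = mtcars)  # plain base R, runs locally
coef(local_model)  # roughly 37.29 (intercept) and -5.34 (wt), matching the estimates above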
predictions <- predict(model, newData = df)
class(predictions)
## [1] "SparkDataFrame"
## attr(,"package")
## [1] "SparkR"
predictions %>%
  select("wt", "mpg", "prediction") %>%  # still a SparkDataFrame
  collect %>%                            # pull the result into a local data.frame
  ggplot() + geom_point(aes(wt, prediction - mpg)) +
  geom_hline(yintercept = 0) + theme_bw()
Sparklyr
One can also install sparklyr. First, we install devtools:
install.packages("devtools")
devtools::install_github("hadley/devtools") ## for latest version
devtools::install_github("rstudio/sparklyr")
devtools::install_github("tidyverse/dplyr")
Check available versions:
library(sparklyr)
spark_available_versions()
## spark
## 1 1.6.3
## 2 1.6.2
## 3 1.6.1
## 4 1.6.0
## 5 2.0.0
## 6 2.0.1
## 7 2.0.2
## 8 2.1.0
## 9 2.1.1
## 10 2.2.0
## 11 2.2.1
## 12 2.3.0
spark_install(version = "2.3.0")
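sparklyr is typically used together with dplyr. Here is a minimal sketch of that workflow, assuming the local master and the mtcars data used earlier (the filter and summary are just illustrations):
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")                # single-node Spark
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)  # ship the data to Spark
mtcars_tbl %>%
  filter(wt > 3) %>%                                 # executed by Spark, not by R
  summarise(avg_mpg = mean(mpg)) %>%
  collect()                                          # bring the one-row result back to R
spark_disconnect(sc)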
SparkR in sparklyr
library(sparklyr)
sc <- spark_connect(master = "local")
spark_path <- sc$spark_home  # reuse the Spark installation managed by sparklyr
spark_disconnect(sc)
Sys.setenv(SPARK_HOME = spark_path)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
## Java ref type org.apache.spark.sql.SparkSession id 1
Updated 2018-08-19