How to deploy Amazon EMR
What is EMR?
Elastic MapReduce (EMR) is a service that lets you deploy a Hadoop cluster on the Amazon cloud.
Here we explain how to deploy an EMR cluster.
Data
Here we use data from a Kaggle competition:
https://www.kaggle.com/c/competitive-data-science-predict-future-sales
We assume the files have been copied into the bucket s3://bartek-ml-course. You can do this as follows.
First download the files to the directory predict_future_sales and then run:
for f in predict_future_sales/*; do
    aws --profile=myaws s3 cp "$f" s3://bartek-ml-course/predict_future_sales/
done
Of course, you have to replace s3://bartek-ml-course with the bucket you have created.
We assume that you have installed the AWS CLI. If not, please refer to AWS-CLI-And-S3.
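As an alternative to the loop above, the whole directory can be copied in a single call with the `--recursive` option of `aws s3 cp`. The following is a minimal sketch; the function name `upload_dir` is hypothetical, and the default profile `myaws` is simply taken from the loop above.

```shell
# Upload every file in a local directory to an S3 prefix in one call.
# Usage: upload_dir <local_dir> <bucket> [profile]
# upload_dir is a hypothetical helper name, not part of the AWS CLI.
upload_dir() {
    src=$1
    bucket=$2
    profile=${3:-myaws}   # assumed default, same profile as in the loop above
    # --recursive copies all files under src; the S3 prefix is the directory's base name
    aws --profile="$profile" s3 cp "$src" \
        "s3://$bucket/$(basename "$src")/" --recursive
}

# Example (not executed here):
#   upload_dir predict_future_sales bartek-ml-course
```

Pass the bucket name without the s3:// prefix; the function adds it.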
Create cluster
Go to https://console.aws.amazon.com/elasticmapreduce and click Create cluster.
Then configure the cluster in the console.
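The console steps can also be scripted with `aws emr create-cluster`. The sketch below rests on several assumptions that are not in the original: the release label, instance type, and instance count are illustrative values; `create_cluster` is a hypothetical helper name; and `--use-default-roles` only works if the default EMR roles already exist in your account (they can be created once with `aws emr create-default-roles`).

```shell
# Create a small Hadoop cluster from the command line.
# Usage: create_cluster <cluster_name> <ec2_key_pair_name>
# create_cluster is a hypothetical helper; all values below are assumptions.
create_cluster() {
    aws --profile=myaws emr create-cluster \
        --name "$1" \
        --release-label emr-6.15.0 \
        --applications Name=Hadoop \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --ec2-attributes KeyName="$2"
}

# Example (not executed here); the key pair name is assumed to match the .pem file:
#   create_cluster my-emr-cluster barteks-aws
```

On success the command prints a JSON document containing the new cluster's id, which the other `aws emr` commands need.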
Connect
To connect, you need the master node's IP address.
Next, create a security group.
Then assign the security group to the master node.
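The same information can be obtained from the command line. Note that this sketch takes a slightly different route from the console steps above: instead of creating a new security group and assigning it, it opens SSH (port 22) on the master security group that EMR already manages. The helper names, the cluster id, and the CIDR in the examples are hypothetical placeholders.

```shell
# Print the master node's public DNS name.
# Usage: master_dns <cluster_id>
master_dns() {
    aws --profile=myaws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.MasterPublicDnsName' --output text
}

# Print the id of the EMR-managed security group attached to the master node.
# Usage: master_security_group <cluster_id>
master_security_group() {
    aws --profile=myaws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.Ec2InstanceAttributes.EmrManagedMasterSecurityGroup' \
        --output text
}

# Allow inbound SSH from one address range.
# Usage: allow_ssh <security_group_id> <cidr>
allow_ssh() {
    aws --profile=myaws ec2 authorize-security-group-ingress \
        --group-id "$1" --protocol tcp --port 22 --cidr "$2"
}

# Example (not executed here; id and CIDR are placeholders):
#   sg=$(master_security_group j-XXXXXXXXXXXXX)
#   allow_ssh "$sg" 203.0.113.10/32
```

Restricting the CIDR to your own address (a /32) keeps the SSH port closed to everyone else.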
From a terminal, execute:
ssh -i ~/.ssh/barteks-aws.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com
where XX.XX.XX.XX is the master's IP address (written with dashes in the host name).
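The host name in the command above is the IP address with dots replaced by dashes. A small sketch of that conversion follows; `master_host` is a hypothetical helper, the address in the example is a placeholder, and the `compute-1.amazonaws.com` suffix is copied from the command above, so it may differ in other regions.

```shell
# Build the EC2 public host name from a dotted IPv4 address.
# Usage: master_host <ip_address>
# master_host is a hypothetical helper; the domain suffix is taken from the
# ssh command above and is an assumption for any other region.
master_host() {
    dashed=$(printf '%s' "$1" | tr '.' '-')   # 203.0.113.10 -> 203-0-113-10
    printf 'ec2-%s.compute-1.amazonaws.com\n' "$dashed"
}

# Example (not executed here):
#   ssh -i ~/.ssh/barteks-aws.pem "hadoop@$(master_host 203.0.113.10)"
```

The AWS CLI also provides `aws emr ssh --cluster-id <id> --key-pair-file <file>`, which looks up the master address itself.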