What is EMR?

Elastic MapReduce (EMR) is a service for deploying a Hadoop cluster on the Amazon cloud.

This page explains how to deploy an EMR cluster.

Data

Here we use data from a Kaggle competition (https://www.kaggle.com/c/competitive-data-science-predict-future-sales). We assume the files have been copied into the bucket s3://bartek-ml-course. You can do this as follows. First download the files to the directory predict_future_sales, then run:

for f in predict_future_sales/*; do
  aws --profile=myaws s3 cp "$f" s3://bartek-ml-course/predict_future_sales/
done

Of course, you have to replace s3://bartek-ml-course with the bucket you created.

We assume that you have installed the AWS CLI. If not, please refer to AWS-CLI-And-S3.
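The upload loop above can also be written as a single sync call. The following is only a sketch: the function name upload_dataset is made up for illustration, and the directory, bucket and profile in the example call are the ones used on this page.

```shell
# Sketch: upload the whole dataset directory with one command instead of a loop.
# Arguments: $1 local directory, $2 destination bucket, $3 AWS CLI profile name.
upload_dataset() {
  local src_dir="$1"
  local bucket="$2"
  local profile="$3"
  local prefix
  prefix="$(basename "$src_dir")"

  # "aws s3 sync" copies only new or changed files, so it is safe to re-run.
  aws --profile "$profile" s3 sync "$src_dir" "$bucket/$prefix/" || return 1

  # List the uploaded objects to check that the copy worked.
  aws --profile "$profile" s3 ls "$bucket/$prefix/"
}

# Example call (replace the bucket with your own):
# upload_dataset predict_future_sales s3://bartek-ml-course myaws
```

Compared with the loop, sync skips files that are already in the bucket, which helps when an upload was interrupted.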

Create cluster

Go to https://console.aws.amazon.com/elasticmapreduce and click Create cluster.

Then fill in the cluster settings as shown: [screenshot]
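The console steps above can also be approximated from the command line. This is a minimal sketch, not the exact configuration from the screenshot: the function name create_emr_cluster, the cluster name, the release label emr-6.15.0, the instance type m5.xlarge and the instance count 3 are all assumptions chosen for illustration. It also assumes the default EMR roles already exist (they can be created once with aws emr create-default-roles) and that the EC2 key pair is named after the .pem file used later on this page.

```shell
# Sketch: create a small Hadoop cluster on EMR and print its cluster id.
# Arguments: $1 cluster name, $2 EC2 key pair name, $3 AWS CLI profile name.
create_emr_cluster() {
  local name="$1"
  local key_name="$2"
  local profile="$3"

  # Release label, instance type and instance count are illustrative values;
  # adjust them to match what you would pick in the console.
  aws --profile "$profile" emr create-cluster \
    --name "$name" \
    --release-label emr-6.15.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName="$key_name" \
    --query ClusterId --output text
}

# Example call (hypothetical cluster name):
# create_emr_cluster ml-course-cluster barteks-aws myaws
```

The printed cluster id (it starts with j-) is what other aws emr commands expect as --cluster-id.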

Connect

To connect, you need the master node's IP address: [screenshot]

Then create a security group: [screenshots]

Now assign the security group to the master node: [screenshots]
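For reference, creating the security group can also be sketched with the AWS CLI. We assume here that the group only needs to allow inbound SSH (TCP port 22), since that is what the connection step uses; the screenshots may show different rules. The function name create_ssh_security_group and the values in the example call (group name, VPC id, address range) are hypothetical. Assigning the group to the master node is still done as described above.

```shell
# Sketch: create a security group that allows inbound SSH and print its id.
# Arguments: $1 group name, $2 VPC id, $3 allowed address range (CIDR),
#            $4 AWS CLI profile name.
create_ssh_security_group() {
  local group_name="$1"
  local vpc_id="$2"
  local cidr="$3"
  local profile="$4"
  local group_id

  # Create the group and capture the generated group id.
  group_id="$(aws --profile "$profile" ec2 create-security-group \
    --group-name "$group_name" \
    --description "SSH access to the EMR master node" \
    --vpc-id "$vpc_id" \
    --query GroupId --output text)" || return 1

  # Assumption: only SSH (TCP port 22) from the given range is needed.
  aws --profile "$profile" ec2 authorize-security-group-ingress \
    --group-id "$group_id" \
    --protocol tcp --port 22 --cidr "$cidr" >/dev/null || return 1

  echo "$group_id"
}

# Example call (hypothetical values; use your own VPC id and IP address):
# create_ssh_security_group emr-ssh vpc-0abc1234 203.0.113.7/32 myaws
```

Restricting the range to your own address (a /32) is safer than opening port 22 to everyone.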

From terminal execute:

ssh -i ~/.ssh/barteks-aws.pem hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

where XX-XX-XX-XX is the master's public IP address with the dots replaced by dashes.
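The host name in the ssh command can be derived instead of typed by hand. Below is a small sketch with two helper functions whose names (ip_to_ec2_host, master_dns) are made up for this page. The first builds the host name from an IP address, assuming the compute-1.amazonaws.com suffix from the command above; the second asks EMR for the master's public DNS name, given a cluster id. The IP address and the cluster id in the example calls are placeholders.

```shell
# Sketch: build the EC2 host name used in the ssh command from an IP address.
# Argument: $1 public IP address of the master node.
ip_to_ec2_host() {
  # Replace the dots in the address with dashes, then add prefix and suffix.
  printf 'ec2-%s.compute-1.amazonaws.com\n' "$(printf '%s' "$1" | tr . -)"
}

# Sketch: look up the master's public DNS name directly from EMR.
# Arguments: $1 cluster id, $2 AWS CLI profile name.
master_dns() {
  aws --profile "$2" emr describe-cluster --cluster-id "$1" \
    --query Cluster.MasterPublicDnsName --output text
}

# Example calls (placeholder address and cluster id):
# ip_to_ec2_host 203.0.113.25   # prints ec2-203-0-113-25.compute-1.amazonaws.com
# ssh -i ~/.ssh/barteks-aws.pem "hadoop@$(master_dns j-XXXXXXXXXXXXX myaws)"
```

Looking the name up with master_dns avoids mistakes when the cluster is recreated and the address changes.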
