How to access S3 from pyspark
Running pyspark
I assume that you have already installed pyspark, for example by following the guide here:
http://bartek-blog.github.io/python/spark/pyspark/2019/03/27/how-to-install-pyspark.html
Then start pyspark with the hadoop-aws package:
pyspark --packages=org.apache.hadoop:hadoop-aws:2.7.3
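Alternatively, if you create the SparkSession yourself (for example inside a Jupyter notebook), you can pull in the same package through the spark.jars.packages option. A minimal sketch; the app name here is arbitrary:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("s3-example") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3") \
    .getOrCreate()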
Code
Read AWS configuration
For more details on how to configure AWS access, see http://bartek-blog.github.io/s3/cli/aws/python/boto3/2018/09/10/AWS-CLI-And-S3.html
import configparser
import os

aws_profile = "myaws"

# Read the credentials stored by the AWS CLI for the given profile
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
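If you have boto3 installed (see the article linked above), you can also let it resolve the profile instead of parsing the file yourself. A sketch, assuming the same "myaws" profile:
import boto3

# Let boto3 look up the profile in ~/.aws/credentials
session = boto3.Session(profile_name=aws_profile)
credentials = session.get_credentials()
access_id = credentials.access_key
access_key = credentials.secret_key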
Configure Hadoop
# Set the S3 credentials on Spark's underlying Hadoop configuration
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)
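Note that s3n is the older S3 connector; the hadoop-aws 2.7.x package also ships the newer s3a filesystem, which is generally recommended for new work. If you prefer s3a:// paths, the equivalent settings look roughly like this:
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)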
Read data
sdf = spark.read.option("header", "true").csv("s3n://bartek-ml-course/predict_future_sales/sales_train.csv.gz")
sdf.printSchema()
root
|-- date: string (nullable = true)
|-- date_block_num: string (nullable = true)
|-- shop_id: string (nullable = true)
|-- item_id: string (nullable = true)
|-- item_price: string (nullable = true)
|-- item_cnt_day: string (nullable = true)
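As you can see, every column comes back as a string. If you want Spark to guess the column types instead (at the cost of an extra pass over the data), you can ask it to infer the schema:
sdf = spark.read.option("header", "true").option("inferSchema", "true")\
    .csv("s3n://bartek-ml-course/predict_future_sales/sales_train.csv.gz")
sdf.printSchema()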
Write data
import pyspark.sql.functions as F
sdf.groupBy("date").agg(F.sum(F.col("item_cnt_day")).alias("items"))\
.repartition(1)\
.write.mode("overwrite")\
.parquet("s3n://bartek-ml-course/predict_future_sales-aggregations/daily-total-sales")
spark.read.parquet("s3n://bartek-ml-course/predict_future_sales-aggregations/daily-total-sales").show()
+----------+------+
| date| items|
+----------+------+
|16.02.2013|6643.0|
|09.02.2014|4646.0|
|01.09.2014|2887.0|
|18.10.2014|5001.0|
|27.06.2015|2563.0|
|17.09.2015|1887.0|
|29.04.2013|2771.0|
|12.04.2013|3947.0|
|18.09.2014|2441.0|
|15.08.2015|2201.0|
|28.10.2015|3593.0|
|05.02.2013|3302.0|
|21.09.2013|6698.0|
|31.05.2014|5395.0|
|02.11.2014|4390.0|
|08.07.2015|1905.0|
|13.09.2015|2660.0|
|06.10.2015|1343.0|
|13.06.2013|3399.0|
|22.02.2014|8472.0|
+----------+------+
only showing top 20 rows
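One caveat: since date was read as a string in dd.MM.yyyy format, the rows above come back in arbitrary order. Assuming Spark 2.2+ (where to_date accepts a format pattern), you can convert it to a proper date before sorting:
daily = spark.read.parquet("s3n://bartek-ml-course/predict_future_sales-aggregations/daily-total-sales")
daily.withColumn("date", F.to_date(F.col("date"), "dd.MM.yyyy")).orderBy("date").show()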