AWS CLI, S3 And Boto3
Amazon S3
What S3 is
Amazon S3 (Simple Storage Service) is Amazon's service for storing files. It is simple in the sense that you store data using the following concepts:
- bucket: a place to store objects. Its name is unique across all S3 users, meaning two buckets with the same name cannot exist, even if they belong to different users.
- key: a name, unique within a bucket, that refers to a stored object. It is common to use path-like syntax to group objects.
- object: any file (text or binary). It can be partitioned.
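Since keys are plain strings, the path-like grouping is only a convention; a minimal sketch (using key names that appear later in this tutorial) of how prefix filtering works on a flat list of keys:

```python
# S3 keys are flat strings; "directories" are just shared key prefixes.
keys = [
    "sample/train_sample10.csv",
    "sample/train_sample100.csv",
    "train.csv",
]

def keys_with_prefix(keys, prefix):
    """Mimic S3 prefix filtering on a plain list of key strings."""
    return [k for k in keys if k.startswith(prefix)]

print(keys_with_prefix(keys, "sample/"))
```

This is the same behaviour you get from `bucket.objects.filter(Prefix="sample/")` later on, just without any network access.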
Sign up
First go to https://s3.console.aws.amazon.com/s3
and sign up for S3. You can also try to create a bucket, upload files, etc. Here we will explain how to use it programmatically.
Data
But first let’s get the data we are going to use here. We take the dataset train.csv
from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
and store it locally in the data
directory.
Sampling data
We also sample this dataset in order to have one more example (and faster execution).
import numpy as np
import pandas as pd
np.random.seed(10)
comments = pd.read_csv("data/train.csv")
nrows = comments.shape[0]
comments.iloc[np.random.choice(range(nrows), 10000, replace=False)]\
.to_csv("data/train_sample10000.csv", index=False)
comments.iloc[np.random.choice(range(nrows), 1000, replace=False)]\
.to_csv("data/train_sample1000.csv", index=False)
comments.iloc[np.random.choice(range(nrows), 100, replace=False)]\
.to_csv("data/train_sample100.csv", index=False)
comments10 = comments.iloc[np.random.choice(range(nrows), 10, replace=False)]
comments10.to_csv("data/train_sample10.csv", index=False)
comments10
 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
---|---|---|---|---|---|---|---|---|
58764 | 9d5dbcb8a5b4ffe7 | Excuse me? \n\nHi there. This is . I was just ... | 0 | 0 | 0 | 0 | 0 | 0 |
131811 | c14eac99440f267c | Millionaire is at GAN... \n\n…and the review h... | 0 | 0 | 0 | 0 | 0 | 0 |
88460 | eca71b12782e19dd | SHUT yOUR bUTT \n\nThats right, i siad it. I h... | 1 | 0 | 1 | 1 | 0 | 0 |
116091 | 6cb62773403858a4 | "\n I agree. Remove. flash; " | 0 | 0 | 0 | 0 | 0 | 0 |
42014 | 7013c411cfcfc56a | OK, I will link them on the talk page - could ... | 0 | 0 | 0 | 0 | 0 | 0 |
49713 | 84ee5646920773c5 | err... What exactly happens with Serviceman? | 0 | 0 | 0 | 0 | 0 | 0 |
103293 | 28ca8dcc0b342980 | i am a newbe i dont even know how to type on t... | 0 | 0 | 0 | 0 | 0 | 0 |
95607 | ffb366cd60c48f56 | "\nAbsolutely agree. No relevance to either hi... | 0 | 0 | 0 | 0 | 0 | 0 |
83139 | de66043ff744144b | Thats what I think did i changed plot to story... | 0 | 0 | 0 | 0 | 0 | 0 |
90771 | f2d6367d798492d9 | "I will improve references. Again, please do n... | 0 | 0 | 0 | 0 | 0 | 0 |
Installing AWS Command Line Interface and boto3
In order to install boto3 (the Python interface to Amazon Web Services) and the AWS Command Line Interface (CLI), type:
pip install boto3
pip install awscli
Then, in your home directory, create the file ~/.aws/credentials
with the following content:
[myaws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
If you add this configuration as [default], you won’t need to add --profile myaws to the CLI commands in Section CLI Basic Commands.
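The credentials file uses the INI format, so it can be inspected with Python's standard configparser module; a minimal sketch parsing an in-memory copy of the snippet above (the key values are placeholders):

```python
import configparser

# The credentials file is INI-formatted; each section is one profile.
sample = """\
[myaws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
"""

config = configparser.ConfigParser()
# In practice you would call config.read(os.path.expanduser("~/.aws/credentials"))
config.read_string(sample)

print(config.sections())
print(config["myaws"]["aws_access_key_id"])
```

This is roughly what boto3 does internally when you pass profile_name='myaws'.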
Where to get credentials from
1. Go to https://console.aws.amazon.com/console/home and log in.
2. Click on your user name (top right) and select My Security Credentials.
3. Click on Access keys (access key ID and secret access key), then on Create New Access Key.
4. Choose Show Access Key.
CLI Basic Commands
List buckets
aws --profile myaws s3 ls
Create buckets
aws --profile myaws s3 mb s3://barteks-toxic-comments
Warning: the bucket namespace is shared by all users of the system, so you need to change the name.
Upload and download files
Upload
aws --profile myaws s3 cp data/train.csv s3://barteks-toxic-comments
aws --profile myaws s3 cp data/train_sample10000.csv s3://barteks-toxic-comments/sample/
aws --profile myaws s3 cp data/train_sample1000.csv s3://barteks-toxic-comments/sample/
aws --profile myaws s3 cp data/train_sample100.csv s3://barteks-toxic-comments/sample/
aws --profile myaws s3 cp data/train_sample10.csv s3://barteks-toxic-comments/sample/
The last four commands can be done with a single shell loop:
for f in data/train_sample1*.csv; do aws --profile myaws s3 cp $f s3://barteks-toxic-comments/sample/; done
Download
aws --profile myaws s3 cp s3://barteks-toxic-comments/sample/train_sample10.csv data/train_copy_sample10.csv
List files in path
aws --profile myaws s3 ls s3://barteks-toxic-comments/
aws --profile myaws s3 ls s3://barteks-toxic-comments/sample/
Remove file(s)
aws --profile myaws s3 rm s3://barteks-toxic-comments/sample/train_sample2.csv
aws --profile myaws s3 rm s3://barteks-toxic-comments/sample/ --recursive
Delete bucket
For deleting a bucket use
aws --profile myaws s3 rb s3://barteks-toxic-comments
In order to delete a non-empty bucket, add the --force
option.
In order to empty a bucket, use
aws --profile myaws s3 rm s3://barteks-toxic-comments/ --recursive
What Boto is
Boto is a Python package that provides interfaces to Amazon Web Services. Here we are focused on its application to S3.
Creating S3 Resource
We start using boto3 by creating an S3 resource object.
import boto3
session = boto3.Session(profile_name='myaws')
s3 = session.resource('s3')
From environment variables
If your credentials are stored in the environment variables AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY
then you can do the following:
import os
aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key)
List buckets
list(s3.buckets.all())
[s3.Bucket(name='barteks'),
s3.Bucket(name='barteks-mess-nlp'),
s3.Bucket(name='barteks-toxic-comments'),
s3.Bucket(name='barteks-toxic-comments-stats'),
s3.Bucket(name='edreams2018')]
Create a bucket
Warning: as before, the bucket namespace is shared, so the following command may not produce a bucket if one with that name already exists.
#s3.create_bucket(
# ACL='public-read',
# Bucket="barteks-toxic-comments-stats")
And you have the following Access Control List (ACL) options while creating it:
- 'private'
- 'public-read'
- 'public-read-write'
- 'authenticated-read'
Deleting
#bucket = s3.Bucket('barteks-toxic-comments-stats')
#bucket.delete()
List keys in the bucket
bucket = s3.Bucket('barteks-toxic-comments')
objs = [obj for obj in bucket.objects.all()]
objs
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='train.csv')]
[obj.key for obj in bucket.objects.filter(Prefix="sample/")]
['sample/train_sample10.csv',
'sample/train_sample100.csv',
'sample/train_sample1000.csv',
'sample/train_sample10000.csv']
An object of class ObjectSummary
has the method Bucket()
(which returns the corresponding Bucket object) and two properties, bucket_name
and key
, which return strings.
objs[0].Bucket(), objs[0].bucket_name, objs[0].key
(s3.Bucket(name='barteks-toxic-comments'),
'barteks-toxic-comments',
'sample/train_sample10.csv')
Filter keys and sort them
objects = [obj for obj in bucket.objects.filter(Prefix="sample/")]
objects.sort(key=lambda obj: obj.key, reverse=True)
objects
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv')]
Download file
bucket = s3.Bucket('barteks-toxic-comments')
bucket.download_file('sample/train_sample10.csv', "data/train_copy2_sample10.csv")
Transform to pandas.DataFrame
One way to do this is to download the file and open it with the pandas.read_csv
method. If we do not want to do this, we have to read it into a buffer and open it from there. To do this we need to use the low-level interface.
import io
obj = s3.Object('barteks-toxic-comments', 'sample/train_sample100.csv').get()
comments100 = pd.read_csv(io.BytesIO(obj['Body'].read()))
comments100.head()
 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
---|---|---|---|---|---|---|---|---|
0 | 2e9c4b5d271ed9e2 | From McCrillis Nsiah=\n\nI'm welcome again aft... | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 717f6930af943c80 | "\n\n Invitation \n I'd like to invite you to... | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6fbf60373657a531 | "=Tropical Cyclone George=====\nNamed George, ... | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 9deaefedc0fcb51f | No. I agree with BenBuff91 statement. The AFDI... | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 345bedef916b9f9e | . It seems the typical paranoid and prejudiced... | 0 | 0 | 0 | 0 | 0 | 0 |
Another way, using the higher-level download_fileobj
, requires transforming the byte stream into a text stream.
f = io.BytesIO()
bucket.download_fileobj('sample/train_sample10.csv', f)
f.seek(0)
pd.read_csv(io.TextIOWrapper(f, encoding='utf-8'))
 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
---|---|---|---|---|---|---|---|---|
0 | 9d5dbcb8a5b4ffe7 | Excuse me? \n\nHi there. This is . I was just ... | 0 | 0 | 0 | 0 | 0 | 0 |
1 | c14eac99440f267c | Millionaire is at GAN... \n\n…and the review h... | 0 | 0 | 0 | 0 | 0 | 0 |
2 | eca71b12782e19dd | SHUT yOUR bUTT \n\nThats right, i siad it. I h... | 1 | 0 | 1 | 1 | 0 | 0 |
3 | 6cb62773403858a4 | "\n I agree. Remove. flash; " | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 7013c411cfcfc56a | OK, I will link them on the talk page - could ... | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 84ee5646920773c5 | err... What exactly happens with Serviceman? | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 28ca8dcc0b342980 | i am a newbe i dont even know how to type on t... | 0 | 0 | 0 | 0 | 0 | 0 |
7 | ffb366cd60c48f56 | "\nAbsolutely agree. No relevance to either hi... | 0 | 0 | 0 | 0 | 0 | 0 |
8 | de66043ff744144b | Thats what I think did i changed plot to story... | 0 | 0 | 0 | 0 | 0 | 0 |
9 | f2d6367d798492d9 | "I will improve references. Again, please do n... | 0 | 0 | 0 | 0 | 0 | 0 |
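The pattern above (write bytes, rewind, wrap in a text layer) works with any binary file-like object, not just S3 downloads; a stdlib-only sketch using the csv module and made-up data:

```python
import csv
import io

# A BytesIO stands in for the buffer that download_fileobj would fill.
raw = io.BytesIO()
raw.write(b"id,comment_text\n1,hello\n2,world\n")
raw.seek(0)  # rewind before reading, just as after download_fileobj

# TextIOWrapper turns the binary stream into a text stream for csv.
rows = list(csv.reader(io.TextIOWrapper(raw, encoding="utf-8")))
print(rows)
```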
Upload file
stat_bucket = s3.Bucket("barteks-toxic-comments-stats")
comments100stat = \
comments100.groupby(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])\
.count().reset_index()
comments100stat.to_csv("data/train_sample100stat.csv", index=False)
stat_bucket.upload_file("data/train_sample100stat.csv", 'sample/train_sample100stat.csv')
list(bucket.objects.all())
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='train.csv')]
With buffer
Note that upload_fileobj expects a binary file-like object, so we encode the CSV to bytes first.
import io
f = io.BytesIO(comments100stat.to_csv(index=False).encode('utf-8'))
stat_bucket.upload_fileobj(f, 'sample/train_sample100stat_copy.csv')
list(bucket.objects.all())
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='train.csv')]
Delete
obj = s3.Object('barteks-toxic-comments', 'sample/train_copy2_sample10.csv')
obj.delete()
{'ResponseMetadata': {'HTTPHeaders': {'date': 'Fri, 02 Nov 2018 15:39:39 GMT',
'server': 'AmazonS3',
'x-amz-id-2': 'CSAuR7e4fWUqg2YuQ8i3gkca1/wGN56Fv3Mt7//D1VmwVm7M2a94FHrJhS0ks4yRFxuPyCB6B8U=',
'x-amz-request-id': '80F7365FBF37C732'},
'HTTPStatusCode': 204,
'HostId': 'CSAuR7e4fWUqg2YuQ8i3gkca1/wGN56Fv3Mt7//D1VmwVm7M2a94FHrJhS0ks4yRFxuPyCB6B8U=',
'RequestId': '80F7365FBF37C732',
'RetryAttempts': 0}}
S3 client: low-level access
s3_client = session.client('s3')
Access through http(s)
Change Access Control
obj = s3.Object('barteks-toxic-comments-stats', 'sample/train_sample100stat_copy.csv')
obj.Acl().put(ACL='public-read')
{'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
'date': 'Fri, 02 Nov 2018 15:39:39 GMT',
'server': 'AmazonS3',
'x-amz-id-2': 'n/UeTtw/7MUHgi1tBDFBeJ7mVoyjcenZekIC+qgNQ9izGyTeEAY+PZ9IAJ77g/39EOFSHgI46rY=',
'x-amz-request-id': '76736BA5657E239C'},
'HTTPStatusCode': 200,
'HostId': 'n/UeTtw/7MUHgi1tBDFBeJ7mVoyjcenZekIC+qgNQ9izGyTeEAY+PZ9IAJ77g/39EOFSHgI46rY=',
'RequestId': '76736BA5657E239C',
'RetryAttempts': 0}}
URIs
There are two URI formats:
http(s)://s3.amazonaws.com/<bucket>/<object>
http(s)://<bucket>.s3.amazonaws.com/<object>
Example
https://s3.amazonaws.com/barteks-toxic-comments-stats/sample/train_sample100stat_copy.csv
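Both forms can be assembled with plain string formatting; a small sketch building the two URL styles for the example object above:

```python
def s3_urls(bucket, key):
    """Return the two URL styles: path-style and virtual-hosted-style."""
    return (
        f"https://s3.amazonaws.com/{bucket}/{key}",
        f"https://{bucket}.s3.amazonaws.com/{key}",
    )

path_style, hosted_style = s3_urls(
    "barteks-toxic-comments-stats", "sample/train_sample100stat_copy.csv")
print(path_style)
print(hosted_style)
```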
Streaming with smart_open
Install
pip install smart_open
from smart_open import smart_open
comments1000 = \
pd.read_csv(
smart_open(
's3://barteks-toxic-comments/sample/train_sample1000.csv', 'rb',
profile_name='myaws'))
comments1000_stat =\
comments1000.groupby(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])\
.count().reset_index()
comments1000_stat.head()
 | toxic | severe_toxic | obscene | threat | insult | identity_hate | id | comment_text |
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 894 | 894 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 4 | 4 |
2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
3 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 3 |
4 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
Passing session
pd.read_csv(smart_open(
's3://barteks-toxic-comments/sample/train_sample100.csv', 'rb',
s3_session=session)
).head()
 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
---|---|---|---|---|---|---|---|---|
0 | 2e9c4b5d271ed9e2 | From McCrillis Nsiah=\n\nI'm welcome again aft... | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 717f6930af943c80 | "\n\n Invitation \n I'd like to invite you to... | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6fbf60373657a531 | "=Tropical Cyclone George=====\nNamed George, ... | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 9deaefedc0fcb51f | No. I agree with BenBuff91 statement. The AFDI... | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 345bedef916b9f9e | . It seems the typical paranoid and prejudiced... | 0 | 0 | 0 | 0 | 0 | 0 |
smart_open is smart enough to recognize where it has to read from; the same call works for a local file:
pd.read_csv(smart_open(
'data/train_sample100.csv', 'rb',
s3_session=session)
).head()
 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
---|---|---|---|---|---|---|---|---|
0 | 2e9c4b5d271ed9e2 | From McCrillis Nsiah=\n\nI'm welcome again aft... | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 717f6930af943c80 | "\n\n Invitation \n I'd like to invite you to... | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6fbf60373657a531 | "=Tropical Cyclone George=====\nNamed George, ... | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 9deaefedc0fcb51f | No. I agree with BenBuff91 statement. The AFDI... | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 345bedef916b9f9e | . It seems the typical paranoid and prejudiced... | 0 | 0 | 0 | 0 | 0 | 0 |
Writing
with smart_open('s3://barteks-toxic-comments-stats/sample/train_sample1000stat123.csv', 'w',
profile_name='myaws') as fout:
comments1000_stat.to_csv(fout, index=False)
import pickle
class Model:
def __init__(self):
self.attr = 123
model = Model()
with smart_open("s3://barteks-toxic-comments-stats/models/model.pickle", 'wb',
profile_name='myaws') as f:
pickle.dump(model, f, pickle.HIGHEST_PROTOCOL)
list(stat_bucket.objects.all())
[s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat.csv.gzip'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat123.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat2.csv.gzip'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample100stat.csv'),
s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample100stat_copy.csv')]
with smart_open("s3://barteks-toxic-comments-stats/models/model.pickle", 'rb',
profile_name='myaws') as f:
model = pickle.load(f)
print(model.attr)
123
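The same dump/load calls work with any binary file-like object, which is what smart_open hands back in 'wb'/'rb' mode; a local sketch of the round trip using io.BytesIO in place of S3:

```python
import io
import pickle

class Model:
    def __init__(self):
        self.attr = 123

# Serialize into an in-memory binary buffer instead of an S3 object.
buf = io.BytesIO()
pickle.dump(Model(), buf, pickle.HIGHEST_PROTOCOL)

buf.seek(0)  # rewind, as with any file-like object, before reading back
restored = pickle.load(buf)
print(restored.attr)
```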
Links:
- https://github.com/boto/boto3
- https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
- https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
Last update: 2018-11-03