Amazon S3

What S3 is

Amazon S3 (Simple Storage Service) is Amazon’s service for storing files. It is simple in the sense that one stores data using the following:

  • bucket: a place to store objects. Its name is unique across all S3 users, meaning that two buckets cannot have the same name even if they belong to different users.
  • key: a name, unique within a bucket, that points to the stored object. It is common to use a path-like syntax to group objects.
  • object: any file (text or binary). It can be partitioned.
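As a sketch, a full S3 address is just the bucket name plus the key, and the path-like syntax in keys is purely a naming convention (there are no real folders). The names below are hypothetical:

```python
# Hypothetical bucket and keys illustrating the bucket/key/object model
bucket = "my-example-bucket"          # bucket names are globally unique
keys = [
    "sample/train_sample10.csv",      # path-like keys group related objects
    "sample/train_sample100.csv",
    "train.csv",
]

# The full S3 address of an object is s3://<bucket>/<key>
uris = [f"s3://{bucket}/{key}" for key in keys]
print(uris[0])  # s3://my-example-bucket/sample/train_sample10.csv

# "Folders" are only a convention: grouping is done by key prefix
sample_keys = [k for k in keys if k.startswith("sample/")]
```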

Sign up

First go to https://s3.console.aws.amazon.com/s3

and sign up for S3. You can also try to create a bucket, upload files, etc. Here we will explain how to use it programmatically.

Data

But first let’s get the data we are going to use here. We take the dataset train.csv from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge and store it locally in the data directory.

Sampling data

We also sample this dataset in order to have additional, smaller examples (and faster execution).

import numpy as np
import pandas as pd
np.random.seed(10)
comments = pd.read_csv("data/train.csv")
nrows = comments.shape[0]
comments.iloc[np.random.choice(range(nrows), 10000, replace=False)]\
    .to_csv("data/train_sample10000.csv", index=False)
comments.iloc[np.random.choice(range(nrows), 1000, replace=False)]\
    .to_csv("data/train_sample1000.csv", index=False)
comments.iloc[np.random.choice(range(nrows), 100, replace=False)]\
    .to_csv("data/train_sample100.csv", index=False)
comments10 = comments.iloc[np.random.choice(range(nrows), 10, replace=False)]
comments10.to_csv("data/train_sample10.csv", index=False)
comments10
id comment_text toxic severe_toxic obscene threat insult identity_hate
58764 9d5dbcb8a5b4ffe7 Excuse me? \n\nHi there. This is . I was just ... 0 0 0 0 0 0
131811 c14eac99440f267c Millionaire is at GAN... \n\n…and the review h... 0 0 0 0 0 0
88460 eca71b12782e19dd SHUT yOUR bUTT \n\nThats right, i siad it. I h... 1 0 1 1 0 0
116091 6cb62773403858a4 "\n I agree. Remove. flash; " 0 0 0 0 0 0
42014 7013c411cfcfc56a OK, I will link them on the talk page - could ... 0 0 0 0 0 0
49713 84ee5646920773c5 err... What exactly happens with Serviceman? 0 0 0 0 0 0
103293 28ca8dcc0b342980 i am a newbe i dont even know how to type on t... 0 0 0 0 0 0
95607 ffb366cd60c48f56 "\nAbsolutely agree. No relevance to either hi... 0 0 0 0 0 0
83139 de66043ff744144b Thats what I think did i changed plot to story... 0 0 0 0 0 0
90771 f2d6367d798492d9 "I will improve references. Again, please do n... 0 0 0 0 0 0

Installing AWS Command Line Interface and boto

In order to install boto (Python interface to Amazon Web Service) and AWS Command Line Interface (CLI) type:

pip install boto3
pip install awscli

Then in your home directory create file ~/.aws/credentials with the following:

[myaws]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

If you add this configuration as [default], you won’t need to add --profile myaws to the CLI commands in Section CLI Basic Commands.

Where to get credentials from

  1. Go to https://console.aws.amazon.com/console/home and log in.
  2. Click on your user name (top right) and select My Security Credentials.
  3. Click on + Access keys (access key ID and secret access key) and then on Create New Access Key.
  4. Choose Show Access Key.

CLI Basic Commands

List buckets

aws --profile myaws s3 ls

Create a bucket

aws --profile myaws s3 mb s3://barteks-toxic-comments

Warning The bucket namespace is shared by all users of the system, so you need to change the name.

Upload and download files

Upload

aws --profile myaws s3 cp data/train.csv s3://barteks-toxic-comments
aws --profile myaws s3 cp data/train_sample10000.csv s3://barteks-toxic-comments/sample/
aws --profile myaws s3 cp data/train_sample1000.csv s3://barteks-toxic-comments/sample/
aws --profile myaws s3 cp data/train_sample100.csv s3://barteks-toxic-comments/sample/
aws --profile myaws s3 cp data/train_sample10.csv s3://barteks-toxic-comments/sample/

The last four commands can be done in the shell by calling:

for f in data/train_sample1*.csv; do aws --profile myaws s3 cp $f s3://barteks-toxic-comments/sample/; done

Download

aws --profile myaws s3 cp s3://barteks-toxic-comments/sample/train_sample10.csv data/train_copy_sample10.csv

List files in path

aws --profile myaws s3 ls s3://barteks-toxic-comments/
aws --profile myaws s3 ls s3://barteks-toxic-comments/sample/

Remove file(s)

aws --profile myaws s3 rm s3://barteks-toxic-comments/sample/train_sample2.csv
aws --profile myaws s3 rm s3://barteks-toxic-comments/sample/ --recursive

Delete bucket

For deleting a bucket use

aws --profile myaws s3 rb  s3://barteks-toxic-comments

In order to delete a non-empty bucket, use the --force option.

In order to empty a bucket, use

aws --profile myaws s3 rm s3://barteks-toxic-comments/ --recursive

What Boto is

Boto is a Python package that provides interfaces to Amazon Web Services. Here we are focused on its application to S3.

Creating S3 Resource

We start using boto3 by creating an S3 resource object.

import boto3
session = boto3.Session(profile_name='myaws')
s3 = session.resource('s3')

From environment variables

If your credentials are stored in the environment variables AWS_SECRET_KEY_ID and AWS_SECRET_ACCESS_KEY, then you can do the following:

import os
aws_access_key_id = os.environ.get('AWS_SECRET_KEY_ID')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
session = boto3.Session(
    aws_access_key_id=aws_access_key_id, 
    aws_secret_access_key=aws_secret_access_key)

List buckets

list(s3.buckets.all())
[s3.Bucket(name='barteks'),
 s3.Bucket(name='barteks-mess-nlp'),
 s3.Bucket(name='barteks-toxic-comments'),
 s3.Bucket(name='barteks-toxic-comments-stats'),
 s3.Bucket(name='edreams2018')]

Create a bucket

Warning As before, the bucket namespace is shared, so the following command may not create a bucket if one with that name already exists.

#s3.create_bucket(
#    ACL='public-read',
#    Bucket="barteks-toxic-comments-stats")

And you have the following Access Control List (ACL) options while creating it:

  • 'private',
  • 'public-read',
  • 'public-read-write',
  • 'authenticated-read'.

Deleting

#bucket = s3.Bucket('barteks-toxic-comments-stats')
#bucket.delete()

List keys in the bucket

bucket = s3.Bucket('barteks-toxic-comments')
objs = [obj for obj in bucket.objects.all()]
objs
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='train.csv')]
[obj.key for obj in bucket.objects.filter(Prefix="sample/")]
['sample/train_sample10.csv',
 'sample/train_sample100.csv',
 'sample/train_sample1000.csv',
 'sample/train_sample10000.csv']

An object of class ObjectSummary has the property Bucket (which returns a Bucket object) and the properties bucket_name and key, which return strings.

objs[0].Bucket(), objs[0].bucket_name, objs[0].key
(s3.Bucket(name='barteks-toxic-comments'),
 'barteks-toxic-comments',
 'sample/train_sample10.csv')

Filter keys and sort them

objects = [obj for obj in bucket.objects.filter(Prefix="sample/")]
objects.sort(key=lambda obj: obj.key, reverse=True)
objects
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv')]
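Note that the reverse sort above is lexicographic, which is why train_sample10000.csv comes before train_sample1000.csv. If you wanted to order the samples by their actual size, one option (a sketch on plain key strings) is to parse the number out of each key:

```python
import re

keys = [
    "sample/train_sample10.csv",
    "sample/train_sample100.csv",
    "sample/train_sample1000.csv",
    "sample/train_sample10000.csv",
]

def sample_size(key):
    # Extract the trailing number from names like 'train_sample100.csv'
    match = re.search(r"train_sample(\d+)\.csv$", key)
    return int(match.group(1)) if match else 0

# Sort numerically rather than lexicographically
keys.sort(key=sample_size, reverse=True)
print(keys[0])  # sample/train_sample10000.csv
```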

Download file

bucket = s3.Bucket('barteks-toxic-comments')
bucket.download_file('sample/train_sample10.csv', "data/train_copy2_sample10.csv")

Transform to pandas.DataFrame

One way to do this is to download the file and open it with the pandas.read_csv method. If we do not want to store the file locally, we have to read it into a buffer and open it from there. In order to do this we need to use the low-level interface.

import io
obj = s3.Object('barteks-toxic-comments', 'sample/train_sample100.csv').get()
comments100 = pd.read_csv(io.BytesIO(obj['Body'].read()))
comments100.head()
id comment_text toxic severe_toxic obscene threat insult identity_hate
0 2e9c4b5d271ed9e2 From McCrillis Nsiah=\n\nI'm welcome again aft... 0 0 0 0 0 0
1 717f6930af943c80 "\n\n Invitation \n I'd like to invite you to... 0 0 0 0 0 0
2 6fbf60373657a531 "=Tropical Cyclone George=====\nNamed George, ... 0 0 0 0 0 0
3 9deaefedc0fcb51f No. I agree with BenBuff91 statement. The AFDI... 0 0 0 0 0 0
4 345bedef916b9f9e . It seems the typical paranoid and prejudiced... 0 0 0 0 0 0

Another way, using the higher-level download_fileobj, requires transforming the byte stream into a text stream.

f = io.BytesIO()
bucket.download_fileobj('sample/train_sample10.csv', f)
f.seek(0)
pd.read_csv(io.TextIOWrapper(f, encoding='utf-8'))
id comment_text toxic severe_toxic obscene threat insult identity_hate
0 9d5dbcb8a5b4ffe7 Excuse me? \n\nHi there. This is . I was just ... 0 0 0 0 0 0
1 c14eac99440f267c Millionaire is at GAN... \n\n…and the review h... 0 0 0 0 0 0
2 eca71b12782e19dd SHUT yOUR bUTT \n\nThats right, i siad it. I h... 1 0 1 1 0 0
3 6cb62773403858a4 "\n I agree. Remove. flash; " 0 0 0 0 0 0
4 7013c411cfcfc56a OK, I will link them on the talk page - could ... 0 0 0 0 0 0
5 84ee5646920773c5 err... What exactly happens with Serviceman? 0 0 0 0 0 0
6 28ca8dcc0b342980 i am a newbe i dont even know how to type on t... 0 0 0 0 0 0
7 ffb366cd60c48f56 "\nAbsolutely agree. No relevance to either hi... 0 0 0 0 0 0
8 de66043ff744144b Thats what I think did i changed plot to story... 0 0 0 0 0 0
9 f2d6367d798492d9 "I will improve references. Again, please do n... 0 0 0 0 0 0
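The bytes-to-text wrapping itself does not depend on S3 at all, so the pattern can be checked locally; here a small in-memory CSV stands in for the downloaded object:

```python
import io
import pandas as pd

# In-memory stand-in for a downloaded S3 object
f = io.BytesIO()
f.write(b"id,toxic\n1,0\n2,1\n")
f.seek(0)  # rewind before reading, just as after download_fileobj

# Wrap the binary buffer in a text stream for pandas
df = pd.read_csv(io.TextIOWrapper(f, encoding="utf-8"))
print(df.shape)  # (2, 2)
```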

Upload file

stat_bucket = s3.Bucket("barteks-toxic-comments-stats")
comments100stat = \
    comments100.groupby(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])\
    .count().reset_index()
comments100stat.to_csv("data/train_sample100stat.csv", index=False)
stat_bucket.upload_file("data/train_sample100stat.csv", 'sample/train_sample100stat.csv')
list(bucket.objects.all())
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='train.csv')]

With buffer

import io
# upload_fileobj expects a binary file-like object, so encode the CSV to bytes
f = io.BytesIO(comments100stat.to_csv(index=False).encode('utf-8'))
stat_bucket.upload_fileobj(f, 'sample/train_sample100stat_copy.csv')
list(bucket.objects.all())
[s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample100.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample1000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='sample/train_sample10000.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments', key='train.csv')]

Delete

obj = s3.Object('barteks-toxic-comments', 'sample/train_copy2_sample10.csv')
obj.delete()
{'ResponseMetadata': {'HTTPHeaders': {'date': 'Fri, 02 Nov 2018 15:39:39 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': 'CSAuR7e4fWUqg2YuQ8i3gkca1/wGN56Fv3Mt7//D1VmwVm7M2a94FHrJhS0ks4yRFxuPyCB6B8U=',
   'x-amz-request-id': '80F7365FBF37C732'},
  'HTTPStatusCode': 204,
  'HostId': 'CSAuR7e4fWUqg2YuQ8i3gkca1/wGN56Fv3Mt7//D1VmwVm7M2a94FHrJhS0ks4yRFxuPyCB6B8U=',
  'RequestId': '80F7365FBF37C732',
  'RetryAttempts': 0}}
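The HTTPStatusCode of 204 (No Content) in the response above is what indicates a successful delete; a minimal check on such a response dict might look like this (the dict here is a stub mirroring the output above):

```python
# Stub of the dict returned by obj.delete(), as shown above
response = {'ResponseMetadata': {'HTTPStatusCode': 204, 'RetryAttempts': 0}}

# S3 returns 204 No Content for a successful delete
status = response['ResponseMetadata']['HTTPStatusCode']
print(status == 204)  # True
```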

S3 client: low-level access

s3_client = session.client('s3')

Access through http(s)

Change Access Control

obj = s3.Object('barteks-toxic-comments-stats', 'sample/train_sample100stat_copy.csv')
obj.Acl().put(ACL='public-read')
{'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
   'date': 'Fri, 02 Nov 2018 15:39:39 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': 'n/UeTtw/7MUHgi1tBDFBeJ7mVoyjcenZekIC+qgNQ9izGyTeEAY+PZ9IAJ77g/39EOFSHgI46rY=',
   'x-amz-request-id': '76736BA5657E239C'},
  'HTTPStatusCode': 200,
  'HostId': 'n/UeTtw/7MUHgi1tBDFBeJ7mVoyjcenZekIC+qgNQ9izGyTeEAY+PZ9IAJ77g/39EOFSHgI46rY=',
  'RequestId': '76736BA5657E239C',
  'RetryAttempts': 0}}

URI

There are two URI formats:

http(s)://s3.amazonaws.com/<bucket>/<object>
http(s)://<bucket>.s3.amazonaws.com/<object>

Example

https://s3.amazonaws.com/barteks-toxic-comments-stats/sample/train_sample100stat_copy.csv
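Both forms can be built from the bucket and key with plain string formatting; a small sketch using the names from the example above:

```python
bucket = "barteks-toxic-comments-stats"
key = "sample/train_sample100stat_copy.csv"

# Path-style: the bucket appears in the URL path
path_style = f"https://s3.amazonaws.com/{bucket}/{key}"

# Virtual-hosted-style: the bucket appears in the hostname
virtual_style = f"https://{bucket}.s3.amazonaws.com/{key}"

print(path_style)
print(virtual_style)
```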

Streaming with smart_open

Install

pip install smart_open

from smart_open import smart_open

comments1000 = \
    pd.read_csv(
        smart_open(
            's3://barteks-toxic-comments/sample/train_sample1000.csv', 'rb', 
            profile_name='myaws'))
    
comments1000_stat =\
    comments1000.groupby(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])\
    .count().reset_index()
comments1000_stat.head()
toxic severe_toxic obscene threat insult identity_hate id comment_text
0 0 0 0 0 0 0 894 894
1 0 0 0 0 1 0 4 4
2 0 0 0 1 0 0 1 1
3 0 0 1 0 0 0 3 3
4 0 0 1 0 1 0 1 1

Passing session

pd.read_csv(smart_open(
    's3://barteks-toxic-comments/sample/train_sample100.csv', 'rb', 
        s3_session=session)
).head()
id comment_text toxic severe_toxic obscene threat insult identity_hate
0 2e9c4b5d271ed9e2 From McCrillis Nsiah=\n\nI'm welcome again aft... 0 0 0 0 0 0
1 717f6930af943c80 "\n\n Invitation \n I'd like to invite you to... 0 0 0 0 0 0
2 6fbf60373657a531 "=Tropical Cyclone George=====\nNamed George, ... 0 0 0 0 0 0
3 9deaefedc0fcb51f No. I agree with BenBuff91 statement. The AFDI... 0 0 0 0 0 0
4 345bedef916b9f9e . It seems the typical paranoid and prejudiced... 0 0 0 0 0 0

It is smart enough to recognize where it should read from:

pd.read_csv(smart_open(
    'data/train_sample100.csv', 'rb', 
    s3_session=session)
).head()
id comment_text toxic severe_toxic obscene threat insult identity_hate
0 2e9c4b5d271ed9e2 From McCrillis Nsiah=\n\nI'm welcome again aft... 0 0 0 0 0 0
1 717f6930af943c80 "\n\n Invitation \n I'd like to invite you to... 0 0 0 0 0 0
2 6fbf60373657a531 "=Tropical Cyclone George=====\nNamed George, ... 0 0 0 0 0 0
3 9deaefedc0fcb51f No. I agree with BenBuff91 statement. The AFDI... 0 0 0 0 0 0
4 345bedef916b9f9e . It seems the typical paranoid and prejudiced... 0 0 0 0 0 0

Writing

with smart_open('s3://barteks-toxic-comments-stats/sample/train_sample1000stat123.csv', 'w', 
               profile_name='myaws') as fout:
    comments1000_stat.to_csv(fout, index=False)
import pickle
class Model:

    def __init__(self):
        self.attr = 123
        
model = Model()

with smart_open("s3://barteks-toxic-comments-stats/models/model.pickle", 'wb', 
               profile_name='myaws') as f:
    pickle.dump(model, f, pickle.HIGHEST_PROTOCOL)
    
list(stat_bucket.objects.all())
[s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat.csv.gzip'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat123.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample1000stat2.csv.gzip'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample100stat.csv'),
 s3.ObjectSummary(bucket_name='barteks-toxic-comments-stats', key='sample/train_sample100stat_copy.csv')]
with smart_open("s3://barteks-toxic-comments-stats/models/model.pickle", 'rb', 
               profile_name='myaws') as f:
    model = pickle.load(f)
print(model.attr)
123
References

  • https://github.com/boto/boto3
  • https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
  • https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

Last update: 2018-11-03