
How to read data (images) faster from AWS S3 buckets?

I wrote the following code to load images from my S3 bucket, do some preliminary preprocessing, and read them into a numpy array:

from scipy.misc import imresize
import numpy as np
import boto3
import matplotlib.image as mpimg
from sagemaker import get_execution_role  # running inside a SageMaker notebook

temp = []
s3 = boto3.resource('s3', region_name='ap-northeast-2')  # This is the nearest AWS region to my location

role = get_execution_role()
bucket = s3.Bucket('my-bucket')

for img_name in X:  # X holds the image file names
    obj = bucket.Object('ImageFolder/' + img_name)
    obj.download_file(img_name)       # write the object to local disk
    img = mpimg.imread(img_name)      # read it back as an array
    img = imresize(img, (32, 32))     # resize to 32x32
    img = img.astype('float32')
    temp.append(img)

X = np.stack(temp)

But it is taking forever. There are about 20,000 images, and it took about 3 hours just to load them into temp! At the time of posting this question, it was still in the process of stacking temp into the numpy array X, which I suspect might take another 1-2 hours. That means the whole process takes around 5 hours, while the same task took less than a minute on my local system (a run-of-the-mill dual-core 2.2 GHz CPU, no GPU). So, how do I make it faster? And is it possible to make it as fast as on my local system?

I think you can use VPC Endpoints for S3: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html

With a gateway endpoint, traffic between your VPC and S3 does not leave the Amazon network, so you access the S3 bucket over AWS's internal network rather than the public internet.

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html
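For reference, a gateway endpoint for S3 can be created with a single CLI call; the VPC and route-table IDs below are placeholders you would replace with your own, and the service name follows the `com.amazonaws.<region>.s3` pattern for the region used in the question:

```shell
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.ap-northeast-2.s3 \
    --route-table-ids rtb-0123456789abcdef0
```

After the endpoint is associated with the route table used by your instance's subnet, existing boto3 code reaches S3 through it without any changes.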
