
How to read data (images) faster from AWS S3 buckets?

I wrote the following code to load images from my S3 bucket, do some preliminary preprocessing, and read them into a numpy array:

from scipy.misc import imresize
import numpy as np
import boto3
import matplotlib.image as mpimg
from sagemaker import get_execution_role  # running inside a SageMaker notebook

temp = []
s3 = boto3.resource('s3', region_name='ap-northeast-2')  # This is the nearest AWS region to my location

role = get_execution_role()
bucket = s3.Bucket('my-bucket')

for img_name in X:  # X holds the image file names
    obj = bucket.Object('ImageFolder/' + img_name)
    obj.download_file(img_name)       # write the object to local disk
    img = mpimg.imread(img_name)      # read it back as an array
    img = imresize(img, (32, 32))     # resize to 32x32
    img = img.astype('float32')
    temp.append(img)

X = np.stack(temp)

But it is taking forever. There are about 20,000 images, and it took about 3 hours just to load them into temp! At the time of posting this question, it was still in the process of stacking temp into the numpy array X, which I suspect might take another 1-2 hours. That means the whole process takes around 5 hours, while the same task took less than a minute on my local system (a run-of-the-mill dual-core 2.2 GHz CPU, no GPU). So, how do I make it faster? And is it possible to make it as fast as on my local system?

I think you can use VPC Endpoints for S3: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html

With a gateway endpoint, traffic between your VPC and S3 does not leave the Amazon network, so you access the S3 bucket over AWS's internal network rather than the public internet.

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html
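For reference, a gateway endpoint for S3 can be created with a single CLI call; the VPC and route-table IDs below are placeholders you would replace with your own, and the service name follows the `com.amazonaws.<region>.s3` pattern for the region used in the question:

```shell
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.ap-northeast-2.s3 \
    --route-table-ids rtb-0123456789abcdef0
```

After the endpoint is associated with the route table used by your instance's subnet, existing boto3 code reaches S3 through it without any changes.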
