I wrote the following code to load images from my S3 bucket, do some preliminary preprocessing, and read them into a numpy array:
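from scipy.misc import imresize  # note: removed in SciPy 1.3 and later
from scipy.misc import imread
import numpy as np
import boto3
import tempfile
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
from sagemaker import get_execution_role  # assumed source of get_execution_role (SageMaker SDK)

temp = []
s3 = boto3.resource('s3', region_name='ap-northeast-2')  # This is the nearest AWS region to my location
role = get_execution_role()
bucket = s3.Bucket('my-bucket')

# X initially holds the image file names
for img_name in X:
    obj = bucket.Object('ImageFolder/' + img_name)
    obj.download_file(img_name)        # download the image to local disk
    img = mpimg.imread(img_name)       # read it back from disk
    img = imresize(img, (32, 32))
    img = img.astype('float32')
    temp.append(img)

X = np.stack(temp)  # X is overwritten with the stacked image array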
But it is taking forever to do this. There are about 20000 images, and it took about 3 hours to finish loading them into temp! At the time of posting this question, it was still stacking temp into the numpy array X, which I suspect might take another 1 to 2 hours. That means the whole process takes around 5 hours, while it took less than a minute on my local system (a run-of-the-mill dual-core 2.2 GHz CPU, no GPU)! So, how do I make it faster? And is it possible to make it as fast as on my local system?
I think you can use VPC Endpoints for S3:
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html
With an endpoint, traffic between your VPC and S3 does not leave the Amazon network, so you access the S3 bucket over the internal network.
https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html
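For what it's worth, here is a rough sketch of how a gateway endpoint for S3 might be created with boto3. It assumes the notebook or instance actually runs inside the VPC in question. The helper name create_s3_gateway_endpoint and the IDs in the usage comment are placeholders of mine, not anything from the question; only the EC2 client's create_vpc_endpoint call is the real boto3 API.

```python
def create_s3_gateway_endpoint(ec2_client, vpc_id, route_table_ids, region):
    """Create a gateway VPC endpoint for S3 and return its endpoint ID.

    ec2_client is expected to be a boto3 EC2 client, e.g.
    boto3.client('ec2', region_name=region). The helper itself is
    hypothetical; create_vpc_endpoint is the actual boto3 call.
    """
    response = ec2_client.create_vpc_endpoint(
        VpcEndpointType='Gateway',
        VpcId=vpc_id,
        # S3 service names follow the pattern com.amazonaws.<region>.s3
        ServiceName='com.amazonaws.{}.s3'.format(region),
        # Route tables that should send S3 traffic through the endpoint
        RouteTableIds=list(route_table_ids),
    )
    return response['VpcEndpoint']['VpcEndpointId']


# Example usage (placeholder IDs, replace with your own):
#
#   import boto3
#   ec2 = boto3.client('ec2', region_name='ap-northeast-2')
#   endpoint_id = create_s3_gateway_endpoint(
#       ec2, 'vpc-0123456789abcdef0', ['rtb-0123456789abcdef0'],
#       'ap-northeast-2')
```

Once the endpoint exists and the route tables point at it, the download code in the question should not need any changes, since requests to S3 from inside the VPC are routed through the endpoint.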