
Bulk download of web images

I have about 600k image URLs (in a list) and I would like to achieve the following:

  • Download all of them
  • Generate a thumbnail of specific dimensions
  • Upload them to Amazon S3

I have estimated my images average about 1 MB each, which works out to roughly 600 GB of data transfer for the downloads. I don't believe my laptop and my Internet connection can handle that.

Which way should I go? I'd like preferably to have a solution that minimizes the cost.

I was thinking of a Python script or a JavaScript job, run in parallel if possible to minimize the time needed.

Thanks!
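For the per-image work itself, a minimal sketch in Python, assuming `requests`, Pillow, and `boto3` are installed; the bucket name, thumbnail size, and key scheme below are illustrative placeholders, not anything prescribed by the question:

```python
# Per-image pipeline sketch: download -> thumbnail -> upload to S3.
# BUCKET and THUMB_SIZE are assumed placeholders.
import hashlib
import io
from urllib.parse import urlparse

THUMB_SIZE = (200, 200)   # assumed target dimensions
BUCKET = "my-thumbnails"  # placeholder bucket name

def s3_key_for(url):
    """Derive a stable, collision-resistant S3 key from the source URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    name = urlparse(url).path.rsplit("/", 1)[-1] or "image"
    return f"thumbs/{digest}-{name}"

def process(url):
    """Download one image, thumbnail it in memory, upload it to S3."""
    import requests           # lazy imports: third-party dependencies
    from PIL import Image
    import boto3

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    img = Image.open(io.BytesIO(resp.content))
    img.thumbnail(THUMB_SIZE)            # resizes in place, keeps aspect ratio
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG")
    buf.seek(0)

    boto3.client("s3").upload_fileobj(buf, BUCKET, s3_key_for(url))

def run_all(urls, workers=16):
    """Run the pipeline over many URLs with a thread pool (I/O-bound work)."""
    from concurrent.futures import ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process, urls))
```

Hashing the URL into the key avoids collisions between identically named files from different sites, and threads are enough for parallelism here since the work is network-bound rather than CPU-bound.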

I'd suggest spinning up one or more EC2 instances and running your thumbnail job there. You'll eliminate almost all of the bandwidth costs (transfer from EC2 to S3 in the same region is free), and the transfer speed will certainly be faster within the AWS network.

For 600K files to process, you may want to consider loading each of those 'jobs' into an SQS queue, and then have multiple EC2 instances polling the queue for 'work to do' - this will allow you to spin up as many EC2 instances as you want to run in parallel and distribute the work.
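The worker side of that queue can be as small as a polling loop. A sketch assuming `boto3` is installed and each message body holds one image URL; the queue URL and the `handle` callback (your download/thumbnail/upload step) are placeholders you'd supply:

```python
# SQS worker loop sketch: pull batches of URLs until the queue is empty.
def drain_queue(queue_url, handle, sqs=None):
    """Poll an SQS queue, calling handle(url) for each message body."""
    if sqs is None:
        import boto3                    # lazy import: third-party dependency
        sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,     # SQS allows up to 10 per receive
            WaitTimeSeconds=20,         # long polling cuts empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            return                      # queue drained
        for msg in messages:
            handle(msg["Body"])
            # Delete only after successful processing, so a crashed worker
            # lets the message become visible again for another instance.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```

Because deletion happens only after `handle` succeeds, you can run this same loop on as many instances as you like and SQS's visibility timeout takes care of retrying work from any worker that dies mid-batch.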

However, the work to set up the queue may or may not be worth it, depending on how often you need to do this and how quickly it needs to finish - i.e. if this is a one-time thing and you can wait a week for it to finish, a single instance plugging away may suffice.

