
Best approach to get large data from MongoDB

I have a database with over 2 million records. Each record contains a URL to an image which I need to download and store in AWS S3.

Rather than downloading and uploading them one at a time, is there a better approach to deal with this?

I am currently using Python, and therefore pymongo.

import requests
import boto3
from io import BytesIO
from PIL import Image

s3 = boto3.resource('s3')

def download_image(url, name):
    # Fetch the image over HTTP, re-encode it as JPEG, then upload to S3
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img.save('temp.jpg', "JPEG")
    s3.meta.client.upload_file('temp.jpg', 'bucket', name)

# itemsCursor is the pymongo cursor over the 2 million records
for item in itemsCursor:
    download_image(item['imageurl'], 'example.jpg')

The best way to do this is with batches and multithreading. I've solved similar problems by adding a field with a timestamp or boolean flag that marks a particular item as processed (or, in this case, perhaps its file ID or URL on AWS), and then writing a client script or application that picks an item, or a batch of items, that still needs processing and churns through them.
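A minimal sketch of that idea, assuming a pymongo collection named images, a hypothetical bucket name, and an s3_url field used as the "processed" marker (none of these names come from the question):

from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import boto3
import requests
from pymongo import MongoClient

client = MongoClient()
images = client.mydb.images        # hypothetical database/collection names
s3 = boto3.client('s3')
BUCKET = 'my-bucket'               # hypothetical bucket name

def process(doc):
    # Download the image and stream it straight to S3 without a temp file
    resp = requests.get(doc['imageurl'], timeout=30)
    resp.raise_for_status()
    key = 'images/{}.jpg'.format(doc['_id'])
    s3.upload_fileobj(BytesIO(resp.content), BUCKET, key)
    # Record the S3 location so this document is never picked up again
    images.update_one({'_id': doc['_id']}, {'$set': {'s3_url': key}})

while True:
    # Pull the next batch of documents that have no S3 URL yet
    batch = list(images.find({'s3_url': {'$exists': False}}).limit(100))
    if not batch:
        break
    with ThreadPoolExecutor(max_workers=16) as pool:
        pool.map(process, batch)

Streaming the bytes with upload_fileobj avoids writing a temp file to disk; the limit(100) batch size and 16 threads are just starting points to tune against your network and S3 throughput.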

Of course, make sure that threads, or other machines running the same script, don't trip over each other: use a particular value, or even a separate field, to indicate that a thread has claimed a record and is currently processing it.
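One way to do that, assuming a status field is added to each document (hypothetical values 'pending', 'in_progress', 'done'), is MongoDB's atomic find-and-modify, so each worker claims a record before touching it:

from pymongo import ReturnDocument

def claim_next(images):
    # Atomically flip one pending document to 'in_progress' and return it;
    # two workers can never claim the same record.
    return images.find_one_and_update(
        {'status': 'pending'},
        {'$set': {'status': 'in_progress'}},
        return_document=ReturnDocument.AFTER,
    )

doc = claim_next(images)
while doc is not None:
    process(doc)                   # upload as in the sketch above
    images.update_one({'_id': doc['_id']}, {'$set': {'status': 'done'}})
    doc = claim_next(images)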
