Best approach to get large data from MongoDB

I have a database with over 2 million records. Each record contains a URL to an image which I need to download and store in AWS S3.

Rather than downloading and then uploading them one at a time, is there a better approach to deal with this?

I am currently using Python, and therefore pymongo.

import io

import boto3
import requests
from PIL import Image

s3 = boto3.resource('s3')  # assumes AWS credentials are configured

def download_image(url):
    # Fetch the image, re-encode it as JPEG, and upload the file to S3
    name = 'example.jpg'
    response = requests.get(url)
    img = Image.open(io.BytesIO(response.content))
    img.save('temp.jpg', "JPEG")
    s3.meta.client.upload_file('temp.jpg', 'bucket', name)

for item in itemsCursor:
    download_image(item['imageurl'])

The best way to do this would be to use batches and multithreading. I've solved similar problems by adding a field with a datestamp or boolean indicating that a particular item has been processed (or perhaps, in this case, a link to its file ID or URL on AWS), and writing a client script or application that picks one item or a batch of items that need processing and churns through them (a sketch follows below).
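A minimal sketch of that idea, assuming pymongo and boto3 are already configured; the database and collection names, the processed flag, the bucket name, and the worker/batch counts are all hypothetical placeholders:

import io
from concurrent.futures import ThreadPoolExecutor

import boto3
import requests
from pymongo import MongoClient

items = MongoClient().mydb.items  # hypothetical database/collection names
s3 = boto3.resource('s3')

def process_item(item):
    # Download the image, stream it straight to S3, then mark the record done
    response = requests.get(item['imageurl'])
    s3.meta.client.upload_fileobj(
        io.BytesIO(response.content), 'bucket', '%s.jpg' % item['_id'])
    items.update_one({'_id': item['_id']}, {'$set': {'processed': True}})

# Churn through unprocessed records in batches, with several downloads in flight at once
with ThreadPoolExecutor(max_workers=8) as pool:
    while True:
        batch = list(items.find({'processed': {'$ne': True}}).limit(100))
        if not batch:
            break
        list(pool.map(process_item, batch))

Streaming with upload_fileobj also avoids writing a temporary file to disk between the download and the upload.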

Of course, make sure that threads or other computers also running the script don't trip over each other: use a particular value, or even a separate field, to indicate that a thread has claimed a particular record and is in the middle of processing it, as in the sketch below.
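A hedged sketch of that claim step, reusing the hypothetical items collection from above: pymongo's find_one_and_update modifies and returns a document in a single atomic operation, so two workers can never claim the same record:

from pymongo import ReturnDocument

def claim_next():
    # Atomically flip an unclaimed record to 'in_progress'; returns None once
    # every record has been claimed. Because the find-and-modify is a single
    # atomic operation, no other thread or machine can claim the same
    # document at the same time.
    return items.find_one_and_update(
        {'status': {'$exists': False}},
        {'$set': {'status': 'in_progress'}},
        return_document=ReturnDocument.AFTER)

item = claim_next()
while item is not None:
    process_item(item)  # the worker function sketched above
    items.update_one({'_id': item['_id']}, {'$set': {'status': 'done'}})
    item = claim_next()

Storing a datestamp instead of a plain status string would also let you reclaim records whose worker died partway through processing.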
