
EC2 querying large data from MongoDB using python is failing

Currently I have MongoDB set up on an EC2 instance running Amazon Linux. It has around 1M documents.

On the same EC2 instance, I used pymongo's db.collection.find({}, {'attribute_1': 1}) to query attribute_1 across all documents.

The problem is that after iterating over and retrieving around 200,000 documents, my Python code just stops working.

It does not show any error (I did use try/except). The MongoDB log does not show any specific error either.

I strongly suspect the EC2 network bandwidth; however, I tried splitting the documents into batches of 100,000 each, and it still does not work. It just breaks at around 200,000 documents. The code is as below:

        from math import ceil

        count = db.collection.count()
        page = int(ceil(count / 100000.0))  # number of 100,000-document batches
        result = []
        i = 0
        for p in range(0, page):
            # cursor slicing translates to skip/limit on the server
            temp = db.collection.find({}, {'attribute_1': 1})[p*100000:(p+1)*100000]
            for t in temp:
                result.append(t['attribute_1'])
                i = i + 1
                print i  # Python 2

I checked the EC2 logs as well and found nothing unusual. The instance continued to work normally after the failure (I could still use the command line: cd, ls, etc.). My EC2 instance is a c3.2xlarge. I have been stuck on this for a few days; any help is appreciated. Thanks in advance.

Update: After searching the system log, I found these entries:

Apr 22 10:12:53 ip-xxx kernel: [ 8774.975653] Out of memory: Kill process 3709 (python) score 509 or sacrifice child
Apr 22 10:12:53 ip-xxx kernel: [ 8774.978941] Killed process 3709 (python) total-vm:8697496kB, anon-rss:8078912kB, file-rss:48kB

My EC2 instance already has 15 GB of RAM. attribute_1 is a Python list of words, and each one contains quite a large number of elements (words). Is there any way for me to fix this problem?

You appear to be creating a very large list, result, and it has exceeded the available memory on the instance. Generally this indicates that you need to re-design part of your system so that Python only processes the data you really need. A few options:

  • pymongo's find returns a cursor, so maybe you don't actually need to build the list at all
  • Process information about the data as it is inserted and store the results in another collection
  • Use queries and aggregation to have the database return only what you require, in the format you need
  • Split the processing across multiple machines
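Expanding on the first option, here is a minimal sketch; the database and collection names are hypothetical, and it assumes all you need is a running total rather than the raw lists. It consumes the cursor one driver batch at a time instead of appending everything to a list:

```python
def sum_word_counts(docs):
    """Consume any iterable of documents, keeping only a running total
    instead of materialising every attribute_1 list in memory."""
    total = 0
    for doc in docs:
        total += len(doc.get('attribute_1', []))
    return total

def stream_from_mongo():
    """Example wiring against a live mongod; db/collection names are
    placeholders. batch_size caps how many documents the driver buffers."""
    from pymongo import MongoClient
    coll = MongoClient()['mydb']['mycollection']
    return sum_word_counts(coll.find({}, {'attribute_1': 1}, batch_size=1000))
```

Because sum_word_counts accepts any iterable, the memory footprint stays bounded by the driver's batch size no matter how many documents the collection holds.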

There are other approaches, but an error like this should lead you to ask yourself: "Do I need all of this data in a Python list?"
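To illustrate the aggregation option: if, say, all you need is the total number of words, the sum can be computed server-side so only one small summary document crosses the wire. This is a sketch with hypothetical names; $size fails on documents missing the field, hence the $ifNull guard:

```python
def word_count_pipeline(field='attribute_1'):
    """Build an aggregation pipeline that sums list lengths on the server."""
    return [
        {'$project': {'_id': 0, 'n': {'$size': {'$ifNull': ['$' + field, []]}}}},
        {'$group': {'_id': None, 'total_words': {'$sum': '$n'}}},
    ]

def total_words(collection):
    """Run the pipeline against a live pymongo collection (hypothetical)."""
    docs = list(collection.aggregate(word_count_pipeline(), allowDiskUse=True))
    return docs[0]['total_words'] if docs else 0
```

allowDiskUse=True lets the server spill large aggregation stages to disk instead of failing on its in-memory limit.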
