Mongo count really slow when there are millions of records

//FAST
db.datasources.find().count()
12036788

//SLOW    
db.datasources.find({nid:19882}).count()
10161684

There is an index on nid.

Is there any way to make the second query faster? (It takes about 8 seconds.)

Count queries, indexed or otherwise, are slow because MongoDB still has to do a full b-tree walk to find the number of documents that match your criteria. The reason is that the MongoDB b-tree structure is not "counted", meaning that a node does not store the number of elements in the node/subtree.
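To illustrate what "counted" means, here is a toy sketch in plain JavaScript. It uses a small balanced binary tree rather than a b-tree, it is not MongoDB's actual index code, and every function name in it is made up for the illustration. Integer keys are assumed. Without subtree sizes, counting the entries equal to a key has to visit every matching entry; with a size stored in each node, the same count needs only two descents from the root.

```javascript
// Build a balanced binary search tree from a sorted array.
// Each node stores `size`, the number of entries in its subtree.
function build(sorted, lo = 0, hi = sorted.length) {
  if (lo >= hi) return null;
  const mid = (lo + hi) >> 1;
  return {
    key: sorted[mid],
    left: build(sorted, lo, mid),
    right: build(sorted, mid + 1, hi),
    size: hi - lo,
  };
}

// "Uncounted" approach: ignore `size` and visit every matching entry.
// Cost grows with the number of matches.
function countByWalking(node, key) {
  if (node === null) return 0;
  let n = 0;
  if (key <= node.key) n += countByWalking(node.left, key);
  if (key === node.key) n += 1;
  if (key >= node.key) n += countByWalking(node.right, key);
  return n;
}

// "Counted" approach: number of entries strictly less than `key`,
// found in one descent by adding up stored subtree sizes.
function rank(node, key) {
  let r = 0;
  while (node !== null) {
    if (key <= node.key) {
      node = node.left;
    } else {
      r += (node.left ? node.left.size : 0) + 1;
      node = node.right;
    }
  }
  return r;
}

// Entries equal to an integer `key` = (entries < key + 1) - (entries < key).
function countWithSizes(root, key) {
  return rank(root, key + 1) - rank(root, key);
}
```

Both functions return the same number; the difference is that countByWalking touches every matching entry, while countWithSizes only touches the nodes on two root-to-leaf paths.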

The issue is reported at https://jira.mongodb.org/browse/SERVER-1752, and there is currently no workaround to improve performance other than manually maintaining a counter for that collection, which obviously comes with a few downsides.
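A manually maintained counter could look roughly like the sketch below. It is only one possible shape: the `counters` collection and the three helper names are hypothetical, and it assumes a shell that provides insertOne, deleteOne, updateOne and findOne. It is written in the synchronous style of the mongo shell; with the Node.js driver each collection call would need an `await`.

```javascript
// Hypothetical helpers. `datasources` and `counters` are collection handles,
// e.g. db.datasources and db.counters (the latter is an assumed collection).

// Insert a document and bump the counter kept for its nid.
function insertAndCount(datasources, counters, doc) {
  datasources.insertOne(doc);
  counters.updateOne(
    { _id: doc.nid },
    { $inc: { count: 1 } },
    { upsert: true } // create the counter on first use
  );
}

// Delete a document by _id; decrement only if something was removed.
function deleteAndCount(datasources, counters, doc) {
  const res = datasources.deleteOne({ _id: doc._id });
  if (res.deletedCount === 1) {
    counters.updateOne({ _id: doc.nid }, { $inc: { count: -1 } });
  }
}

// Read the maintained count instead of running count() on datasources.
function countForNid(counters, nid) {
  const row = counters.findOne({ _id: nid });
  return row ? row.count : 0;
}
```

One of the downsides shows up directly in this sketch: the document write and the counter update are two separate operations, so unless they are wrapped in a transaction the counter can drift if one of them fails, and any code path that bypasses these helpers leaves the counter stale.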

Also note that the db.col.count() version (with no criteria) can take a big shortcut and doesn't actually perform a query, hence its speed. That said, it does not always report the same value as a count query that matches all elements would (in sharded environments with high write throughput, for example, it won't). It's up for debate whether or not that's a bug. I think it is.

Note that in 2.3+ a significant optimization was introduced that should (and does) improve performance of counts on indexed fields. See: https://jira.mongodb.org/browse/SERVER-7745

As @Remon said, count() has to scan all the documents matching the query/filter. It is O(n), where n is the number of documents that match the index, or the number of documents in the collection if the field is not indexed.

In such cases, you typically want to revisit your requirement. Do you really need the precise number 10161684 as the result? If precision is important, you should keep a separate counter for that particular query.

But in most cases, precision is not important. It's one of two cases:

  • You don't care whether it's 10 million or 10.2 million, but the order of magnitude is important, i.e., you care about whether it's 8 million or 10 million.
  • You only care about the precise number if it's small. I.e., you're interested to know whether there are 44 results or 72. But once it goes beyond, say, 1000, you can just tell the user 'More than 1000 objects found'.

In my apps, I found that the second option is what I want. So I put a limit on the count() query as well, so that counting stops when it reaches the limit. Like so:

db.datasources.find({nid: 19882}).limit(1000).count(true)

To the user, I display '1000 or more results found' if the count is 1000; otherwise, I display the exact number.
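The capping and the display rule can be folded into one small helper. This is a sketch, not the code from my apps: `describeCount` is a made-up name, and it assumes a shell version that provides countDocuments, whose `limit` option plays the same role as limit(1000).count(true) on the cursor. It is written in the synchronous style of the mongo shell; with the Node.js driver the call would need an `await`.

```javascript
// Hypothetical helper: count matches for `filter`, stopping at `cap`,
// and turn the result into the message shown to the user.
function describeCount(collection, filter, cap) {
  // `limit` makes the server stop counting once `cap` matches are seen.
  const n = collection.countDocuments(filter, { limit: cap });
  return n >= cap ? `${cap} or more results found` : `${n} results found`;
}
```

In the shell this would be called as describeCount(db.datasources, {nid: 19882}, 1000).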

As for the first option... I haven't thought of a neat solution yet.

For the second query, it has to look through every document. You could index nid to make the count faster.
