
Improve mongodb full-scan query performance: replication or sharding?

We are currently facing a situation where we can't avoid doing a collection full scan. We have already optimized the query and the data structure, but we would like to go further and take full advantage of sharding and replication.

Configuration

- mongodb version 3.2
- mongo-java-driver 3.2
- storageEngine: wiredTiger
- compression level: snappy
- database size: 6GB

Document structure:

individuals collection

{
    "_id": 1, 
    "name": "randomName1", 
    "info": {...}
}, 
{
    "_id": 2, 
    "name": "randomName2", 
    "info": {...}
},
[...]
{
    "_id": 15000, 
    "name": "randomName15000", 
    "info": {...}
}

values collection

{
    "_id": ObjectId("5804d7a41da35c2e06467911"),
    "pos": NumberLong("2090845886852"),
    "val": 
        [0, 0, 1, 0, 1, ... 0, 1]
},
{
    "_id": ObjectId("5804d7a41da35c2e06467912"),
    "pos": NumberLong("2090845886857"),
    "val": 
        [1, 1, 1, 0, 1, ... 0, 0]
}

The "val" array contains one element per individual (so the length of the array is up to 15000). The id of an individual is its corresponding index in the "val" array.
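To make the scoring concrete, here is a minimal sketch of what one document contributes: sum the bits of "val" at the indices of the selected individuals. The function name `score` and the sample document are ours, not from the original code.

```python
def score(val, individual_ids):
    # Each individual's id is its index into the "val" array,
    # so the document's score for a selection is the sum of those bits.
    return sum(val[i] for i in individual_ids)

doc = {"pos": 2090845886852, "val": [0, 0, 1, 0, 1, 0, 1]}
print(score(doc["val"], [0, 2, 4]))  # 0 + 1 + 1 = 2
```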

Query

The query is to find documents from the values collection where the sum of val[individual._id], over a given list of individuals, is above a specific threshold. We can't simply pre-compute the sum of the array, since the list of individuals changes at runtime (we may want the result for only the first 2000 individuals, for example). This query uses the aggregation framework.
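One way such a pipeline could be built (a hypothetical sketch in Python; `build_pipeline` is our name, and the exact pipeline in the original question may differ) is to use `$arrayElemAt` (available since MongoDB 3.2) to pick out each selected index, `$sum` to add them, and `$match` to apply the threshold:

```python
def build_pipeline(individual_ids, threshold):
    # $arrayElemAt picks the bit for each selected individual out of "val";
    # $sum adds them; $match keeps documents at or above the threshold.
    return [
        {"$project": {
            "pos": 1,
            "total": {"$sum": [
                {"$arrayElemAt": ["$val", i]} for i in individual_ids
            ]},
        }},
        {"$match": {"total": {"$gte": threshold}}},
    ]

print(build_pipeline([0, 2], 1)[1])  # {'$match': {'total': {'$gte': 1}}}
```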

What we're currently doing:

We split the query into 100-500 subqueries and run them five at a time in parallel.

The first subquery covers documents where pos > 0 and pos < 50000, the second documents where pos > 50000 and pos < 100000, etc.
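The range splitting can be sketched as below (our own illustrative helper, not the original code). Note we use half-open `$gte`/`$lt` ranges so boundary values such as pos = 50000 are not silently dropped, which strict `>`/`<` bounds would do:

```python
def split_ranges(max_pos, n_subqueries):
    """Split [0, max_pos) into n_subqueries half-open pos ranges."""
    step = max_pos // n_subqueries
    bounds = [i * step for i in range(n_subqueries)] + [max_pos]
    # Consecutive bound pairs become MongoDB range filters on "pos".
    return [{"pos": {"$gte": lo, "$lt": hi}}
            for lo, hi in zip(bounds, bounds[1:])]

print(split_ranges(100000, 2))
# [{'pos': {'$gte': 0, '$lt': 50000}}, {'pos': {'$gte': 50000, '$lt': 100000}}]
```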

We would like to run more subqueries at the same time, but we see a performance loss when running more than 5 on a single mongod instance.
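The driver-side dispatch described above can be sketched with a bounded thread pool (a stand-in in Python; `run_one` is a placeholder for executing one range filter against MongoDB, and `max_workers=5` mirrors the concurrency limit we observed):

```python
from concurrent.futures import ThreadPoolExecutor

def run_subqueries(filters, run_one, max_workers=5):
    # The pool caps in-flight subqueries at the level the server tolerates;
    # results come back in the same order as the input filters.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, filters))

print(run_subqueries([1, 2, 3], lambda f: f * 2))  # [2, 4, 6]
```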

So the question is: should we go for replication or sharding (or both) in order to run the maximum number of subqueries at the same time? How can we configure mongodb to dispatch subqueries among replicas/shards as effectively as possible?

edit: let's assume that the query is already fully optimized!

Replication is used for data redundancy and high availability, so if you are trying to improve the performance of a query, I think we can rule it out as an option right away.

Sharding may be an option, but I think your next step should be to post the explain output for the query and see if anyone can suggest ways to improve performance. It's possible there is some tuning you missed, or you might see gains by upgrading the current MongoDB server's RAM or CPU.

In short, I would suggest posting your explain output before going to all the effort of sharding.
