Improving MongoDB Aggregation Pipeline Performance
I am using the MongoDB aggregation pipeline to generate reports. Here are some quick key facts first:
Machine: 8-Core-CPU, 16 GB RAM
OS: Ubuntu 16.04.1 LTS
MongoDB Version: 3.2.11
Mongo PHP Adapter: 1.6.14
PHP Version: 5.6.30
Amount of documents to aggregate: ~ 10+ million
My aggregation pipeline code, which I write and execute in PHP, looks like this:
// create indexes (ASC and DESC for each aggregation key)
$mongoCollection->createIndex(['foo' => 1]);
$mongoCollection->createIndex(['foo' => -1]);
$mongoCollection->createIndex(['bar' => 1]);
$mongoCollection->createIndex(['bar' => -1]);
// prepare aggregation (1. group, 2. sort)
$aggregationPipeline = [
    [
        '$group' => [
            '_id' => [
                'foo' => '$foo',
                'bar' => '$bar'
            ],
            'count' => [
                '$sum' => 1
            ]
        ]
    ],
    [
        '$sort' => [
            'count' => -1
        ]
    ]
];
// run aggregation
$mongoCollection->aggregate($aggregationPipeline);
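For reference, here is a minimal, self-contained Python sketch of what this pipeline computes: group documents by the (foo, bar) pair, count each group, and sort the groups by count descending. The sample documents are made-up stand-ins for the real collection.

```python
from collections import Counter

# Hypothetical sample documents standing in for the real collection.
docs = [
    {"foo": "a", "bar": "x"},
    {"foo": "a", "bar": "x"},
    {"foo": "b", "bar": "y"},
]

# $group: count documents per (foo, bar) pair.
counts = Counter((d["foo"], d["bar"]) for d in docs)

# $sort: order the groups by count, descending.
result = [
    {"_id": {"foo": foo, "bar": bar}, "count": n}
    for (foo, bar), n in counts.most_common()
]
print(result)
# → [{'_id': {'foo': 'a', 'bar': 'x'}, 'count': 2}, {'_id': {'foo': 'b', 'bar': 'y'}, 'count': 1}]
```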
The problem is: the aggregation is not fast enough! Depending on how many fields I aggregate (there are only 2 in my example), the process takes about 90 seconds, often longer.
My goal is: improving the performance of the aggregations!
My questions concern:
PHP
Indexes
CPU cores
Sharding
Thank you very much in advance for any comments, suggestions, criticism, and questions!
PHP: I have no clue, but I imagine it plays a very minor role here. The cardinality of foo and bar, and how many documents/bytes the aggregation returns, may also have an impact.
Indexes: First, there is no sense in having both ascending and descending indexes for a single field. Second, indexes can be used for a $match stage in the aggregation pipeline, but are useless when it comes to $group operations. So your notion is right: indexes can't help you here. You are doing a full scan.
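If you can narrow the input set before grouping, a $match stage placed first can use one of your indexes to shrink the number of documents that must be scanned. A sketch of such a pipeline, expressed here as a plain Python data structure; the `created_at` date filter is a hypothetical example field, not something from the original question:

```python
# Placing $match first lets MongoDB use an index on the matched field,
# reducing the documents fed into the (unindexable) $group stage.
# "created_at" is a hypothetical field used only for illustration.
pipeline = [
    {"$match": {"created_at": {"$gte": "2017-01-01"}}},  # index-eligible filter
    {"$group": {
        "_id": {"foo": "$foo", "bar": "$bar"},
        "count": {"$sum": 1},
    }},
    {"$sort": {"count": -1}},
]
```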
CPU cores: You can't run an aggregation operation in parallel. You can technically achieve it by controlling the aggregation from outside, perhaps by breaking it into sub-tasks. But since you are doing a full scan, again, that's no good in your case.
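The "break it into sub-tasks" idea could look like this minimal Python sketch: each sub-task aggregates its own slice of the data, and the partial counts are merged at the end. The chunking and sample data are illustrative assumptions; in practice each chunk would be a separate query against a disjoint range of the collection.

```python
from collections import Counter

def count_chunk(docs):
    """Partial aggregation over one slice of the collection."""
    return Counter((d["foo"], d["bar"]) for d in docs)

# Illustrative stand-in data, pre-split into sub-tasks.
chunks = [
    [{"foo": "a", "bar": "x"}, {"foo": "b", "bar": "y"}],
    [{"foo": "a", "bar": "x"}],
]

# Run each sub-task, then merge the partial results
# (map() could be swapped for a process pool's executor.map).
total = Counter()
for partial in map(count_chunk, chunks):
    total += partial

print(total.most_common(1))
# → [(('a', 'x'), 2)]
```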
Sharding: There's no point in having multiple shards on the same machine, competing over the same hardware. You add shards by utilizing more hardware.
You are very limited in resources here: 16 GB of RAM for 10+ million documents. This is probably not enough, especially if your documents are not tiny and you have to go to disk in order to process them. I would check the IO utilization during the aggregation, and how the WiredTiger cache is behaving (assuming you are using WiredTiger).
In summary, it's probably your limited resources. The client/driver probably has little impact on the slowness. Start by running explain() on your aggregation while observing how your RAM and disk are behaving.