
Improving MongoDB Aggregation Pipeline Performance

I am using the MongoDB aggregation pipeline to generate reports. Here is some quick key info first:

Machine: 8-Core-CPU, 16 GB RAM
OS: Ubuntu 16.04.1 LTS

MongoDB Version: 3.2.11
Mongo PHP Adapter: 1.6.14
PHP Version: 5.6.30

Amount of documents to aggregate: ~ 10+ million

My aggregation pipeline code, which I write and execute in PHP, looks like this:

// create indexes (ASC and DESC for each aggregation key)
// note: MongoCollection::createIndex() expects an array of keys
$mongoCollection->createIndex(['foo' => 1]);
$mongoCollection->createIndex(['foo' => -1]);
$mongoCollection->createIndex(['bar' => 1]);
$mongoCollection->createIndex(['bar' => -1]);

// prepare aggregation (1. group, 2. sort)
$aggregationPipeline = [
    [
        '$group' => [
            // the output field names are the keys; the '$'-prefixed
            // expressions referencing document fields are the values
            '_id' => [
                'foo' => '$foo',
                'bar' => '$bar'
            ],
            'count' => [
                '$sum' => 1
            ]
        ]
    ],
    [
        '$sort' => [
            'count' => -1
        ]
    ]
];


// run aggregation
$mongoCollection->aggregate($aggregationPipeline);

The problem is: the aggregation is not fast enough! Depending on how many fields I aggregate on (in my example there are only two), the process takes about 90 seconds, often longer.

My goal is: improving the performance of the aggregations!

My questions:

  1. PHP

    • As stated, I use PHP to control and run my aggregations.
    • => Is it bad practice to run aggregations from PHP, using the PHP Mongo classes and methods?
    • => Would upgrading to PHP 7.x improve Mongo aggregation performance?
  2. Indexes

    • I have the subjective feeling that adding DESC and ASC indexes (see example code) is NOT improving performance. With or without the indexes, the runtime seems to be almost identical.
    • => Is it possible that adding indexes does NOT improve performance significantly?
  3. CPU cores

    • While the aggregation runs, I can observe that only ONE core of the CPU is being used.
    • => How can I make the aggregation pipeline use ALL CPU cores simultaneously?
  4. Sharding

    • I have read about MongoDB sharding, which might be another way to improve Mongo's overall performance.
    • => Is it possible, and does it make sense, to set up sharding on a SINGLE machine?

Thank you very much in advance for any comments, suggestions, criticism and questions!

  1. PHP

I have no clue, but I imagine it plays a very minor role here. The cardinality of foo and bar - that is, how many documents / bytes the aggregation returns - may also have an impact.

  2. Indexes

First, there is no sense in having both an ascending and a descending index on a single field. Second, indexes can be used by a $match aggregation stage, but are useless when it comes to $group operations. So your notion is right: the indexes can't help you here. You are doing a full collection scan.
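For illustration, if the report only needs a subset of the documents, putting a $match stage first in the pipeline lets the server use an index and reduces the documents fed to $group. This is a hedged sketch using the question's legacy MongoCollection API; the `createdAt` field and date cutoff are hypothetical, not from the original post:

```php
// Hypothetical: index-backed $match placed FIRST, before $group
$mongoCollection->createIndex(['createdAt' => 1]); // assumed date field

$aggregationPipeline = [
    [
        '$match' => [ // can use the index, unlike $group
            'createdAt' => ['$gte' => new MongoDate(strtotime('2017-01-01'))]
        ]
    ],
    [
        '$group' => [
            '_id'   => ['foo' => '$foo', 'bar' => '$bar'],
            'count' => ['$sum' => 1]
        ]
    ],
    ['$sort' => ['count' => -1]]
];

$result = $mongoCollection->aggregate($aggregationPipeline);
```

If the report really must cover all 10+ million documents, a $match stage obviously cannot shrink the scan, and this trick does not apply.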

  3. CPU cores

You can't run a single aggregation operation in parallel. You can technically achieve it by controlling the aggregation from outside, e.g. by breaking it into sub-tasks. But since you are doing a full scan anyway, it won't help in your case.
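A rough sketch of that "sub-task" idea: run one aggregation per $match range (each could live in its own PHP worker process) and merge the partial counts client-side. The range boundaries on `foo` below are hypothetical, and the `'result'` key is the legacy driver's response format:

```php
// Hypothetical sketch: split on an indexed field, aggregate each range,
// then merge the partial counts in PHP.
$ranges = [['a', 'm'], ['m', 'z']]; // assumed boundaries on 'foo'
$totals = [];

foreach ($ranges as $range) {
    list($from, $to) = $range;
    $result = $mongoCollection->aggregate([
        ['$match' => ['foo' => ['$gte' => $from, '$lt' => $to]]],
        ['$group' => [
            '_id'   => ['foo' => '$foo', 'bar' => '$bar'],
            'count' => ['$sum' => 1]
        ]]
    ]);
    foreach ($result['result'] as $doc) { // legacy driver returns 'result'
        $key = json_encode($doc['_id']);
        $totals[$key] = (isset($totals[$key]) ? $totals[$key] : 0) + $doc['count'];
    }
}
arsort($totals); // final sort by count, done client-side
```

Each range still scans its share of the collection, so the total I/O is unchanged; this only helps if the CPU, not the disk, is the bottleneck.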

  4. Sharding

There's no point in having multiple shards on the same machine competing for the same hardware. You add shards by adding more hardware.

You are very limited in resources here: 16 GB RAM and 10+ million documents. That is probably not enough, especially if your documents are not tiny and the server has to go to disk in order to process them. I would check the IO utilization during the aggregation, and how the WiredTiger cache is behaving (assuming you are using WiredTiger).

In summary, it's probably your limited resources. The client / driver likely has little impact on the slowness. Start by running explain() on your aggregation while observing how your RAM and disk are behaving.
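As a sketch of that last step: the aggregate command accepts an explain option (MongoDB 2.6+), which the legacy driver should pass through as an aggregate() option; if your driver version does not forward it, run the equivalent command in the mongo shell instead:

```php
// Ask the server for the execution plan instead of the results.
$plan = $mongoCollection->aggregate($aggregationPipeline, ['explain' => true]);
print_r($plan); // look for COLLSCAN vs IXSCAN in the reported stages
```

A COLLSCAN in the plan confirms the full collection scan discussed above.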

