
MongoDB Aggregation seems very slow

I have a MongoDB instance running with the following stats:

{
    "db" : "s",
    "collections" : 4,
    "objects" : 1.23932e+008,
    "avgObjSize" : 239.9999891553412400,
    "dataSize" : 29743673136.0000000000000000,
    "storageSize" : 32916655936.0000000000000000,
    "numExtents" : 39,
    "indexes" : 3,
    "indexSize" : 7737839984.0000000000000000,
    "fileSize" : 45009076224.0000000000000000,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "extentFreeList" : {
        "num" : 0,
        "totalSize" : 0
    },
    "ok" : 1.0000000000000000
}

I'm trying to run the following query:

db.getCollection('tick_data').aggregate([
    { $group: { _id: "$ccy", min: { $first: "$date_time" }, max: { $last: "$date_time" } } }
])

And I have the following index set up on the collection:

{
    "ccy" : 1,
    "date_time" : 1
}

The query takes 510 seconds to run, which feels extremely slow even though the collection is fairly large (~120 million documents). Is there a simple way for me to make this faster?

Every document has the structure:

{
    "_id" : ObjectId("56095bd7b2fc3e36d8d6ed52"),
    "bid_volume" : "6.00",
    "date_time" : ISODate("2007-01-01T00:00:07.904Z"),
    "ccy" : "USDNOK",
    "bid" : 6.2271700000000001,
    "ask_volume" : "6.00",
    "ask" : 6.2357699999999996
}

Results of explain:

{
    "stages" : [ 
        {
            "$cursor" : {
                "query" : {},
                "fields" : {
                    "ccy" : 1,
                    "date_time" : 1,
                    "_id" : 0
                },
                "plan" : {
                    "cursor" : "BasicCursor",
                    "isMultiKey" : false,
                    "scanAndOrder" : false,
                    "allPlans" : [ 
                        {
                            "cursor" : "BasicCursor",
                            "isMultiKey" : false,
                            "scanAndOrder" : false
                        }
                    ]
                }
            }
        }, 
        {
            "$group" : {
                "_id" : "$ccy",
                "min" : {
                    "$first" : "$date_time"
                },
                "max" : {
                    "$last" : "$date_time"
                }
            }
        }
    ],
    "ok" : 1.0000000000000000
}

Thanks

As mentioned already by @Blakes Seven, $group cannot use indexes. See this topic.

Thus, your query is already optimal. A possible way to optimise this use case is to pre-calculate and persist the data in a side collection.

You could try this data structure:

{
  "_id" : ObjectId("560a5139b56a71ea60890201"),
  "ccy" : "USDNOK",
  "date_time_first" : ISODate("2007-01-01T00:00:07.904Z"),
  "date_time_last" : ISODate("2007-09-09T00:00:07.904Z")
}

Querying this can be done in milliseconds instead of 500+ seconds and you can benefit from indexes.
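
For example, the lookup on the side collection only needs a small single-field index. A minimal sketch, assuming the side collection is named ccy_summary (a placeholder name, not from the original):

// Hypothetical side collection: one summary document per currency pair
db.ccy_summary.ensureIndex({ "ccy": 1 })

// Millisecond lookup of the pre-computed first/last tick times
db.ccy_summary.findOne({ "ccy": "USDNOK" })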

Then of course, each time you add, update or delete a document from the main collection, you would need to update the side collection.
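
A minimal sketch of that live update, assuming MongoDB 2.6+ (for the $min/$max update operators) and the same hypothetical ccy_summary collection; the helper name onNewTick is illustrative only. $min and $max keep the stored bounds correct regardless of the order ticks arrive in, and upsert creates the summary document the first time a currency pair is seen:

// Call for every tick inserted into the main collection
function onNewTick(tick) {
    db.ccy_summary.update(
        { "ccy": tick.ccy },
        {
            "$min": { "date_time_first": tick.date_time }, // keep the earliest
            "$max": { "date_time_last": tick.date_time }   // keep the latest
        },
        { upsert: true }
    );
}

Deletes are the awkward case: removing the oldest or newest tick of a pair would force you to recompute that pair's bounds from the main collection.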

Depending on how "fresh" you need the data to be, you could also skip this live update process and simply regenerate the entire side collection once a day with a batch job, keeping in mind that the data may then be up to a day stale.
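
A sketch of that batch job, again assuming MongoDB 2.6+ (for $out) and the ccy_summary name. Note that it uses the $min/$max accumulators rather than $first/$last, so the result does not depend on input order and no sort is needed:

// Nightly batch: recompute the per-currency date bounds and
// replace the side collection wholesale with the fresh result
db.tick_data.aggregate([
    { "$group": {
        "_id": "$ccy",
        "date_time_first": { "$min": "$date_time" },
        "date_time_last": { "$max": "$date_time" }
    } },
    { "$project": { "_id": 0, "ccy": "$_id", "date_time_first": 1, "date_time_last": 1 } },
    { "$out": "ccy_summary" }
])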

Another problem you could fix: your server definitely needs more RAM and CPU. Your working set probably doesn't fit in RAM, especially with this kind of aggregation.

Also, you can probably make good use of an SSD, and I would STRONGLY recommend using a 3-node replica set instead of a single instance for production.

In the end I wrote a function which takes 0.002 seconds to run.

function() {
    var results = {};

    // One lookup pair per distinct currency; each query below is covered
    // by the { ccy: 1, date_time: 1 } index
    var ccys = db.tick_data.distinct("ccy");

    ccys.forEach(function(ccy) {
        // Earliest tick: ascending sort on date_time, take the first document
        var min = db.tick_data.find({ "ccy": ccy }, { "date_time": 1, "_id": 0 })
                              .sort({ "date_time": 1 }).limit(1).toArray()[0].date_time;

        // Latest tick: descending sort on date_time, take the first document
        var max = db.tick_data.find({ "ccy": ccy }, { "date_time": 1, "_id": 0 })
                              .sort({ "date_time": -1 }).limit(1).toArray()[0].date_time;

        results[ccy] = { "max_date_time": max, "min_date_time": min };
    });

    return results;
}
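
This works because each find().sort().limit(1) on an equality match against ccy becomes a single seek on the { ccy: 1, date_time: 1 } index, so the whole loop reads only a couple of index entries per currency pair rather than scanning ~120 million documents.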
