MongoDB Aggregation seems very slow

I have a mongodb instance running with the following stats:

{
    "db" : "s",
    "collections" : 4,
    "objects" : 1.23932e+008,
    "avgObjSize" : 239.9999891553412400,
    "dataSize" : 29743673136.0000000000000000,
    "storageSize" : 32916655936.0000000000000000,
    "numExtents" : 39,
    "indexes" : 3,
    "indexSize" : 7737839984.0000000000000000,
    "fileSize" : 45009076224.0000000000000000,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "extentFreeList" : {
        "num" : 0,
        "totalSize" : 0
    },
    "ok" : 1.0000000000000000
}

I'm trying to run the following query:

db.getCollection('tick_data').aggregate([
    { $group: {
        _id: "$ccy",
        min: { $first: "$date_time" },
        max: { $last: "$date_time" }
    } }
])

And I have the following index set-up in the collection:

{
    "ccy" : 1,
    "date_time" : 1
}

The query takes 510 seconds to run, which feels extremely slow even allowing for the fairly large collection (~120 million documents). Is there a simple way to make this faster?

Every document has the structure:

{
    "_id" : ObjectId("56095bd7b2fc3e36d8d6ed52"),
    "bid_volume" : "6.00",
    "date_time" : ISODate("2007-01-01T00:00:07.904Z"),
    "ccy" : "USDNOK",
    "bid" : 6.2271700000000001,
    "ask_volume" : "6.00",
    "ask" : 6.2357699999999996
}

Results of explain:

{
    "stages" : [ 
        {
            "$cursor" : {
                "query" : {},
                "fields" : {
                    "ccy" : 1,
                    "date_time" : 1,
                    "_id" : 0
                },
                "plan" : {
                    "cursor" : "BasicCursor",
                    "isMultiKey" : false,
                    "scanAndOrder" : false,
                    "allPlans" : [ 
                        {
                            "cursor" : "BasicCursor",
                            "isMultiKey" : false,
                            "scanAndOrder" : false
                        }
                    ]
                }
            }
        }, 
        {
            "$group" : {
                "_id" : "$ccy",
                "min" : {
                    "$first" : "$date_time"
                },
                "max" : {
                    "$last" : "$date_time"
                }
            }
        }
    ],
    "ok" : 1.0000000000000000
}

Thanks

As already mentioned by @Blakes Seven, $group cannot use indexes. See this topic.

Thus, your query is already optimal. A possible way to optimise this use case is to pre-calculate the data and persist it in a side collection.

You could try this data structure:

{
  "_id" : ObjectId("560a5139b56a71ea60890201"),
  "ccy" : "USDNOK",
  "date_time_first" : ISODate("2007-01-01T00:00:07.904Z"),
  "date_time_last" : ISODate("2007-09-09T00:00:07.904Z")
}

Querying this collection can be done in milliseconds instead of 500+ seconds, and you can benefit from indexes.
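To make those lookups point reads, the side collection only needs an equality index on `ccy`. A minimal sketch, assuming the side collection is named `ccy_ranges` (a hypothetical name, not from the original post):

```javascript
// Unique index spec for the side collection (assumed name "ccy_ranges"):
// one document per currency pair, looked up by equality on ccy.
const rangeIndexSpec = { key: { ccy: 1 }, unique: true };

// In the mongo shell:
//   db.ccy_ranges.createIndex(rangeIndexSpec.key, { unique: rangeIndexSpec.unique });
//   db.ccy_ranges.find({ ccy: "USDNOK" })   // millisecond point read
```

The unique constraint also guards against accidentally storing two range documents for the same currency.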

Then, of course, each time you add, update, or delete a document in the main collection, you would need to update the side collection.
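For inserts, the live update can be a single upsert per tick using the `$min`/`$max` update operators, which only write when the new value actually extends the range. A sketch, again assuming a side collection named `ccy_ranges`:

```javascript
// Build the update document that widens the stored range for one tick.
// $min only writes if the new date_time is earlier than date_time_first;
// $max only writes if it is later than date_time_last.
function rangeUpdateFor(dateTime) {
    return {
        $min: { date_time_first: dateTime },
        $max: { date_time_last: dateTime }
    };
}

// Shell usage, alongside the insert into the main collection:
//   db.ccy_ranges.updateOne({ ccy: doc.ccy }, rangeUpdateFor(doc.date_time), { upsert: true });
```

Deletes are harder, since removing the current min or max requires re-scanning that currency; that is one reason the batch approach below can be simpler.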

Depending on how "fresh" you need the data to be, you could also skip this live update process and regenerate the entire side collection once a day with a batch job, keeping in mind that the data may then be up to a day stale.
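The nightly batch could be the original aggregation with a `$sort` in front (so `$first`/`$last` pick the true earliest/latest `date_time`) and a `$out` at the end to replace the side collection. A sketch, with `ccy_ranges` as an assumed collection name:

```javascript
// Nightly rebuild of the side collection from scratch.
const rebuildPipeline = [
    // Sort makes $first/$last deterministic: earliest and latest per ccy.
    { $sort: { ccy: 1, date_time: 1 } },
    { $group: {
        _id: "$ccy",
        date_time_first: { $first: "$date_time" },
        date_time_last: { $last: "$date_time" }
    } },
    // $out atomically replaces the old side collection with the new results.
    { $out: "ccy_ranges" }
];

// Shell usage (allowDiskUse lets the sort spill to disk on a large collection):
//   db.tick_data.aggregate(rebuildPipeline, { allowDiskUse: true });
```

The rebuild itself is still a full scan, but it runs off-peak once a day instead of on every read.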

Another problem you could fix: your server almost certainly needs more RAM and CPU. Your working set probably doesn't fit in RAM, especially for this kind of aggregation.

Also, you can probably make good use of an SSD, and I would strongly recommend using a 3-node replica set instead of a single instance for production.

In the end I wrote a function which takes 0.002 seconds to run.

function() {
    var results = {};
    // One pair of queries per currency: each find() below is covered by
    // the { ccy: 1, date_time: 1 } index, so sort().limit(1) reads a
    // single index entry instead of scanning documents.
    db.tick_data.distinct("ccy").forEach(function(ccy) {
        var min = db.tick_data
            .find({ ccy: ccy }, { date_time: 1, _id: 0 })
            .sort({ date_time: 1 }).limit(1)
            .toArray()[0];
        var max = db.tick_data
            .find({ ccy: ccy }, { date_time: 1, _id: 0 })
            .sort({ date_time: -1 }).limit(1)
            .toArray()[0];
        results[ccy] = {
            min_date_time: min ? min.date_time : null,
            max_date_time: max ? max.date_time : null
        };
    });
    return results;
}
