How to create a dynamic number of (empty) buckets with MongoDB's $bucketAuto aggregation stage?

I store metadata about files in a MongoDB database. One property is the file size in bytes, which I use to build a histogram of file sizes. An example document looks like this:

{
    "_id" : ObjectId("5c52366eeb3cae00c3896b89"),
    "doc_uuid" : "bfa2734a-a262-4b14-a03f-45108ae59fde",
    "files" : [
        {
            "uuid" : "7eca2b9d-61a6-4993-99d1-b23fa0a27197",
            "filesize" : 1391908,
            ...
        },
        {
            "uuid" : "c1277835-ce41-4057-a1ae-d67cc0aa7552",
            "filesize" : 4977756,
            ...
        },
    ]
}

I want to create buckets for filesizes of 2^n bytes. For example:

{"_id" : { "min": 0, "max": 1}, "count": 12},
{"_id" : { "min": 1, "max": 2}, "count": 1},
{"_id" : { "min": 2, "max": 4}, "count": 0},
{"_id" : { "min": 4, "max": 8}, "count": 145},

To achieve this, I currently use an aggregation pipeline that looks like this:

db.repositories.aggregate([
  { $match: { doc_uuid: { $in: ["bfa2734a-a262-4b14-a03f-45108ae59fde"] } } },
  { $unwind: "$files" },
  { $bucketAuto: {
      groupBy: "$files.filesize",
      buckets: 16,
      granularity: "POWERSOF2"
  } }
])

which works fine. This is an example of some real data I have:

{ "_id" : { "min" : 8192, "max" : 16384 }, "count" : 16 }
{ "_id" : { "min" : 16384, "max" : 2097152 }, "count" : 1 }
{ "_id" : { "min" : 2097152, "max" : 8388608 }, "count" : 1 }

There are two questions I have about this:

  1. Because buckets is a required parameter (even when granularity: "POWERSOF2" is set), I do not know what value to choose, since I do not know the number of buckets in advance. Is it a good strategy to set it to a very high value (e.g. 1024, because I am unlikely to ever encounter a file of size >= 2^1024 bytes), or is there a way to determine the number of buckets dynamically?
  2. In my real data example, only buckets that contain at least one document appear in the result. Is it possible to include empty buckets as well, so that for instance {"_id" : {"min": 4096, "max": 8192}, "count": 0} is part of the result set?

And a side question: how does MongoDB handle values that are exactly 2^n, e.g. 1024? Do they appear in two buckets (in this case in {"min": 512, "max": 1024} and in {"min": 1024, "max": 2048})? If so, is it possible to create disjoint buckets?

Your first question suggests that you don't actually want $bucketAuto but plain $bucket. The whole point of $bucketAuto is that it determines the bucket boundaries automatically, based on a requested number of buckets. In your case you already know what the bucket boundaries should be, and you would like to leave the number of buckets unspecified.

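If you go with this option, it also addresses your second question: with fixed bucket boundaries every bucket is known in advance, including the ones that end up empty. Note that $bucket itself only returns buckets that contain at least one document, so the empty ones have to be filled in with a count of 0 on the client side.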
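As a rough sketch of that approach (not a definitive implementation), the following builds power-of-two boundaries, puts them into a $bucket stage, and fills in the missing buckets afterwards. The helper names (maxExponentFor, powerOfTwoBoundaries, buildPipeline, fillEmptyBuckets) and the "other" default bucket are made up for illustration; the collection and field names are taken from the question. Since each $bucket boundary is an inclusive lower bound and an exclusive upper bound, a value of exactly 1024 lands only in the 1024 to 2048 bucket, so the buckets are disjoint.

```javascript
// Hypothetical helper: smallest exponent e such that 2^e > maxSize.
// maxSize would come from a separate query (e.g. a $group with $max).
function maxExponentFor(maxSize) {
  let e = 0;
  while (Math.pow(2, e) <= maxSize) {
    e++;
  }
  return e;
}

// Hypothetical helper: boundaries 0, 1, 2, 4, ..., 2^maxExponent.
function powerOfTwoBoundaries(maxExponent) {
  const boundaries = [0];
  for (let n = 0; n <= maxExponent; n++) {
    boundaries.push(Math.pow(2, n));
  }
  return boundaries;
}

// Pipeline from the question, with $bucketAuto swapped for $bucket.
function buildPipeline(docUuids, boundaries) {
  return [
    { $match: { doc_uuid: { $in: docUuids } } },
    { $unwind: "$files" },
    { $bucket: {
        groupBy: "$files.filesize",
        boundaries: boundaries,      // each bucket is [lower, upper)
        default: "other",            // assumed catch-all for out-of-range values
        output: { count: { $sum: 1 } }
    } }
  ];
}

// $bucket reports each non-empty bucket with _id set to its lower boundary.
// This adds the empty buckets and converts _id to the {min, max} shape.
function fillEmptyBuckets(boundaries, results) {
  const counts = new Map();
  results.forEach(function (r) {
    if (typeof r._id === "number") {   // skip the "other" default bucket
      counts.set(r._id, r.count);
    }
  });
  const filled = [];
  for (let i = 0; i + 1 < boundaries.length; i++) {
    filled.push({
      _id: { min: boundaries[i], max: boundaries[i + 1] },
      count: counts.get(boundaries[i]) || 0
    });
  }
  return filled;
}
```

In the mongo shell this might be used as fillEmptyBuckets(b, db.repositories.aggregate(buildPipeline(["bfa2734a-a262-4b14-a03f-45108ae59fde"], b)).toArray()), where b = powerOfTwoBoundaries(maxExponentFor(largestFilesize)) and largestFilesize is whatever maximum you have queried beforehand.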
