
How to create a dynamic number of (empty) buckets in MongoDB's $bucketAuto aggregation stage?

I store metadata about files in a MongoDB database. One property is the file size in bytes, which I use to build a histogram of file sizes. An example document looks like this:

{
    "_id" : ObjectId("5c52366eeb3cae00c3896b89"),
    "doc_uuid" : "bfa2734a-a262-4b14-a03f-45108ae59fde",
    "files" : [
        {
            "uuid" : "7eca2b9d-61a6-4993-99d1-b23fa0a27197",
            "filesize" : 1391908,
            ...
        },
        {
            "uuid" : "c1277835-ce41-4057-a1ae-d67cc0aa7552",
            "filesize" : 4977756,
            ...
        },
    ]
}

I want to create buckets whose boundaries are powers of two (2^n bytes). For example:

{"_id" : { "min": 0, "max": 1}, "count": 12},
{"_id" : { "min": 1, "max": 2}, "count": 1},
{"_id" : { "min": 2, "max": 4}, "count": 0},
{"_id" : { "min": 4, "max": 8}, "count": 145},

To achieve this, I currently use an aggregation pipeline that looks like this:

db.repositories.aggregate([
  {"$match": {doc_uuid: {$in: ["bfa2734a-a262-4b14-a03f-45108ae59fde"]}}},
  {"$unwind": "$files"},
  {"$bucketAuto": {
    groupBy: "$files.filesize",
    buckets: 16,
    granularity: "POWERSOF2"
  }}
])

which works fine. Here is some output from my real data:

{ "_id" : { "min" : 8192, "max" : 16384 }, "count" : 16 }
{ "_id" : { "min" : 16384, "max" : 2097152 }, "count" : 1 }
{ "_id" : { "min" : 2097152, "max" : 8388608 }, "count" : 1 }

I have two questions about this:

  1. Because buckets is a required parameter (even when granularity is set to "POWERSOF2"), I do not know the ideal value for it, since I do not know the number of buckets in advance. Is it a good strategy to set buckets to a really high value (e.g. 1024, because I am unlikely to encounter a file with a filesize >= 2^1024 bytes), or is there a way to determine the number of buckets dynamically?
  2. In my real data example, only buckets that contain at least one document appear with min/max/count values. Is it possible to return empty buckets as well, so that, for instance, {"_id" : {"min": 4096, "max": 8192}, "count": 0} is also in the result set?

And a side question: how does MongoDB handle values that are exactly 2^n, e.g. 1024? Do those values appear in two buckets (in this case in {"min": 512, "max": 1024} and in {"min": 1024, "max": 2048})? If so, is it possible to create disjoint buckets?

Your first question suggests that you don't actually want $bucketAuto but rather $bucket. The whole point of $bucketAuto is that it determines the bucket boundaries automatically, based on a desired bucket count. In your case, you seem to know what the bucket boundaries should be and would like to leave the number of buckets unspecified.
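As a rough illustration of the $bucket variant, the sketch below builds power-of-two boundaries up to the first power of two above a given maximum file size and plugs them into a pipeline. The helper names powerOfTwoBoundaries and buildBucketPipeline are hypothetical, and the sketch assumes the largest filesize is known or has been looked up beforehand (for example with a separate $group/$max query); the "other" default bucket is likewise just an illustrative choice.

```javascript
// Hypothetical helper: returns [0, 1, 2, 4, ..., 2^n], where 2^n is the
// first power of two strictly greater than maxFilesize.
function powerOfTwoBoundaries(maxFilesize) {
  const boundaries = [0];
  let power = 1;
  while (power <= maxFilesize) {
    boundaries.push(power);
    power *= 2;
  }
  boundaries.push(power); // upper bound of the last bucket
  return boundaries;
}

// Hypothetical helper: same $match/$unwind as in the question, but with
// $bucket and explicit boundaries instead of $bucketAuto.
function buildBucketPipeline(docUuids, boundaries) {
  return [
    { $match: { doc_uuid: { $in: docUuids } } },
    { $unwind: "$files" },
    {
      $bucket: {
        groupBy: "$files.filesize",
        boundaries: boundaries,
        default: "other", // catches values outside the boundaries
        output: { count: { $sum: 1 } },
      },
    },
  ];
}

// Usage in the shell (assumes 4977756 is the largest filesize):
// db.repositories.aggregate(
//   buildBucketPipeline(["bfa2734a-a262-4b14-a03f-45108ae59fde"],
//                       powerOfTwoBoundaries(4977756)))
```

With $bucket, each lower boundary is inclusive and each upper boundary is exclusive, so a value of exactly 1024 lands only in the bucket that starts at 1024.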

If you go with this option, it answers your second question as well: with fixed bucket boundaries, some buckets may end up being empty.
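One caveat worth hedging here: $bucket only emits output documents for buckets that contain at least one input document, and it uses the bucket's lower boundary as _id rather than a {min, max} pair. A minimal sketch of filling in the zero-count buckets on the client side could look like this; fillEmptyBuckets is a hypothetical helper name, and the sketch assumes the results come from a $bucket stage with a count output field.

```javascript
// Hypothetical helper: expands $bucket results (where _id is the lower
// boundary) into one entry per boundary interval, using count 0 for
// intervals that are missing from the results.
function fillEmptyBuckets(results, boundaries) {
  const counts = new Map(results.map((r) => [r._id, r.count]));
  const filled = [];
  for (let i = 0; i < boundaries.length - 1; i++) {
    filled.push({
      _id: { min: boundaries[i], max: boundaries[i + 1] },
      count: counts.get(boundaries[i]) || 0,
    });
  }
  return filled;
}

// Example with made-up results for the boundaries [0, 1, 2, 4, 8]:
const exampleBoundaries = [0, 1, 2, 4, 8];
const exampleResults = [
  { _id: 0, count: 12 },
  { _id: 1, count: 1 },
  { _id: 4, count: 145 },
];
console.log(fillEmptyBuckets(exampleResults, exampleBoundaries));
```

Results whose _id is not one of the boundaries (such as a default bucket) are ignored by this helper and would need to be handled separately.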
