How to create a dynamic number of (empty) buckets in MongoDB's bucketAuto aggregation function?
I store metadata about files in a MongoDB database. One property is the file size in bytes, which I use for a histogram of file sizes. An example document looks like this:
{
    "_id" : ObjectId("5c52366eeb3cae00c3896b89"),
    "doc_uuid" : "bfa2734a-a262-4b14-a03f-45108ae59fde",
    "files" : [
        {
            "uuid" : "7eca2b9d-61a6-4993-99d1-b23fa0a27197",
            "filesize" : 1391908,
            ...
        },
        {
            "uuid" : "c1277835-ce41-4057-a1ae-d67cc0aa7552",
            "filesize" : 4977756,
            ...
        },
    ]
}
I want to create buckets for file sizes of 2^n bytes. For example:
{"_id" : { "min": 0, "max": 1}, "count": 12},
{"_id" : { "min": 1, "max": 2}, "count": 1},
{"_id" : { "min": 2, "max": 4}, "count": 0},
{"_id" : { "min": 4, "max": 8}, "count": 145},
To achieve this, I currently use an aggregation pipeline that looks like this:
db.repositories.aggregate([
    {"$match": {doc_uuid: {$in: ["bfa2734a-a262-4b14-a03f-45108ae59fde"]}}},
    {"$unwind": "$files"},
    {"$bucketAuto": {
        groupBy: "$files.filesize",
        buckets: 16,
        granularity: "POWERSOF2"
    }}
])
which works fine. This is an example of some real data I have:
{ "_id" : { "min" : 8192, "max" : 16384 }, "count" : 16 }
{ "_id" : { "min" : 16384, "max" : 2097152 }, "count" : 1 }
{ "_id" : { "min" : 2097152, "max" : 8388608 }, "count" : 1 }
I have two questions about this:
1. buckets is a required parameter (even if granularity="POWERSOF2" is set). I do not know the ideal value for buckets because I do not know the number of buckets in advance. Is it a good strategy to set the number of buckets to a really high value (e.g. 1024, because it is unlikely that I will encounter a file with a filesize >= 2^1024 bytes), or is there a way to determine the number of buckets dynamically?
2. How can I get empty buckets as well, so that {"_id" : {"min": 4096, "max": 8192}, "count": 0} is in the result set?
And a side question: How does MongoDB handle values that are exactly 2^n, e.g. 1024? Do those values appear in two result buckets (in this case in {"min": 512, "max": 1024} and in {"min": 1024, "max": 2048})? If so, is it possible to create disjoint buckets?
Your first question seems to suggest that you don't actually want to use $bucketAuto but just $bucket. The whole point of $bucketAuto is that it automatically determines the bucket boundaries, based on a desired count. In your case it seems that you have a sense of what you want your bucket boundaries to be, and would like to leave the number of buckets unspecified.
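A minimal sketch of what that could look like, assuming the collection and field names from the question. The helper names powerOfTwoBoundaries and buildBucketPipeline are made up for illustration, and the exponent passed in is an arbitrary upper limit you would choose yourself:

```javascript
// Hypothetical helper: returns [0, 1, 2, 4, ..., 2^maxExponent].
function powerOfTwoBoundaries(maxExponent) {
  const boundaries = [0];
  for (let n = 0; n <= maxExponent; n++) {
    boundaries.push(2 ** n);
  }
  return boundaries;
}

// Hypothetical helper: same $match and $unwind as in the question,
// but with $bucket and explicit boundaries instead of $bucketAuto.
function buildBucketPipeline(docUuids, maxExponent) {
  return [
    { $match: { doc_uuid: { $in: docUuids } } },
    { $unwind: "$files" },
    {
      $bucket: {
        groupBy: "$files.filesize",
        boundaries: powerOfTwoBoundaries(maxExponent),
        // Collects every value that falls outside the boundaries,
        // e.g. sizes >= 2^maxExponent.
        default: "overflow",
        output: { count: { $sum: 1 } },
      },
    },
  ];
}

// Usage in the shell (assumed collection name from the question):
// db.repositories.aggregate(
//   buildBucketPipeline(["bfa2734a-a262-4b14-a03f-45108ae59fde"], 40))
```

With $bucket each boundary is an inclusive lower bound and an exclusive upper bound, so a value of exactly 1024 should land only in the bucket starting at 1024, not in the one ending there. Note that the _id of each $bucket result is just the lower bound rather than a min/max pair.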
If you go with this option, then that answers your second question as well: with fixed bucket boundaries some buckets may end up being empty.
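One caveat worth checking against the $bucket documentation: as far as I know, $bucket only emits an output document for buckets that contain at least one input document, so the zero-count rows would have to be filled in on the client. A rough sketch of that step, assuming the _id values come back as plain numbers (the helper name fillEmptyBuckets is hypothetical):

```javascript
// Hypothetical helper: merges $bucket output with the full boundary list.
// `boundaries` is the array that was passed to $bucket; `results` is the
// aggregation output, where each _id is the inclusive lower bound of a
// non-empty bucket. Rows with other _id values (such as a default bucket)
// are ignored here.
function fillEmptyBuckets(boundaries, results) {
  const counts = new Map(results.map((r) => [r._id, r.count]));
  const filled = [];
  for (let i = 0; i + 1 < boundaries.length; i++) {
    const min = boundaries[i];
    filled.push({
      _id: { min: min, max: boundaries[i + 1] }, // min inclusive, max exclusive
      count: counts.get(min) || 0, // 0 when the server returned no row
    });
  }
  return filled;
}

// Example with made-up counts shaped like the ones in the question.
const example = fillEmptyBuckets(
  [0, 1, 2, 4, 8],
  [{ _id: 0, count: 12 }, { _id: 1, count: 1 }, { _id: 4, count: 145 }]
);
console.log(JSON.stringify(example));
```

This yields one row per interval in the {"min": ..., "max": ...} shape from the question, including the empty 2 to 4 bucket with a count of 0.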