
MongoDB and rmongodb. Get size of find instead of returning all results

I have a MongoDB collection with >100k documents (this number will keep growing). Each document has a few fields that hold a single value, and about 50 fields that are each an array of length 1000. I am analyzing results in R using rmongodb.

In rmongodb I am using mongo.find.all() with query set to some combination of criteria to search for, and fields set to a subset of the fields to return. The equivalent in the mongo shell would be something like:

db.collection.find({query1 : "value1", query2 : "value2"},{field1 : 1, field2 : 1, field3 : 1})

This returns a data.frame of the results, which I post-process into a data.table.

What I would like to do is add some safeguards to the query. If the query is broad, and the fields returned include many of the larger array fields, the resulting data.table can be in the tens of GB. This might be what is expected, but I would like to add some flags or error checking so that someone doesn't accidentally try to return hundreds of GB at once.

I know I can get a count of the number of documents that match a query (mongo.count in rmongodb, db.collection.find({...},{...}).count() in the shell). I can also get an average document size (db.collection.stats().avgObjSize).

What I do not know how to do, or whether it is even possible, is get the size (in MB, not the number of documents) of a find before the find is actually returned. Since I am often returning only a subset of the fields, the count and avgObjSize don't give me a very accurate estimate of how big the resulting data.table will be. The size would need to take into account both the query and the fields.

Is there a command like db.collection.find({},{}).sizeOf() that would return the size in MB of my find(query,fields)? The only options I can see are count() and size(), both of which return the number of documents.

You can iterate through the cursor manually (as is done in mongo.cursor.to.list) and check the accumulated size of the result as you go. Something like this:

LIMIT <- 1024 * 1024 * 1024  # 1 GB, in bytes
mongo.cursor.to.list_with_check <- function (cursor, 
                                             keep.ordering = TRUE, 
                                             limit = LIMIT) {
    # make environment to avoid extra copies
    e <- new.env(parent = emptyenv())
    # running total of bytes collected so far, as measured by object.size()
    res_size <- 0
    i <- 1
    # test the size first so no further document is fetched once the limit is hit
    while (res_size < limit && mongo.cursor.next(cursor)) {
        val <- mongo.bson.to.list(mongo.cursor.value(cursor))
        res_size <- res_size + as.numeric(object.size(val))
        assign(x = as.character(i),
               value = val, envir = e)
        i <- i + 1
    }
    if (res_size >= limit) warning("size limit reached; result may be truncated")
    # convert back to list
    res <- as.list(e)
    if (isTRUE(keep.ordering)) setNames(res[order(as.integer(names(res)))], NULL)
    else setNames(res, NULL)
}

After that you can convert it into a data.table via data.table::rbindlist().
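For comparison, here is a minimal sketch of the same early-stop idea in JavaScript, the language of the mongo shell. The names collectWithBudget and sizeOf are hypothetical, and the default size measure (length of the JSON text) is only a rough proxy and not the BSON or in-memory R size. It accepts any iterable such as an array; with a shell cursor you would presumably loop with hasNext()/next() instead.

```javascript
// Collect documents until adding the next one would exceed a byte budget.
// `sizeOf` is a hypothetical pluggable size function; by default the length
// of the JSON text is used as a rough stand-in for the real document size.
function collectWithBudget(docs, byteLimit, sizeOf) {
  sizeOf = sizeOf || function (doc) { return JSON.stringify(doc).length; };
  var results = [];
  var total = 0;
  var truncated = false;
  for (var doc of docs) {
    var size = sizeOf(doc);
    if (total + size > byteLimit) {
      // stop before the budget is exceeded and record that we cut off early
      truncated = true;
      break;
    }
    results.push(doc);
    total += size;
  }
  return { results: results, bytes: total, truncated: truncated };
}
```

Returning the truncated flag alongside the results lets the caller decide whether a partial result is acceptable or should be treated as an error.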

You can write a script to get the flexibility this situation requires (I assume you want to return at most 1 GB):

    // limit: 1 GB, in bytes
    var byteLimit = 1024 * 1024 * 1024;
    // number of documents that fit within the limit, rounded down to an int
    // (avgObjSize is the average size of a full document, in bytes)
    var numberShow = Math.floor(byteLimit / db.restaurants.stats().avgObjSize);
    // limit the query
    db.restaurants.find(
        {query1 : "value1", query2 : "value2"},
        {field1 : 1, field2 : 1, field3 : 1}
    ).limit(numberShow)
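
Because avgObjSize describes whole documents, it can overestimate a projected find considerably. One possible refinement, sketched below under stated assumptions, is to fetch a small sample with the same query and fields, measure the average size of the sampled documents, and multiply by the match count. The function name estimateResultBytes and the sizeOf parameter are hypothetical; the default measure is the JSON text length, and in the legacy mongo shell Object.bsonsize could be passed instead. The estimate is of the transferred data and not of the final data.table in R.

```javascript
// Estimate the total size of a find(query, fields) from a small sample.
// `matchCount` is the number of matching documents (e.g. from count()),
// `sampleDocs` is an array of documents fetched with the same projection,
// `sizeOf` is a hypothetical size function (default: JSON text length).
function estimateResultBytes(matchCount, sampleDocs, sizeOf) {
  sizeOf = sizeOf || function (doc) { return JSON.stringify(doc).length; };
  if (sampleDocs.length === 0) {
    return 0; // nothing matched, so nothing would be returned
  }
  var sampleBytes = 0;
  for (var i = 0; i < sampleDocs.length; i++) {
    sampleBytes += sizeOf(sampleDocs[i]);
  }
  // average sampled size times the number of matching documents
  return Math.round(matchCount * sampleBytes / sampleDocs.length);
}

// Possible use in the mongo shell (assumed, not run here):
//   var q = {query1 : "value1"}, f = {field1 : 1};
//   var sample = db.collection.find(q, f).limit(100).toArray();
//   var bytes = estimateResultBytes(db.collection.find(q).count(), sample, Object.bsonsize);
//   if (bytes > 1024 * 1024 * 1024) { print("query too large: " + bytes + " bytes"); }
```

The first documents returned may not be representative of the whole result, so the estimate is best treated as a sanity check and not as an exact figure.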
