
MongoDB and rmongodb. Get size of find instead of returning all results

I have a MongoDB collection with >100k documents (this number will keep growing). Each document has a few fields that are a single value, and about 50 fields that are each an array of length 1000. I am analyzing results in R using rmongodb.

In rmongodb I am using mongo.find.all() with query set to some combination of criteria to search for, and fields set to a subset of the fields to return. The equivalent in the mongo shell would be something like:

db.collection.find({query1 : "value1", query2 : "value2"},{field1 : 1, field2 : 1, field3 : 1})
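For reference, the corresponding rmongodb call looks roughly like this (the connection object and the list form of query/fields are a sketch; the exact mongo.find.all() arguments depend on your rmongodb version):

# sketch of the rmongodb call; "mongo" and the namespace are placeholders
mongo <- mongo.create(host = "localhost")
res <- mongo.find.all(mongo, "db.collection",
                      query  = list(query1 = "value1", query2 = "value2"),
                      fields = list(field1 = 1L, field2 = 1L, field3 = 1L))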

This returns a data.frame of the results, which I do some post-processing on and end up with a data.table.

What I would like to do is add some safeguards to the query. If the query is broad, and the fields returned are many of the larger array fields, the resulting data.table can be in the tens of GB. This might be what is expected, but I would like to add some flags or error checking so that someone doesn't accidentally try to return hundreds of GB at once.

I know I can get a count of the number of documents that match a query (mongo.count in rmongodb, db.collection.find({...},{...}).count() in the shell). I can also get an average document size (db.collection.stats().avgObjSize).
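For example, a crude upper bound can be computed by multiplying the count by avgObjSize. A minimal sketch in rmongodb (the connection object and database/collection names are placeholders; collStats is the server command behind db.collection.stats()):

# rough upper-bound estimate in MB, assuming every matching document
# were returned in full (i.e. ignoring the fields projection)
n     <- mongo.count(mongo, "db.collection",
                     query = list(query1 = "value1", query2 = "value2"))
stats <- mongo.command(mongo, "db", list(collStats = "collection"))
avg   <- mongo.bson.value(stats, "avgObjSize")
est_mb <- n * avg / 1024^2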

What I do not know how to do, nor do I know if it is possible, is to get the size (in MB, not number) of a find before the find is actually returned. Since I am often returning only a subset of the fields, the count and avgObjSize don't give me a very accurate estimate of how big the resulting data.table will be. The size would need to take into account both the query and the fields.

Is there a command like db.collection.find({},{}).sizeOf() that would return the size in MB of my find(query,fields)? The only options I can see are count() and size(), both of which return the number of documents.

You can iterate through the cursor manually (as is done in mongo.cursor.to.list) and check the size of the accumulated result on each iteration. Something like this:

# default limit on the accumulated result size: 1 GB
LIMIT <- 1024 * 1024 * 1024

mongo.cursor.to.list_with_check <- function(cursor,
                                            keep.ordering = TRUE,
                                            limit = LIMIT) {
    # use an environment to avoid extra copies
    e <- new.env(parent = emptyenv())
    i <- 1
    res_size <- 0
    while (mongo.cursor.next(cursor) && res_size < limit) {
        val <- mongo.bson.to.list(mongo.cursor.value(cursor))
        res_size <- res_size + as.numeric(object.size(val))
        assign(x = as.character(i), value = val, envir = e)
        i <- i + 1
    }
    # convert back to an unnamed list, restoring insertion order if requested
    res <- as.list(e)
    if (isTRUE(keep.ordering)) setNames(res[order(as.integer(names(res)))], NULL)
    else setNames(res, NULL)
}

After that you can convert the result into a data.table via data.table::rbindlist().
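A usage sketch, assuming a connection object mongo and the same namespace and fields as in the question (all names are placeholders):

# usage sketch; connection, namespace and field names are placeholders
cursor <- mongo.find(mongo, "db.collection",
                     query  = list(query1 = "value1", query2 = "value2"),
                     fields = list(field1 = 1L, field2 = 1L, field3 = 1L))
docs <- mongo.cursor.to.list_with_check(cursor, limit = 1024 * 1024 * 1024)  # stop near 1 GB
dt   <- data.table::rbindlist(docs, fill = TRUE)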

You can write a script to get the flexibility required in this situation (assuming you want to return at most 1 GB):

    // limit: 1 GB
    var byteLimit = 1024 * 1024 * 1024;
    // number of documents to show, rounded down to an int
    var numberToShow = (byteLimit / db.restaurants.stats().avgObjSize) | 0;
    // limit the query
    db.restaurants.find(
        {query1 : "value1", query2 : "value2"},
        {field1 : 1, field2 : 1, field3 : 1}
    ).limit(numberToShow)
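Note that this caps the number of documents using avgObjSize, which reflects full documents; when only a subset of the fields is projected, the actual result will be smaller, so treat the limit as a rough safeguard rather than an exact byte limit.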
