MongoDB MapReduce in PHP

First of all, this is my first time using Mongo...

Concept:

  1. A user is able to describe an image in natural language.
  2. Split the user input into words and store them in a collection called words.
  3. Users must be able to go through the most used words and add those words to their description.
  4. The system will use the most used words (for all users) and use those words to describe the image.

My words document currently looks like this (example):

{
"date": "date it was inserted",
"reported": 0,
"image_id": "image id",
"image_name": "image name",
"user": "user _id",
"word": "awesome"
}

The words will be duplicated so that each word can be associated to a user...
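
To make the storage step concrete, here is a minimal sketch in plain JavaScript of how a description could be turned into one document per word. The helper name toWordDocs, the ASCII-only splitting rule and the choice to drop repeated words within a single description are all assumptions made for illustration; they are not part of the original design.

```javascript
// Hypothetical helper: build one "words" document per distinct word in a description.
// Assumption: words are lowercased and split on anything that is not an ASCII letter or digit.
function toWordDocs(description, imageId, imageName, userId) {
  const words = description
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(function (w) { return w.length > 0; });

  // Assumption: a word repeated inside one description is stored only once for that user.
  const unique = Array.from(new Set(words));

  return unique.map(function (w) {
    return {
      date: new Date(),
      reported: 0,
      image_id: imageId,
      image_name: imageName,
      user: userId,
      word: w
    };
  });
}
```

The resulting array could then be passed to an insert on the words collection.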

Problem: I need to run a Mongo query that tells me the most used words (to describe an image) that were not created by a given user (to meet point 3 above).

I've looked at MapReduce, but from what I've read there are a few issues with it:

  1. Can't sort results (I can't order them from most used to least used)
  2. With millions of documents it can take a long time to process.
  3. Can't limit the number of results returned

I've thought about running a scheduled task once a day that stores, in a document in a different collection, the ranked list of words that a given user hasn't used to describe a given image. I would have to limit this to 300 results or so (any idea on a proper limit?). Something like:

{
user_id: "the user id",
words: [
{word: "test", count: 1000},
{word: "test2", count: 980},
{word: "etc", count: 300}
]
}

Problems I see with this solution are:

  1. Results would have quite a delay which is not desirable.
  2. Server load can spike while generating these documents for all users (I actually know very little about this in Mongo, so this is just an assumption)

Maybe my approach doesn't make any sense... And maybe my lack of experience in Mongo is pointing me at the wrong "schema design".

Any idea of what could be a good approach for this kind of problem?

Sorry for the big post and thanks for your time and help!

João

As already mentioned, you could use the group command, which is easy to use, but you will need to sort the result on the client side. Also, the result is returned as a single BSON object, so it must be fairly small (fewer than 10,000 keys), or you will get an exception.

Code example based on your data structure:

db.words.group({
    key : {"word" : true},
    initial: {count : 0},
    reduce: function(obj, prev) { prev.count++},
    cond: {"user" :{ $ne : "USERNAME_TO_IGNORE"}}
})

Another option is to use the new Aggregation framework, which will be released in version 2.2. Something like this should work:

db.words.aggregate([
   // keep only the words entered by other users
   { $match : { "user" : { $ne : "USERNAME_TO_IGNORE" } } },
   // count how many times each word occurs
   { $group : {
     _id : "$word",
     count: { $sum : 1 }
   } }
])
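
If it helps to see what the match and group steps compute, here is a rough in-memory equivalent in plain JavaScript, with the sorting and limiting the question asks for added at the end. It is only an illustration on a small array (the function name topWordsExcludingUser is made up); on real data the database should do this work.

```javascript
// In-memory illustration only: count words entered by other users,
// then order by count (highest first) and keep the first `limit` entries.
function topWordsExcludingUser(docs, userToIgnore, limit) {
  const counts = new Map();
  docs.forEach(function (doc) {
    if (doc.user === userToIgnore) { return; }               // same idea as the $ne match
    counts.set(doc.word, (counts.get(doc.word) || 0) + 1);   // same idea as $sum : 1 per word
  });

  return Array.from(counts.entries())
    .map(function (entry) { return { _id: entry[0], count: entry[1] }; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, limit);
}
```

The output has the same shape as the grouped documents: one entry per word with _id and count.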

Or you can still use MapReduce. You actually can limit and sort the output, because the result is a collection: just use .sort() and .limit() on it. You can also use the incremental map-reduce output option, which will help with your performance concerns. Have a look at the out parameter in MapReduce.

Below is an example that uses the incremental feature to merge new data into an existing words_usage collection:

m = function() { 
   emit(this.word, {count: 1}); 
};


r = function( key , values ){
     var sum = 0;
     values.forEach(function(doc) {
          sum += doc.count;
     });
     return {count: sum};
 };

db.runCommand({
    mapreduce : "words", 
    map : m,
    reduce : r,
    out : { reduce: "words_usage"},
    query : <query filter object>
})

// retrieve the top 10 words
db.words_usage.find().sort({"value.count" : -1}).limit(10)

I guess you can run the above MapReduce command from cron every few minutes or hours, depending on how up to date the results need to be. For the query filter you can use the creation date of the words documents, so that each run only processes the words added since the previous run.
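
The incremental mode works because, when a key already exists in the output collection, the stored value and the new value are passed through the reduce function again. That means the reduce function has to return the same shape it receives. Here is a small plain JavaScript check of that idea using the same reduce logic; mergeIncremental is a made-up helper that only imitates the merge step, it is not a MongoDB API.

```javascript
// Same reduce logic as in the MapReduce example above.
function r(key, values) {
  var sum = 0;
  values.forEach(function (doc) {
    sum += doc.count;
  });
  return { count: sum };
}

// Hypothetical helper: imitate merging a new batch of reduced values
// into the totals that are already stored, key by key.
function mergeIncremental(stored, batch) {
  const merged = Object.assign({}, stored);
  Object.keys(batch).forEach(function (word) {
    if (Object.prototype.hasOwnProperty.call(merged, word)) {
      // key exists on both sides: re-reduce the stored and the new value
      merged[word] = r(word, [merged[word], batch[word]]);
    } else {
      merged[word] = batch[word];
    }
  });
  return merged;
}
```

Because r only adds up count fields and returns an object with a count field, reducing in several steps gives the same totals as reducing everything at once.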

Once you have the system-wide top words collection, you can build per-user top words or just compute them in real time (depending on the system size).
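
For the real-time variant, one possible approach is to take the system-wide ranking and drop the words the user has already used. A minimal sketch in plain JavaScript, assuming the ranking is already sorted by count and has the { _id, value: { count } } shape of the MapReduce output; the function name topWordsForUser is hypothetical, and this sketch ignores any per-image grouping.

```javascript
// Hypothetical helper: filter a sorted global ranking down to the words
// a given user has not used yet, keeping at most `limit` entries.
function topWordsForUser(globalTop, userWords, limit) {
  const used = new Set(userWords);
  return globalTop
    .filter(function (entry) { return !used.has(entry._id); })
    .slice(0, limit);
}
```

The list of words a user has used is usually small, so fetching it per request and filtering on the client may be cheap enough.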

The group function is supposed to be a simpler version of MapReduce. You could use it like this to get a sum for each word:

db.coll.group(
           {key: { a:true, b:true },
            cond: { active:1 },
            reduce: function(obj,prev) { prev.csum += obj.c; },
            initial: { csum: 0 }
            });
