
Find the document whose field contains the most words from a list in MongoDB

I have a collection A and an array B, structured as below:

A :

{
    "_id" : ObjectId("5160757496cc6207a37ff778"),
    "name" : "Pomegranate Yogurt Bowl",
    "description" : "A simple breakfast bowl made with Greek yogurt, fresh pomegranate juice, puffed quinoa cereal, toasted sunflower seeds, and honey."
},
{
  "_id": ObjectId("5160757596cc62079cc2db18"),
  "name": "Krispy Easter Eggs",
  "description": "Imagine the Easter Bunny laying an egg.     Wait. That’s not anatomically possible.     And anyway, the Easter Bunny is a b..."
}

B :

var names = ["egg", "garlic", "cucumber", "kale", "pomegranate", "sunflower", "fish", "pork", "apple", "sunflower", "strawberry", "banana"]

My goal is to return the one document from A whose text contains the most occurrences of the words in array B. In this case it should return the first one, "_id" : ObjectId("5160757496cc6207a37ff778").

I'm not sure how to go about solving this.

This doesn't work:

db.A.find({
    "description": {
      "$in": names
    }
  }, function(err, data) {
    if (err) console.log(err);
    console.log(data);
  });
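
The query above returns nothing because $in checks whether the whole "description" value is equal to one of the array elements. It does not look for those words inside the string. A minimal plain-JavaScript sketch of the difference, runnable outside MongoDB (the helper names equalsAny and containsAny are made up for illustration and are not MongoDB APIs):

```javascript
// Illustration only: plain JavaScript, no database involved.
const names = ["egg", "garlic", "cucumber", "kale", "pomegranate", "sunflower",
               "fish", "pork", "apple", "sunflower", "strawberry", "banana"];

const description = "A simple breakfast bowl made with Greek yogurt, fresh " +
  "pomegranate juice, puffed quinoa cereal, toasted sunflower seeds, and honey.";

// Roughly what { description: { $in: names } } asks for a string field:
// is the entire value equal to one of the listed strings?
function equalsAny(value, list) {
  return list.includes(value);
}

// What the question actually needs: does the value contain any listed word?
function containsAny(value, list) {
  const text = value.toLowerCase();
  return list.some(function (word) {
    return text.includes(word.toLowerCase());
  });
}

console.log(equalsAny(description, names));   // false
console.log(containsAny(description, names)); // true
```

Containment alone only filters documents. Ranking them by how many words match still needs a score per document.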

It depends on the sort of "words" you want to throw at this, and whether they include things considered "stop words" such as "a", "the", "with" etc., or whether the count of those things really doesn't matter.

If they don't matter, then consider a text index and a $text search.

First index:

db.A.createIndex({ "name": "text", "description": "text" })

And then just construct the search:

var words = [
  "egg", "garlic", "cucumber", "kale", "pomegranate",
  "sunflower", "fish", "pork", "apple", "sunflower",
  "strawberry", "banana"
];

var search = words.join(" ")

db.A.find(
    { "$text": { "$search": search } },
    { "score": { "$meta": "textScore" } }
).sort({ "score": { "$meta": "textScore" }}).limit(1)

Returns the first document like this:

{
    "_id" : ObjectId("5160757496cc6207a37ff778"),
    "name" : "Pomegranate Yogurt Bowl",
    "description" : "A simple breakfast bowl made with Greek yogurt, fresh pomegranate juice, puffed quinoa cereal, toasted sunflower seeds, and honey.",
    "score" : 1.7291666666666665
}

On the other hand, if you need to count "stop words", then a mapReduce can find the result for you:

db.A.mapReduce(
  function() {
    var words = [
      "egg", "garlic", "cucumber", "kale", "pomegranate",
      "sunflower", "fish", "pork", "apple", "sunflower",
      "strawberry", "banana"
    ];

    var count = 0;

    var fulltext = this.name.toLowerCase() + " " + this.description.toLowerCase();

    // Increment count by number of matches
    words.forEach(function(word) {
      count += ( fulltext.match(new RegExp(word,"ig")) || [] ).length;
    });

    emit(null,{ count: count, doc: this });

  },
  function(key,values) {
    // Sort largest first, return first.
    // The comparator must return a number, not a boolean.
    return values.sort(function(a,b) {
      return b.count - a.count;
    })[0];
  },
  { "out": { "inline": 1 } }
)

With a result:

{
    "_id" : null,
    "value" : {
        "count" : 4,
        "doc" : {
            "_id" : ObjectId("5160757496cc6207a37ff778"),
            "name" : "Pomegranate Yogurt Bowl",
            "description" : "A simple breakfast bowl made with Greek yogurt, fresh pomegranate juice, puffed quinoa cereal, toasted sunflower seeds, and honey."
        }
    }
}
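
To see where the count of 4 comes from, here is a sketch of the same counting step as a plain JavaScript function that runs outside MongoDB. The function name countMatches and the docs array are assumptions for illustration; the documents are the two from the question with the "_id" fields left out.

```javascript
// Illustration only: the mapper's counting logic as a standalone function.
const words = [
  "egg", "garlic", "cucumber", "kale", "pomegranate",
  "sunflower", "fish", "pork", "apple", "sunflower",
  "strawberry", "banana"
];

function countMatches(doc, wordList) {
  const fulltext = (doc.name + " " + doc.description).toLowerCase();
  let count = 0;
  // Each list entry adds the number of substring matches it finds.
  wordList.forEach(function (word) {
    count += (fulltext.match(new RegExp(word, "g")) || []).length;
  });
  return count;
}

const docs = [
  {
    name: "Pomegranate Yogurt Bowl",
    description: "A simple breakfast bowl made with Greek yogurt, fresh pomegranate juice, puffed quinoa cereal, toasted sunflower seeds, and honey."
  },
  {
    name: "Krispy Easter Eggs",
    description: "Imagine the Easter Bunny laying an egg.     Wait. That’s not anatomically possible.     And anyway, the Easter Bunny is a b..."
  }
];

docs.forEach(function (doc) {
  console.log(doc.name, countMatches(doc, words));
});
// Pomegranate Yogurt Bowl 4
// Krispy Easter Eggs 2
```

Two things follow from this. "sunflower" is listed twice in the array, so its single occurrence is counted twice. Matching is also by substring, so "egg" matches "Eggs" as well. Deduplicate the list or add word boundaries to the pattern if neither is wanted.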

So the "text" index approach is "weighting" by the number of matches and then only returning the largest weighted match.

The mapReduce operation goes through each document and works out a score. Then the "reducer" sorts the results and just keeps the one with the highest score.

Note the "reducer" can be called many times, each time on a partial set of values, so this "does not" attempt to sort all the documents in the collection at once. But it is still truly "brute force", since every document has to be read and scored.
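
Because the reducer may run on partial batches and then again on its own earlier outputs, it has to give the same answer either way. "Keep the value with the highest count" has that property. A small sketch in plain JavaScript, with made-up data and a hypothetical pickBest function standing in for the reduce step:

```javascript
// Illustration only: stand-in for the reduce function, run outside MongoDB.
function pickBest(values) {
  // Keep the value with the highest count; no full sort is needed.
  return values.reduce(function (best, v) {
    return v.count > best.count ? v : best;
  });
}

// Made-up emitted values, shaped like { count, doc } from the mapper.
const emitted = [
  { count: 2, doc: "a" },
  { count: 4, doc: "b" },
  { count: 0, doc: "c" },
  { count: 3, doc: "d" }
];

// One pass over everything.
const direct = pickBest(emitted);

// Two partial passes, then one pass over the partial winners,
// which is how repeated reduce calls behave.
const partial = [pickBest(emitted.slice(0, 2)), pickBest(emitted.slice(2))];
const staged = pickBest(partial);

console.log(direct.doc, staged.doc); // b b
```

The reducer's output has the same shape as the values emitted by the mapper, which is what allows it to be fed back in as input on a later call.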


 