
MongoDB checking for multiple regex matches inside a list for free text search

I am setting up a MongoDB database to allow (simple) keyword searching using multikeys, as recommended here. A record looks similar to this:

{ title: { title: "A river runs through", _keywords: ["a","river","runs","through"] } , ... }

I am using Node.js on the server side, so I am using JavaScript. The following query matches (this was run in the mongo shell):

> db.torrents_sorted.find({'title._keywords' : {"$all" : ["river","the"]} }).count()
210

However, these do not:

> db.torrents_sorted.find({'title._keywords' : {"$all" : ["/river/i","/the/i"]} }).count()
0

> db.torrents_sorted.find({'title._keywords' : {"$all" : [{ "$regex" : "river", "$options" : "i" },{ "$regex" : "the", "$options" : "i" }]} }).count()
0

Using a single regex (without $and or $all) does match:

> db.torrents_sorted.find({'title._keywords' : { "$regex" : "river", "$options" : "i" } }).count()
1461

Interestingly, using Python and pymongo to compile the regular expressions does work:

>>> db.torrents_sorted.find({'title._keywords': { '$all': [re.compile('river'), re.compile('the')]}}).count();
236

I am not necessarily looking for a solution that uses regexes; however, keywords must match on shorter strings, so that "riv" matches "river", which seems ideal for regexes (or LIKE in SQL).
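For the "riv" matches "river" requirement, one option is a prefix-anchored pattern per search term. The sketch below is only an illustration: `escapeRegex` and `prefixPattern` are hypothetical helper names, and it assumes the stored keywords are already lowercased so no case-insensitive flag is needed.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex,
// so user input such as "c++" is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a pattern that matches keywords starting with `term`.
// Assumption: keywords are stored lowercased, so the term is lowercased too.
function prefixPattern(term) {
  return new RegExp('^' + escapeRegex(term.toLowerCase()));
}

// Example query shape (not run here):
//   db.torrents_sorted.find({ 'title._keywords': prefixPattern('riv') })
```

Anchoring with `^` restricts matches to the start of a keyword, so "riv" matches "river" but not "driver"; drop the anchor if substring matches are wanted.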

My next idea is to try passing in a JavaScript function that performs the regex matching on the list, or perhaps passing in a separate function for each regex (this does seem to scream "hack" at me :), although I'm guessing this would be slower, and performance is very important.

You may want to use the $and operator.
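A minimal sketch of how an $and query over several regexes might be built from Node.js. It assumes a server where $and is available and reuses the 'title._keywords' field from the question; `escapeRegex` and `buildKeywordQuery` are hypothetical names, not part of any driver.

```javascript
// Hypothetical helper: escape regex metacharacters in user-supplied terms.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: one $regex clause per term, combined with $and,
// so a document matches only if every term matches some keyword.
function buildKeywordQuery(terms) {
  return {
    $and: terms.map(function (term) {
      return {
        'title._keywords': { $regex: escapeRegex(term), $options: 'i' }
      };
    })
  };
}

// Example usage (not run here):
//   db.torrents_sorted.find(buildKeywordQuery(['riv', 'the'])).count()
```

Each clause is evaluated against the keyword array independently, which avoids placing regex documents inside $all.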

OK, I have an answer, which is interesting in a different way. The bug I was experiencing with regexes exists in version 1.8 of MongoDB and has since been fixed; it is shown here.

Sadly, the hosting company looking after the db at the moment is not able to offer version 2.0, and the $and operator was only added in version 2.0, although thanks for the debugging help, Samarth.

So instead I have written a JavaScript function to perform the regex matching:

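function () {
  // Every pattern must match at least one keyword for the document to match.
  var rs = [RegExp(".*river.*"), RegExp(".*runs.*")];

  for(var j = 0; j < rs.length; j++) {
    var val = false;
    // Stop scanning keywords as soon as one matches the current pattern.
    for (var i = 0; !val && i < this.title._keywords.length; i++)
      val = rs[j].test(this.title._keywords[i]);

    // No keyword matched this pattern, so reject the document.
    if(!val) return false;
  }
  return true;
}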

This runs in O(n^2) time (not very cool), but it fails in linear time if the first regex does not match any of the keywords (since I am looking for a conjunction: every regex must match).
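The same short-circuiting logic can be sketched as a standalone function for checking outside the database. `matchesAll` is a hypothetical name, and it assumes a JavaScript engine that provides `Array.prototype.every` and `Array.prototype.some`, which may not hold for every server-side engine.

```javascript
// Hypothetical helper: true only if every pattern matches at least one keyword.
// `every` stops at the first pattern with no match; `some` stops at the first
// keyword that matches, mirroring the early exits in the loops above.
// Patterns must not use the `g` flag, since that makes `test` stateful.
function matchesAll(keywords, patterns) {
  return patterns.every(function (pattern) {
    return keywords.some(function (keyword) {
      return pattern.test(keyword);
    });
  });
}

// Sample record taken from the question.
var sampleKeywords = ["a", "river", "runs", "through"];
console.log(matchesAll(sampleKeywords, [/riv/, /run/]));
console.log(matchesAll(sampleKeywords, [/riv/, /the/]));
```

The worst case is still one test per pattern and keyword pair; the early exits only help when a pattern fails or a keyword matches early.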

Any input on optimising this would be greatly appreciated, although if this is the best solution I can find for 1.8, I may have to find somewhere else to host my db in the near future ;).
