
How to query documents in mongodb (pymongo) where all keywords exist in a field?

I have a list of keywords:

keywords = ['word1', 'word2', 'word3']

For now I query for only 1 keyword like this:

collection.find({'documenttextfield': {'$regex': ' '+keyword+' '}})

I'm in no way a regex guru, so I put spaces on either side of the keyword in the regex to find an exact match.

But what I want now is, given that keywords list, to query the documents and find those that contain every keyword from the list in documenttextfield.

I have some ideas of how to do this, but they are all a bit too complex and I feel I'm missing something...

Consider using a text index with a $text search. It might be a far better solution than using regular expressions. However, text search matches documents that contain any of the search terms and ranks them with a scoring algorithm, so you might get some results which don't have all the keywords you are looking for.
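A minimal sketch of the $text route with pymongo, assuming a text index already exists on documenttextfield. The helper names (text_query, has_all_keywords, find_with_all_keywords) are made up for illustration. Since $text can return documents that contain only some of the terms, the sketch re-checks every candidate on the client and keeps only those with all keywords.

```python
import re

def text_query(keywords):
    """Build a $text filter; MongoDB ORs the space-separated terms."""
    return {'$text': {'$search': ' '.join(keywords)}}

def has_all_keywords(text, keywords):
    """Client-side check that every keyword occurs as a whole word."""
    return all(
        re.search(r'\b' + re.escape(k) + r'\b', text, re.IGNORECASE)
        for k in keywords
    )

def find_with_all_keywords(collection, keywords, field='documenttextfield'):
    """Use the text index to get candidates, then keep full matches only."""
    candidates = collection.find(text_query(keywords))
    return [d for d in candidates if has_all_keywords(d.get(field, ''), keywords)]

# One-time setup (assumes a pymongo Collection object named `collection`):
# collection.create_index([('documenttextfield', 'text')])
# docs = find_with_all_keywords(collection, ['word1', 'word2', 'word3'])
```

Note that text search applies stemming, so the index may return a document for a different form of a keyword; the strict client-side check above would drop it. Loosen has_all_keywords if that is not what you want.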

If you can't or don't want to add a text index to this field, using a single regular expression would be quite a pain because you don't know the order in which these words appear. I don't claim it is impossible to write, but you will end up with a horrible abomination even by regex standards. It would be far easier to use the $regex operator multiple times, combined with the $and operator.

Also, using a space as delimiter is going to fail when the word is at the beginning or end of the string or followed by a period or comma. Use the word-boundary token (\b) instead.

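import re

# r'\b' must be a raw string: in a normal Python string '\b' is a backspace.
# re.escape() keeps any regex metacharacters in a keyword literal.
collection.find(
    {'$and': [
        {'documenttextfield': {'$regex': r'\b' + re.escape(keyword1) + r'\b'}},
        {'documenttextfield': {'$regex': r'\b' + re.escape(keyword2) + r'\b'}},
        {'documenttextfield': {'$regex': r'\b' + re.escape(keyword3) + r'\b'}},
    ]}
)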
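Since the question starts from a keywords list rather than three separate variables, the same filter can be built in a loop. This is only a sketch; all_keywords_query is a hypothetical helper name, and the field name is taken from the question.

```python
import re

def all_keywords_query(keywords, field='documenttextfield'):
    """Build an $and filter with one word-boundary $regex per keyword."""
    if not keywords:
        # MongoDB rejects an empty $and array, so fail early.
        raise ValueError('keywords must not be empty')
    return {
        '$and': [
            {field: {'$regex': r'\b' + re.escape(k) + r'\b'}}
            for k in keywords
        ]
    }

keywords = ['word1', 'word2', 'word3']
query = all_keywords_query(keywords)

# With a pymongo Collection object named `collection`:
# results = collection.find(query)
```

Add '$options': 'i' next to each '$regex' if the match should be case-insensitive.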

Keep in mind that this is a really slow query, because it will run these three regular expressions on every single document of the collection. If this is a performance-critical query, seriously consider whether a text index really won't do. Failing that, the last resort would be to extract every keyword someone could search for from documenttextfield (which might be every unique word in it) into a new array field documenttextfield_keywords, create a normal index on that field, and search on that field with the $all operator (no regular expression required in that case).
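The array-field approach could look roughly like this. It is a sketch under assumptions: extract_keywords and keywords_all_query are hypothetical helpers, and splitting on \w+ with lowercasing is just one possible way to define a "keyword".

```python
import re

def extract_keywords(text):
    """Return the unique lowercase words of `text`, sorted for stable output."""
    return sorted(set(re.findall(r'\w+', text.lower())))

def keywords_all_query(keywords, field='documenttextfield_keywords'):
    """Build an $all filter: the array field must contain every keyword."""
    return {field: {'$all': [k.lower() for k in keywords]}}

doc = {'documenttextfield': 'Word1 and word2, then word3.'}
doc['documenttextfield_keywords'] = extract_keywords(doc['documenttextfield'])

# With a pymongo Collection object named `collection`:
# collection.create_index('documenttextfield_keywords')
# collection.insert_one(doc)
# results = collection.find(keywords_all_query(['word1', 'word2', 'word3']))
```

The keywords array has to be recomputed whenever documenttextfield changes, which is the price for avoiding the regex scan.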
