
How to search for a value inside an array with pymongo

I am writing a web crawler. It already works, and I now want to add a get_inverted_index function. I have two collections: lexicon and documents. Each document in the documents collection has an array named words, which holds the id and the font size of each word on that page. My next step is to iterate over the words in the lexicon and find the documents that contain each one, but I cannot see how to write the query for this. I have tried the following code snippet:

k = {}
for word in self.lexicon.find():
    s = set()
    for page in self.documents.find({'words' : {'$in' : word['_id']}}):

But this query did not work properly. As an example, here is one entry from my lexicon collection:

{
    "_id": {
        "$oid": "54723c55b59c44a167ed3424"
    },
    "word": "google"
}

And an example from my documents collection:

{
    "_id": {
        "$oid": "54723c54b59c44a167ed3423"
    },
    "url": "http://www.google.com",
    "words": [
        [
            {
                "$oid": "54723c55b59c44a167ed3424"
            },
            7
        ],
        [
            {
                "$oid": "54723c55b59c44a167ed3425"
            },
            2
        ],
        [
            {
                "$oid": "54723c55b59c44a167ed3428"
            },
            0
        ],
        [
            {
                "$oid": "54723c55b59c44a167ed342b"
            },
            0
        ],
        [
            {
                "$oid": "54723c56b59c44a167ed342e"
            },
            0
        ],
        [
            {
                "$oid": "54723c5eb59c44a167ed3477"
            },
            0
        ]
    ]
}

Edit:

I have also tried a regex, with no success (this was only to test the expression):

for page in documents.find({'words' : [ObjectId('547244abb59c44a167ed4a84'), {"$regex": "*"}]}):
    print page

Also

for page in documents.find({'words' : [{'$in' : ObjectId('547244abb59c44a167ed4a84')}, {'$regex': '*'}]}):
    print page

That is a really unfortunate choice of schema for the documents collection.

You say that you have an array named words which has the id and the font size of each word in each document. Unfortunately, you store each id and font size pair as another array. What would make sense is to have the id and font size as named fields in a subdocument. To put it in more Pythonic terms, you want a list of dictionaries, not a list of lists.

{  "_id":   <id here>,
   "url": "http://www.google.com",
   "words": [
       { "id":<id>, "fs":7 },
       { "id":<id>, "fs":2 }
   ]
}

This makes the collection simple to query with documents.find({"words.id":<id>}). In addition, if you ever want to track other things about each word, it won't be a mystery what that second number means.
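As a rough illustration of the reshaping described above, the sketch below converts the existing `[id, font_size]` pairs into `{"id": ..., "fs": ...}` subdocuments and builds the matching filter. The helper names `to_subdocuments` and `words_id_filter` are hypothetical, and plain strings stand in for ObjectId values; writing the converted documents back to MongoDB is left out.

```python
def to_subdocuments(doc):
    """Return a copy of `doc` whose 'words' entries are dicts, not pairs.

    Assumes every entry in doc['words'] is a two-item list: [id, font_size].
    """
    converted = dict(doc)  # shallow copy, the input document is not modified
    converted["words"] = [
        {"id": word_id, "fs": font_size}
        for word_id, font_size in doc["words"]
    ]
    return converted


def words_id_filter(word_id):
    """Build the filter for documents.find() on the reshaped schema."""
    return {"words.id": word_id}


# Placeholder ids; real documents would hold ObjectId values here.
old_doc = {
    "_id": "d1",
    "url": "http://www.google.com",
    "words": [["w1", 7], ["w2", 2]],
}
new_doc = to_subdocuments(old_doc)
```

With the documents stored in this shape, `documents.find(words_id_filter(word['_id']))` would be the query the answer has in mind.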

While you can contrive a query that happens to return what you want from the schema you have, that schema is really not a good fit for what it describes. However, if you are determined to stay with your current structure, the proper way to query it would be

documents.find({'words':{'$elemMatch':{'0':word['_id']}}})

Rather than using a double $elemMatch, this syntax looks specifically for an array element whose first element matches the _id in question.
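A minimal sketch of how that query might slot into the get_inverted_index function from the question, keeping the current list-of-lists schema. It assumes `lexicon` and `documents` are pymongo collections (only their `find()` method is used), and that mapping each word to the set of page URLs is the desired result; the question does not say what the index should hold.

```python
def get_inverted_index(lexicon, documents):
    """Map each word in the lexicon to the URLs of the pages containing it.

    `lexicon` and `documents` are assumed to be pymongo collections whose
    documents look like the examples in the question.
    """
    index = {}
    for word in lexicon.find():
        # Match any entry of 'words' whose first item (index 0) is this id.
        query = {"words": {"$elemMatch": {"0": word["_id"]}}}
        index[word["word"]] = {page["url"] for page in documents.find(query)}
    return index
```

This issues one query per lexicon entry, which mirrors the loop in the question; whether that is fast enough depends on the size of the lexicon.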

It looks like you need to search the documents collection one level deeper.

As of now you search for the element

{
    "$oid": "54723c55b59c44a167ed3424"
}

And the $in operator compares it to the elements of the words list in your documents collection, such as:

[
    {
        "$oid": "54723c55b59c44a167ed3424"
    },
    7
]

which obviously are never the same. Unfortunately I don't have a MongoDB instance to test with, but maybe this tip helps you improve your query a little.

EDIT: Found an older question here regarding a similar problem; maybe that will help. According to that post, something like the following works:
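The mismatch can be illustrated with a plain-Python analogy. This is not MongoDB's matching logic, only the same comparison carried out on ordinary lists, with strings standing in for the ObjectId values.

```python
word_id = "54723c55b59c44a167ed3424"

# The 'words' array from the example document, shortened to two entries.
words = [
    ["54723c55b59c44a167ed3424", 7],
    ["54723c55b59c44a167ed3425", 2],
]

# Comparing the id against each top-level element compares it with a
# whole [id, font_size] pair, so nothing is ever equal.
top_level_match = any(element == word_id for element in words)

# Looking one level deeper, inside each pair, finds the id.
nested_match = any(word_id in element for element in words)

print(top_level_match, nested_match)
```

Here `top_level_match` is False and `nested_match` is True, which is the "one level deeper" search described above.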

for page in documents.find({'words': {'$elemMatch': {'$elemMatch': {'$in': [word['_id']]}}}}):
    print(page)
