How to index unknown fields of a known field in PyMongo?

Question

I am trying to find the unique words in millions of tweets, and I also want to keep track of where each word appears. In addition, I am grouping the words by their initial letter. Here is some sample code:

from pymongo import UpdateOne
# connect to db stuff
commands = []  # pending bulk operations
for word in words:  # this is not the real loop I used, but it fits this example
    # assume tweet_id and the position (beg, end) are calculated here
    initial = word[0]
    ret = {"tweet_id": tweet_id, "pos": (beg, end)}  # additional information about the word
    command = UpdateOne({"initial": initial}, {"$inc": {"count": 1}, "$push": {"words.%s" % word: ret}}, upsert=True)
    commands.append(command)
    if len(commands) % 1000 == 0:
        db.tweet_words.bulk_write(commands, ordered=False)
        commands = []
if commands:  # flush the final partial batch
    db.tweet_words.bulk_write(commands, ordered=False)

However, this is far too slow to analyze all those tweets. I am guessing the problem is that I don't have an index on the words field.

Here is a sample document:

{
    initial: "t",
    count: 3,
    words: {
        "the": [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}],
        "turkish": [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]
    }
}

I've tried (unsuccessfully) to create indexes with the following code:

db.tweet_words.create_index([("words.$**", pymongo.TEXT)])

or

db.tweet_words.create_index([("words", pymongo.HASHED)])

I got errors such as "add index fails, too many indexes for twitter.tweet_words" and "key too large to index". Is there a way to do this with indexes? Or should I change my approach to the problem (maybe redesign the db)?

Answer 1

For the data to be indexable, you need to keep your dynamic data in the values of the objects, not in the keys. So I'd suggest you rework your schema to look like:

{
    initial: "t",
    count: 3,
    words: [
        {value: "the", tweets: [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}]},
        {value: "turkish", tweets: [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]}
    ]
}
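To make the difference between the two shapes concrete, here is a minimal pure-Python sketch that converts a document from the original layout (words as keys) into the suggested one (words as values in an array). The helper name `reshape_words` and the sample tweet ids are made up for illustration; nothing here talks to MongoDB.

```python
def reshape_words(doc):
    """Turn {"words": {word: [occurrences]}} into
    {"words": [{"value": word, "tweets": [occurrences]}]}.

    Hypothetical helper, not part of PyMongo.
    """
    return {
        "initial": doc["initial"],
        "count": doc["count"],
        "words": [
            # the word moves from a key into the "value" field
            {"value": word, "tweets": list(occurrences)}
            for word, occurrences in doc["words"].items()
        ],
    }


# Sample document in the original layout (tweet ids are placeholders).
old_doc = {
    "initial": "t",
    "count": 3,
    "words": {
        "the": [{"tweet_id": 1, "pos": (2, 5)}, {"tweet_id": 2, "pos": (9, 12)}],
        "turkish": [{"tweet_id": 1, "pos": (5, 11)}],
    },
}

new_doc = reshape_words(old_doc)
```

After the conversion every word lives under the same fixed path, `words.value`, which is what lets a single index cover all of them.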

You could then index it as:

db.tweet_words.create_index([("words.value", pymongo.TEXT)])
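As a rough sketch of how the reworked schema might be queried, the helper below looks up a single word by exact match. Note the assumptions: `find_word` is a hypothetical name, and it filters on `words.value` with an equality match, which is normally served by an ordinary ascending index on that path rather than by the text index above (a text index serves `$text` searches instead). The positional projection `words.$` asks the server to return only the matching array element.

```python
def find_word(collection, word):
    """Return the stored entry for `word`, or None if it is absent.

    `collection` is assumed to behave like a PyMongo Collection.
    Hypothetical helper for illustration only.
    """
    doc = collection.find_one(
        {"words.value": word},         # match the document whose array holds the word
        {"initial": 1, "words.$": 1},  # positional projection: only the matched element
    )
    if doc is None:
        return None
    entry = doc["words"][0]
    return {
        "initial": doc["initial"],
        "value": entry["value"],
        "tweets": entry["tweets"],
    }


# Possible usage against a real database (assumes `db` is already connected):
#   db.tweet_words.create_index([("words.value", pymongo.ASCENDING)])
#   find_word(db.tweet_words, "turkish")
```

Whether an exact-match index or the text index fits better depends on how the words will be searched, so treat the index choice here as an assumption to verify against your own queries.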

Solution 1
1 Accepted 2018-12-22 21:29:45
