How to index unknown fields of a known field in PyMongo?

I am trying to find the unique words in millions of tweets, and I also want to record where each word appears. In addition, I am grouping the words by their initial letter. Here is some sample code:

from pymongo import UpdateOne
# connect to db stuff
commands = []  # pending bulk operations
for word in words:  # not the real loop I use, but it fits this example
    # assume tweet_id and the position (beg, end) are calculated here
    initial = word[0]
    ret = {"tweet_id": tweet_id, "pos": (beg, end)}  # additional information about the word
    command = UpdateOne({"initial": initial}, {"$inc": {"count": 1}, "$push": {"words.%s" % word: ret}}, upsert=True)
    commands.append(command)
    if len(commands) >= 1000:  # flush in batches of 1000
        db.tweet_words.bulk_write(commands, ordered=False)
        commands = []
if commands:  # flush the final partial batch
    db.tweet_words.bulk_write(commands, ordered=False)

However, this is far too slow for analyzing all those tweets. My guess is that the problem is the lack of an index on the words field.

Here is a sample document:

{
    initial: "t",
    count: 3,
    words: {
        "the": [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}],
        "turkish": [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]
    }
}

I've tried (unsuccessfully) to create indexes with the following code:

db.tweet_words.create_index([("words.$**", pymongo.TEXT)])

or

db.tweet_words.create_index([("words", pymongo.HASHED)])

I got errors such as "add index fails, too many indexes for twitter.tweet_words" or "key too large to index". Is there a way to do this with indexes? Or should I change my approach to the problem (maybe redesign the db)?

To be indexed, your dynamic data needs to live in the values of the objects, not in the keys. So I'd suggest you rework your schema to look like:

{
    initial: "t",
    count: 3,
    words: [
        {value: "the", tweets: [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}]},
        {value: "turkish", tweets: [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]}
    ]
}

Which you could then index as:

db.tweet_words.create_index([("words.value", pymongo.TEXT)])
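
If documents in the old shape already exist, the rework could be done in plain Python before writing them back. The following is only a rough sketch: reshape_words is a hypothetical helper name, and the tweet ids 1 and 2 are made-up placeholders standing in for real ids. It moves each word from a key of words into the value field of a subdocument, which is the field the index above targets.

```python
def reshape_words(doc):
    """Return a copy of `doc` whose `words` dict (keyed by word) becomes
    a list of {"value": word, "tweets": [...]} subdocuments."""
    reshaped = dict(doc)  # shallow copy so the input document is left untouched
    reshaped["words"] = [
        {"value": word, "tweets": list(occurrences)}
        for word, occurrences in doc.get("words", {}).items()
    ]
    return reshaped


# Old-shape document: the words themselves are used as keys.
old_doc = {
    "initial": "t",
    "count": 3,
    "words": {
        "the": [{"tweet_id": 1, "pos": [2, 5]}, {"tweet_id": 2, "pos": [9, 12]}],
        "turkish": [{"tweet_id": 1, "pos": [5, 11]}],
    },
}

# New-shape document: every word sits in the same field, words.value.
new_doc = reshape_words(old_doc)
```

Because every word now lives under the same field name, a single index covers all of them, whatever words turn up in the tweets.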
