How to index unknown fields of a known field in PyMongo?
I am trying to find the unique words in millions of tweets, and I also want to keep track of where each word appears. In addition, I am grouping the words by their initial letter. Here is some sample code:
from pymongo import UpdateOne

# connect to db stuff

commands = []
for word in words:  # this is actually not the real loop I've used but it fits for this example
    # assume tweet_id and the position (beg, end) are calculated here
    initial = word[0]
    ret = {"tweet_id": tweet_id, "pos": (beg, end)}  # additional information about word
    command = UpdateOne({"initial": initial},
                        {"$inc": {"count": 1}, "$push": {"words.%s" % word: ret}},
                        upsert=True)
    commands.append(command)
    if len(commands) >= 1000:  # flush in batches of 1000
        db.tweet_words.bulk_write(commands, ordered=False)
        commands = []
if commands:  # write whatever is left after the loop
    db.tweet_words.bulk_write(commands, ordered=False)
However, this is far too slow to analyze all those tweets. I am guessing the problem is that I don't use an index on the words field.
Here is a sample document:
{
    initial: "t",
    count: 3,
    words: {
        "the": [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}],
        "turkish": [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]
    }
}
I've tried to create indexes with the following code (unsuccessfully):
db.tweet_words.create_index([("words.$**", pymongo.TEXT)])
or
db.tweet_words.create_index([("words", pymongo.HASHED)])
I got errors like "add index fails, too many indexes for twitter.tweet_words" or "key too large to index". Is there a way to do this with indexes? Or should I change my approach to the problem (maybe redesign the db)?
For the data to be indexable, you need to keep your dynamic data in the values of the objects, not in the keys. So I'd suggest you rework your schema to look like:
{
    initial: "t",
    count: 3,
    words: [
        {value: "the", tweets: [{"tweet_id": <some-tweet-id>, "pos": (2, 5)},
                                {"tweet_id": <some-other-tweet-id>, "pos": (9, 12)}]},
        {value: "turkish", tweets: [{"tweet_id": <some-tweet-id>, "pos": (5, 11)}]}
    ]
}
Which you could then index as:
db.tweet_words.create_index([("words.value", pymongo.TEXT)])
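If documents in the old shape already exist, they would need to be converted before such an index is useful. The following is only a rough sketch of that conversion in plain Python, with no database calls. The helper name to_array_schema and the integer tweet ids are made up for illustration and are not part of the original code.

```python
def to_array_schema(doc):
    """Turn {"words": {word: [occurrences]}} into
    {"words": [{"value": word, "tweets": [occurrences]}]}.

    Hypothetical helper; returns a new dict and leaves `doc` unchanged.
    """
    # Copy every top-level field except the keyed "words" mapping.
    converted = {key: value for key, value in doc.items() if key != "words"}
    # Move each word from a key into the "value" field of a subdocument.
    converted["words"] = [
        {"value": word, "tweets": list(occurrences)}
        for word, occurrences in doc.get("words", {}).items()
    ]
    return converted


# Example document in the old shape (tweet ids 1 and 2 are placeholders).
old_doc = {
    "initial": "t",
    "count": 3,
    "words": {
        "the": [{"tweet_id": 1, "pos": [2, 5]},
                {"tweet_id": 2, "pos": [9, 12]}],
        "turkish": [{"tweet_id": 1, "pos": [5, 11]}],
    },
}

new_doc = to_array_schema(old_doc)
```

The converted dict could then be written back with something like replace_one on the collection, assuming each document is identified by its initial field as in the question.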