
Find the most common sentences/phrases among millions of documents using Python

I have about 5 million documents. A document is composed of many sentences, and may be about one to five pages long. Each document is a text file.

I have to find the most common sentences/phrases (at least 5 words long) among all the documents. How should I achieve this?

For exactly 5-word-long phrases, this is relatively simple Python (though it may require lots of memory). For variably longer phrases, it's a bit more complicated, and may need additional clarification about what kinds of longer phrases you'd want to find.

For the 5-word (aka '5-gram') case:

In one pass over the corpus, you generate all 5-grams and tally their occurrences (say, into a Counter), then report the top-N.

For example, let's assume docs is a Python sequence of all your tokenized texts, where each individual item is a list of string words. Then:

from collections import Counter

ngram_size = 5
tallies = Counter()

for doc in docs:
    for i in range(0, len(doc) - ngram_size + 1):
        # Counter keys must be hashable, so store each n-gram as a tuple of words
        ngram = tuple(doc[i:i + ngram_size])
        tallies[ngram] += 1

# show the 10 most-common n-grams
print(tallies.most_common(10))
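
The docs sequence above is an assumption about your data. With millions of text files, you might build it as a generator so only one tokenized document is in memory at a time; a minimal sketch, assuming a hypothetical directory corpus_dir of plain-text files and crude lowercase/whitespace tokenization:

import os

def iter_docs(corpus_dir):
    # yield one tokenized document at a time, so the whole corpus never sits in memory
    for name in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, name), encoding='utf-8') as f:
            # crude tokenization: lowercase + whitespace split; swap in a real tokenizer if needed
            yield f.read().lower().split()

docs = iter_docs('corpus_dir')  # a generator is exhausted after one pass; re-create it for each extra pass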

If you then wanted to also consider variably longer phrases, it's a little trickier, but note any such phrase would have to start with some of the 5-grams you'd already found.

So you could consider gradually repeating the above, for 6-grams, 7-grams, etc.

But to optimize for memory/effort, you could add a step to ignore all n-grams that don't already start with one of the top-N candidates you chose from an earlier run. (For example, in a 6-gram run, the += line above would be conditional on the 6-gram starting with one of the few 5-grams you've already considered to be of interest.)

And further, you'd stop looking for ever-longer n-grams when (for example) the count of the top 8-grams is already below the relevant top-N counts of shorter n-grams. (That is, when any further longer n-grams are assured of being less frequent than your top-N of interest.) A sketch of this incremental approach follows.
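
Here's a minimal sketch of that incremental idea, not the only way to do it: the count_ngrams helper just wraps the loop above, the top-N cutoff of 100 and the min-count stopping test are illustrative choices, and docs must be re-iterable (a list, or a fresh generator per pass):

from collections import Counter

def count_ngrams(docs, n, allowed_prefixes=None):
    # tally n-grams; if allowed_prefixes is given, keep only n-grams whose
    # first (n-1) words form one of the shorter n-grams already of interest
    counts = Counter()
    for doc in docs:
        for i in range(0, len(doc) - n + 1):
            ngram = tuple(doc[i:i + n])
            if allowed_prefixes is None or ngram[:-1] in allowed_prefixes:
                counts[ngram] += 1
    return counts

top_n = 100  # illustrative cutoff
results = dict(count_ngrams(docs, 5).most_common(top_n))

n = 6
prefixes = set(results)  # the 5-grams judged interesting so far
while prefixes:
    longer = count_ngrams(docs, n, allowed_prefixes=prefixes)
    min_kept = min(results.values())
    # keep only longer n-grams still frequent enough to matter
    keep = {g: c for g, c in longer.most_common(top_n) if c >= min_kept}
    if not keep:
        break  # any even-longer n-gram is assured of being rarer still
    results.update(keep)
    prefixes = set(keep)
    n += 1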
