
Find the most common sentences/phrases among millions of documents using Python

I have about 5 million documents. A document is composed of many sentences, and may be about one to five pages long. Each document is a text file.

I have to find the most common sentences / phrases (at least 5 words long) among all the documents. How should I achieve this?

For phrases of exactly 5 words, this takes relatively simple Python (though it may require lots of memory). For variably longer phrases, it's a bit more complicated, and may need additional clarification about what kinds of longer phrases you'd want to find.

For the 5-word (aka '5-gram') case:

In one pass over the corpus, you generate all 5-grams and tally their occurrences (say, into a Counter), then report the top-N.

For example, let's assume docs is a Python sequence of all your tokenized texts, where each individual item is a list-of-string-words. Then:

from collections import Counter

ngram_size = 5
tallies = Counter()

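for doc in docs:
    # slide a window of ngram_size words across the document
    for i in range(len(doc) - ngram_size + 1):
        # a list slice isn't hashable, so convert it to a tuple to use it as a Counter key
        ngram = tuple(doc[i:i + ngram_size])
        tallies[ngram] += 1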

# show the 10 most-common n-grams
print(tallies.most_common(10))
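Since the question says each document is a text file, here is one possible way to build docs lazily, so the whole corpus never has to sit in memory at once. This is only a sketch: the function name iter_docs, the *.txt file pattern, and the crude regex tokenizer are all assumptions, not something from the question.

```python
import re
from pathlib import Path

def iter_docs(corpus_dir):
    """Yield each .txt file in corpus_dir as a list of lowercase word tokens."""
    # hypothetical layout: one document per .txt file directly inside corpus_dir
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # crude tokenizer (an assumption): runs of letters, digits, and apostrophes
        yield re.findall(r"[a-z0-9']+", text.lower())
```

Because the counting loop reads docs only once, a generator like this can be passed in directly, for example docs = iter_docs("corpus/").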

If you then also wanted to consider variably longer phrases, it's a little trickier. But note that any such phrase that is frequent would have to start with one of the frequent 5-grams you'd already found.

So you could consider gradually repeating the above, for 6-grams, 7-grams, etc.

But to save memory and effort, you could add a step to ignore all n-grams that don't start with one of the top-N candidates you chose from an earlier run. (For example, in a 6-gram run, the += line above would be conditional on the 6-gram starting with one of the few 5-grams you've already considered to be of interest.)
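A minimal sketch of that filtered pass, assuming seeds is a collection of equal-length tuples chosen from an earlier run (the function name count_extensions is made up for illustration):

```python
from collections import Counter

def count_extensions(docs, seeds, n):
    """Count only those n-grams whose leading words form one of the seed tuples."""
    seeds = set(seeds)
    tallies = Counter()
    if not seeds:
        return tallies
    seed_len = len(next(iter(seeds)))  # assumes every seed has the same length
    for doc in docs:
        for i in range(len(doc) - n + 1):
            # skip any window that doesn't begin with an already-interesting seed
            if tuple(doc[i:i + seed_len]) in seeds:
                tallies[tuple(doc[i:i + n])] += 1
    return tallies
```

Note that each longer run is another full pass over the corpus, so docs must be something you can iterate again (a list, or a freshly created generator), not an already-exhausted one.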

And further, you'd stop looking for ever-longer n-grams when (for example) the count of the top 8-grams is already below the relevant top-N counts of shorter n-grams. Because a longer n-gram can never occur more often than the shorter n-gram it starts with, at that point any further longer n-grams are assured of being less frequent than your top-N of interest.
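Putting the pruning and the stopping rule together might look roughly like the following. Treat it as a sketch under assumptions rather than a tuned solution: the names top_phrases and count_ngrams, the max_len safety cap, and the choice to materialize docs as a list (which will not suit every corpus size) are all mine.

```python
from collections import Counter

def count_ngrams(docs, n, seeds=None):
    """Count n-grams; if seeds is given, keep only those starting with a seed (n-1)-gram."""
    tallies = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            if seeds is None or tuple(doc[i:i + n - 1]) in seeds:
                tallies[tuple(doc[i:i + n])] += 1
    return tallies

def top_phrases(docs, top_n=10, min_len=5, max_len=20):
    """Return the top_n most frequent phrases of min_len..max_len words."""
    docs = list(docs)  # several passes are needed (assumes the token lists fit in memory)
    best = Counter()
    n = min_len
    tallies = count_ngrams(docs, n)
    while tallies:
        # merge this length's counts into the running list and keep the overall top_n
        best = Counter(dict((best + tallies).most_common(top_n)))
        # a longer phrase must reach this count to displace anything in the list
        threshold = min(best.values()) if len(best) == top_n else 1
        # only n-grams at or above the threshold can begin a competitive longer phrase
        seeds = {gram for gram, count in tallies.items() if count >= threshold}
        if not seeds or n >= max_len:
            break  # nothing longer can reach the top_n, or the cap was hit
        n += 1
        tallies = count_ngrams(docs, n, seeds)
    return best.most_common(top_n)
```

One consequence of this design is that a phrase and its longer extension can both appear in the result with the same count; whether to collapse those is one of the clarifications about longer phrases mentioned earlier.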


 