How to get sum of word frequencies by sentence in a document?

I have a small article (document), and I have gotten the word frequency of all tokens in this document. Now, I hope to break the document into sentences, and get the score for each sentence. 'Score' is defined as the sum of the word frequencies for each word in the sentence.

For instance, with a short article as follows:

article = 'We encourage you to take time to read and understand the below information. The first section will help make sure that your investment objectives are still aligned with your current strategy.'

I get the frequency of words as such:

import nltk
from nltk import FreqDist
words = nltk.tokenize.word_tokenize(article)
fdist = FreqDist(words)

The solution must be simple, like a lookup back into the tokens of the article to get the score, but I can't seem to figure it out. Ideally the output would be something like sentScore = [7,5] so that I could easily pick out the top n sentences. In this case, sentScore is just the sum of word frequencies for each sentence (two sentences here).

Edit: I need these counts summed together at the sentence level, and I'm currently splitting sentences using

sentences = nltk.tokenize.sent_tokenize(article)

which is smart enough to handle periods that do not end a sentence. Essentially, the frequencies should be calculated at the article level, and then each sentence should be scored by summing the frequencies of its individual words.

Thanks!

Once you have the counts of all the words, you need to tokenize the article into sentences and then the sentences into words. Then each sentence can be reduced to the sum of word counts.

from collections import Counter

import nltk

words = nltk.tokenize.word_tokenize(article)

# `FreqDist` in your code is presumably nltk.FreqDist, which subclasses
# Counter. A plain Counter gives the same dictionary of word counts.
word_count = Counter(words)

sentences = nltk.tokenize.sent_tokenize(article)

# sentence_words is a list of lists. The article is tokenized into sentences
# and each sentence into words
sentence_words = [nltk.tokenize.word_tokenize(sentence) for sentence in sentences]

sentence_scores = [sum(word_count[word] for word in sentence) for sentence in sentence_words]

For your example article, sentence_scores is [17, 22].
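
Those totals count punctuation tokens such as "." and treat "The" and "the" as different words, because the counts are taken over the raw tokens. If that is not what you want, here is a rough sketch of one way to normalize the tokens and then pick the top n sentences. The names score_sentences, top_n and the normalize flag are my own, not part of NLTK, and the token lists are written out by hand (mirroring what word_tokenize is expected to return for your article) so the snippet runs without NLTK's tokenizer data.

```python
from collections import Counter
import heapq

# Hand-written token lists, assumed to match the output of
# nltk.tokenize.word_tokenize for the two sentences of the example article.
sentence_words = [
    ["We", "encourage", "you", "to", "take", "time", "to", "read", "and",
     "understand", "the", "below", "information", "."],
    ["The", "first", "section", "will", "help", "make", "sure", "that",
     "your", "investment", "objectives", "are", "still", "aligned", "with",
     "your", "current", "strategy", "."],
]


def score_sentences(tokenized, normalize=False):
    """Sum article-level token counts over each sentence.

    With normalize=True (a hypothetical option), tokens are lowercased and
    tokens containing no letters or digits, such as ".", are skipped.
    """
    def clean(tokens):
        if not normalize:
            return list(tokens)
        return [t.lower() for t in tokens if any(c.isalnum() for c in t)]

    cleaned = [clean(tokens) for tokens in tokenized]
    # Counts are computed once over the whole article, then looked up per sentence.
    counts = Counter(token for tokens in cleaned for token in tokens)
    return [sum(counts[t] for t in tokens) for tokens in cleaned]


def top_n(scores, n):
    """Return the indices of the n highest-scoring sentences, best first."""
    return heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)


raw_scores = score_sentences(sentence_words)                    # [17, 22]
clean_scores = score_sentences(sentence_words, normalize=True)  # [16, 21]
best = top_n(raw_scores, 1)                                     # [1]
```

Note that a plain sum favors long sentences, since every token adds at least 1 to the score. Whether to divide by sentence length depends on what you plan to do with the ranking.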
