
How to get sum of word frequencies by sentence in a document?

I have a small article (document), and I have the word frequency of every token in it. Now I want to break the document into sentences and compute a score for each sentence, where 'score' is defined as the sum of the word frequencies of the words in that sentence.

For instance, take the following short article:

article = 'We encourage you to take time to read and understand the below information. The first section will help make sure that your investment objectives are still aligned with your current strategy.'

I get the word frequencies like this:

import nltk
from nltk import FreqDist

words = nltk.tokenize.word_tokenize(article)
fdist = FreqDist(words)

The solution must be simple, something like a lookup back into the article's tokens to get each score, but I can't seem to figure it out. Ideally the output would be something like sentScore = [7,5] so that I could easily pick out the top n sentences. Here sentScore is just the sum of word frequencies for each sentence (two sentences in this case).

Edit: I need these counts summed at the sentence level, and I'm currently splitting sentences using

sentences = nltk.tokenize.sent_tokenize(article)

which neatly handles tricky period/punctuation cases. Essentially, the frequencies should be computed at the article level, and then summed at the sentence level over the individual word frequencies.

Thanks!

Once you have the counts of all the words, you need to tokenize the article into sentences and then each sentence into words. Each sentence can then be reduced to the sum of its word counts.

import nltk
from collections import Counter

words = nltk.tokenize.word_tokenize(article)

# Counter builds a dictionary mapping each word to its count across the
# whole article (equivalent to the FreqDist in your code).
word_count = Counter(words)

sentences = nltk.tokenize.sent_tokenize(article)

# sentence_words is a list of lists: the article is tokenized into sentences
# and each sentence into words.
sentence_words = [nltk.tokenize.word_tokenize(sentence) for sentence in sentences]

# Score each sentence as the sum of its words' article-level counts.
sentence_scores = [sum(word_count[word] for word in sentence) for sentence in sentence_words]

For your example article, sentence_scores is [17, 22].
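
From there, picking out the top n sentences is just a matter of sorting indices by score. A minimal sketch, building on the sentences and sentence_scores lists above; n is a hypothetical cutoff you would choose yourself:

import heapq

n = 1  # hypothetical cutoff: how many top-scoring sentences to keep

# Take the indices of the n highest-scoring sentences, then restore
# document order so the selected sentences read naturally.
top_indices = heapq.nlargest(n, range(len(sentence_scores)), key=lambda i: sentence_scores[i])
top_sentences = [sentences[i] for i in sorted(top_indices)]

Sorting the surviving indices rather than the sentences themselves keeps the extracted sentences in their original document order, which usually matters for summarization.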
