
Use Python to print sentences containing the most common words in a document

I have a text document, and I am using regex and nltk to find the top 5 most common words in it. Now I have to print out the sentences that these words belong to; how do I do that? Further, I want to extend this to finding common words across multiple documents and returning their respective sentences.

import nltk
import re

document_text = open('test.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)  # all words with 3 to 15 characters

fdist = nltk.FreqDist(match_pattern)  # frequency distribution built from the word list
most_common = fdist.max()             # the single most frequent word
top_five = fdist.most_common(5)       # list of (word, count) tuples

list_5 = [word for (word, freq) in top_five]


print(top_five)
print(list_5)

Output:

[('you', 8), ('tuples', 8), ('the', 5), ('are', 5), ('pard', 5)]
['you', 'tuples', 'the', 'are', 'pard']

The output is the most commonly occurring words. Now I need to print the sentences these words appear in; how do I do that?

Although, unlike your regex, it doesn't account for special characters at word boundaries, the following would be a starting point:

for sentence in text_string.split('.'):
    # print the sentence if it shares at least one word with list_5
    if set(list_5) & set(sentence.split()):
        print(sentence)

We first iterate over the sentences, assuming each sentence ends with a . and that the . character appears nowhere else in the text. We then print a sentence if the intersection of its set of words with the set of words in your list_5 is not empty.
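Because splitting a sentence on spaces leaves punctuation attached to words ("words," won't equal "words"), you can reuse the regex from the question on each sentence instead. A minimal sketch, still assuming sentences are delimited by '.':

for sentence in text_string.split('.'):
    # extract words the same way the question's regex does, so trailing
    # commas, quotes, etc. don't prevent a match
    sentence_words = set(re.findall(r'\b[a-z]{3,15}\b', sentence))
    if sentence_words & set(list_5):
        print(sentence)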

You will have to install NLTK Data if you haven't already done so.

From http://www.nltk.org/data.html:

Run the Python interpreter and type the commands:

>>> import nltk
>>> nltk.download()

A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory.
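If you only need the punkt model, you can also skip the GUI entirely; nltk.download accepts a package name directly:

>>> import nltk
>>> nltk.download('punkt')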

In the GUI, install the punkt model from the Models tab. Once you have it, you can tokenize all sentences and extract the ones containing your top 5 words like so:

sent_tokenize_list = nltk.sent_tokenize(text_string)
for sentence in sent_tokenize_list:
    # any() stops at the first match, so a sentence that contains
    # several of the top words is only printed once
    if any(word in sentence for word in list_5):
        print(sentence)
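Note that word in sentence is a substring test, so "the" would also match inside "there". If that matters, compare against the sentence's own word set instead. The sketch below does that and also extends the idea to multiple documents, as the question asks; it reads the extension as "top n words per document plus their sentences" (the second file name is a placeholder), and if you instead want words shared across documents you could intersect the per-document top lists:

import re
import nltk
from nltk import FreqDist

def common_word_sentences(paths, n=5):
    """For each file, return its n most common words and the
    sentences they occur in (whole-word matches only)."""
    results = {}
    for path in paths:
        with open(path, 'r') as f:
            text = f.read().lower()
        words = re.findall(r'\b[a-z]{3,15}\b', text)
        top = [w for w, _ in FreqDist(words).most_common(n)]
        matches = []
        for sentence in nltk.sent_tokenize(text):
            # whole-word comparison: "the" will not match inside "there"
            sentence_words = set(re.findall(r'\b[a-z]{3,15}\b', sentence))
            if sentence_words & set(top):
                matches.append(sentence)
        results[path] = (top, matches)
    return results

for path, (top, matches) in common_word_sentences(['test.txt', 'other.txt']).items():
    print(path, top)
    for s in matches:
        print('  ', s)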
