Python's NLTK word frequency in sents

Sorry, but I find this difficult: I have code that computes basic word frequencies for a text and returns the result via "most_common". But that output is in terms of words.

import nltk

def sym(senten):
    # Requires the stopword corpus: nltk.download("stopwords")
    stopwords = nltk.corpus.stopwords.words("english")
    # Extra punctuation and words to ignore
    V = [",", ".", "'", "(", ")", '"', "'", ":", "it", "may", "I", "the", "but", "two", "this", "since", "for", "whether", "and", "?", "if", "even", "Mr.", "also", "at", "p."]
    content = [w for w in senten if w not in stopwords and w not in V]
    fd = nltk.FreqDist(content)  # count the remaining words
    fdc = fd.most_common(75)  # the 75 most frequent (word, count) pairs
    return fdc

For further analysis I need frequency code whose output is in sentences ("sents"). That is, the output should show the sentences, selected by the frequency of the words they contain.

One idea I had was to do it with "tabulate". Here is some code (as an example):

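S = ["proposition", "modus", "logic", "2"]  # the most frequent words (for example)
# Assumes senten is a sequence of tokenized sentences, each a tuple of words
# (conditions must be hashable).
cfd = nltk.ConditionalFreqDist(
    (sent, word)
    for sent in senten
    for word in sent)
cfd.tabulate(conditions=senten, samples=S)  # tabulate() prints the table itself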

It works, but the output contains too much pointless data about sentences that have no frequent words.

I would be grateful for any ideas that could solve my problem.

Do it in two steps. You already have code that finds the most frequent words, so that's good. Now build an index (a dictionary) that will tell you, for each word, which sentences contain it. So the keys in this dictionary should be words, and each value will be a list of whole sentences: basically the opposite of how you tried to do it. You'll add each sentence multiple times (don't worry, it doesn't actually get copied, so it's quite efficient).

This dictionary doesn't need to count anything; you'll just be looking words up. So you can use an ordinary dict, or use collections.defaultdict for convenience. The final step will be to obtain a list of the most common words using your current function, and for each such word you can simply request all sentences that contain it. Clear enough?
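
The two steps might look roughly like the sketch below. It is only an illustration under some assumptions: the sample sentences and the IGNORED set are invented, the names build_index and sentences_for_common_words are hypothetical, and collections.Counter stands in for nltk.FreqDist (which also offers most_common) so that the example runs without NLTK. Each sentence is assumed to be a tuple of word tokens.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: each sentence is a tuple of word tokens.
sentences = [
    ("logic", "studies", "valid", "inference"),
    ("modus", "ponens", "is", "a", "rule", "of", "logic"),
    ("the", "weather", "is", "nice"),
]
IGNORED = {"is", "a", "of", "the"}  # stand-in for the stopword list


def build_index(sents):
    """Map each word to the list of sentences that contain it."""
    index = defaultdict(list)
    for sent in sents:
        for word in set(sent):  # set() so a sentence is listed once per word
            index[word].append(sent)  # stores a reference, not a copy
    return index


def sentences_for_common_words(sents, n):
    """Return (word, count, sentences) for the n most frequent words."""
    counts = Counter(w for sent in sents for w in sent if w not in IGNORED)
    index = build_index(sents)
    return [(word, count, index[word]) for word, count in counts.most_common(n)]


result = sentences_for_common_words(sentences, 1)
for word, count, hits in result:
    print(word, count)
    for sent in hits:
        print("   ", " ".join(sent))
```

In your own code, the counting line would be replaced by a call to your existing function, and the index would be built once from the full list of tokenized sentences. Sentences that contain none of the frequent words never show up, because they are only reached through the words you look up.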
