
FreqDist for most common words OR phrases

I'm trying to analyze some data from app reviews.

I want to use nltk's FreqDist to find the most frequently occurring phrases in a file — a result can be a single token or a key phrase. I don't want to simply tokenize the data, because that would give me only the most frequent individual tokens. But right now, FreqDist is treating each review as one string instead of extracting the words within each review.

import string

import pandas as pd
import nltk

df = pd.read_csv('Positive.csv')

def pre_process(text):
    # Lowercase, strip whitespace, flatten newlines, and remove punctuation.
    translator = str.maketrans("", "", string.punctuation)
    text = text.lower().strip().replace("\n", " ").replace("’", "").translate(translator)
    return text

df['Description'] = df['Description'].map(pre_process)
df = df[df['Description'] != '']

# Counts each whole review string as one item, not the words/phrases inside it.
word_dist = nltk.FreqDist(df['Description'])

('Description' is the body/message of the reviews.)

For example, I want to get something like: Most frequent terms: "I like", "useful", "very good app". Instead I'm getting: Most frequent terms: "I really enjoy this app because blablabla" (the entire review).
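To see why this happens (a minimal sketch with made-up reviews): `FreqDist` accepts any iterable of hashable items, so iterating over a column of strings counts each whole string as one key, while iterating over tokens counts individual words.

```python
from nltk import FreqDist

reviews = ["very good app", "very good app", "useful"]

# Each review string is one hashable key, so FreqDist counts whole reviews:
print(FreqDist(reviews).most_common(1))  # [('very good app', 2)]

# Splitting into tokens first counts individual words instead
# (counts here: very=2, good=2, app=2, useful=1):
tokens = [tok for review in reviews for tok in review.split()]
print(FreqDist(tokens).most_common(3))
```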

And that's why when I'm plotting the FreqDist I get this:

[Plot: FreqDist bar chart in which each bar is an entire review string]

TL;DR

Use ngrams or everygrams:
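The difference, in a quick sketch: `ngrams(tokens, n)` yields n-grams of a single fixed length, while `everygrams(tokens, min_len, max_len)` yields every n-gram whose length falls in the given range.

```python
from nltk import ngrams, everygrams

tokens = ['this', 'app', 'is', 'useful']

# ngrams: one fixed length only (here, bigrams).
print(list(ngrams(tokens, 2)))

# everygrams: all lengths from 1 to 2 (unigrams and bigrams together).
print(list(everygrams(tokens, 1, 2)))
```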

>>> from itertools import chain
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> from nltk import FreqDist
>>> from nltk import everygrams

>>> df = pd.read_csv('x')
>>> df['Description']
0            Here is a sentence.
1    This is a foo bar sentence.
Name: Description, dtype: object

>>> df['Description'].map(word_tokenize)
0              [Here, is, a, sentence, .]
1    [This, is, a, foo, bar, sentence, .]
Name: Description, dtype: object

>>> sents = df['Description'].map(word_tokenize).tolist()

>>> FreqDist(list(chain(*[everygrams(sent, 1, 3) for sent in sents])))
FreqDist({('sentence',): 2, ('is', 'a'): 2, ('sentence', '.'): 2, ('is',): 2, ('.',): 2, ('a',): 2, ('Here', 'is', 'a'): 1, ('a', 'foo'): 1, ('a', 'sentence'): 1, ('bar', 'sentence', '.'): 1, ...})
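To present the results as readable phrases rather than tuples, join each n-gram with a space and take `.most_common()`. A minimal sketch repeating the pipeline above (the sentences are pre-tokenized here so the snippet runs without the punkt tokenizer data):

```python
from itertools import chain
from nltk import FreqDist, everygrams

# Same two sample sentences as above, already tokenized.
sents = [['Here', 'is', 'a', 'sentence', '.'],
         ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']]

# Count all 1-, 2- and 3-grams across all sentences.
fdist = FreqDist(chain.from_iterable(everygrams(sent, 1, 3) for sent in sents))

# Join each n-gram tuple back into a readable phrase.
for gram, count in fdist.most_common(5):
    print(' '.join(gram), count)
```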
