
FreqDist for most common words OR phrases

I'm trying to analyze some data from app reviews.

I want to use nltk's FreqDist to see the most frequently occurring phrases in a file. A phrase can be a single token or a key phrase. I don't want to tokenize the data, because that would give me only the most frequent tokens. But right now, FreqDist is processing each review as one string and is not extracting the words in each review.

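import string

import nltk
import pandas as pd

df = pd.read_csv('Positive.csv')

def pre_process(text):
    # Lowercase, flatten newlines, and strip punctuation (including the curly apostrophe).
    translator = str.maketrans("", "", string.punctuation)
    text = text.lower().strip().replace("\n", " ").replace("’", "").translate(translator)
    return text

df['Description'] = df['Description'].map(pre_process)
df = df[df['Description'] != '']

# Each item passed to FreqDist here is a whole review string.
word_dist = nltk.FreqDist(df['Description'])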

('Description' is the body/message of the reviews.)

For example, I want to get something like Most Frequent terms: "I like", "useful", "very good app". But instead I'm getting Most Frequent terms: "I really enjoy this app because bablabla" (the entire review).

And that's why, when I plot the FreqDist, I get this:

[image: enter image description here]
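
The behaviour described in the question can be reproduced without the CSV file. The sketch below is only an illustration: the `reviews` list is a hypothetical stand-in for the cleaned `df['Description']` column, and whitespace splitting is used as a simple placeholder for a real tokenizer. It shows that FreqDist counts whatever items it is handed, so a sequence of whole review strings yields counts of whole reviews.

```python
from nltk import FreqDist

# Hypothetical stand-in for df['Description'] after pre_process().
reviews = [
    "i like this app",
    "very good app",
    "i like this app",
]

# FreqDist counts the items of the iterable it receives. Each item here is
# a whole review string, so whole reviews are what get counted.
whole_review_dist = FreqDist(reviews)
top_review = whole_review_dist.most_common(1)

# Splitting each review into tokens first gives per-word counts instead.
# str.split() is an assumption made for brevity, not a full tokenizer.
tokens = [tok for review in reviews for tok in review.split()]
token_dist = FreqDist(tokens)

print(top_review)
print(token_dist.most_common(3))
```

Counting phrases as well as single words therefore needs a step that turns each review into word sequences of several lengths before the counts are taken.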

TL;DR

Use ngrams or everygrams:

>>> from itertools import chain
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> from nltk import FreqDist
>>> from nltk.util import everygrams

>>> df = pd.read_csv('x')
>>> df['Description']
0            Here is a sentence.
1    This is a foo bar sentence.
Name: Description, dtype: object

>>> df['Description'].map(word_tokenize)
0              [Here, is, a, sentence, .]
1    [This, is, a, foo, bar, sentence, .]
Name: Description, dtype: object

>>> sents = df['Description'].map(word_tokenize).tolist()

>>> FreqDist(chain.from_iterable(everygrams(sent, 1, 3) for sent in sents))
FreqDist({('sentence',): 2, ('is', 'a'): 2, ('sentence', '.'): 2, ('is',): 2, ('.',): 2, ('a',): 2, ('Here', 'is', 'a'): 1, ('a', 'foo'): 1, ('a', 'sentence'): 1, ('bar', 'sentence', '.'): 1, ...})
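
To get from the FreqDist above to a ranked list of readable phrases, something like the following sketch may help. It is a minimal example under stated assumptions: the `reviews` list is made-up sample data, `str.split()` replaces `word_tokenize` so that no tokenizer data needs to be downloaded, and the names `phrase_dist`, `top_phrases` and `multiword` are illustrative only.

```python
from itertools import chain

from nltk import FreqDist
from nltk.util import everygrams

# Made-up sample data standing in for the cleaned review texts.
reviews = [
    "i like this app",
    "very good app",
    "i like it very good app",
]

# Whitespace tokenization (an assumption; word_tokenize could be used instead).
sents = [review.split() for review in reviews]

# Count every 1-, 2- and 3-gram across all reviews.
phrase_dist = FreqDist(
    chain.from_iterable(everygrams(sent, 1, 3) for sent in sents)
)

# Join the n-gram tuples back into strings for display.
top_phrases = [
    (" ".join(gram), count) for gram, count in phrase_dist.most_common(10)
]

# Optionally keep only multi-word phrases (n-grams longer than one token).
multiword = FreqDist(
    {gram: count for gram, count in phrase_dist.items() if len(gram) > 1}
)

print(top_phrases)
print(multiword.most_common(3))
```

Single tokens will usually dominate the top of a combined list, because every phrase also contributes its individual words, so filtering by n-gram length as in `multiword` is one way to surface the phrases.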
