FreqDist for most common words or phrases

Question

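I'm trying to analyze some data from app reviews.

I want to use nltk's FreqDist to see the most frequently occurring phrases in a file. A phrase can be a single token or a key phrase. I don't want to tokenize the data, because that would only give me the most frequent tokens. But right now the FreqDist function treats each review as one string instead of extracting the words inside each review.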

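import string

import nltk
import pandas as pd

df = pd.read_csv('Positive.csv')

def pre_process(text):
    # Lowercase, trim, flatten newlines, and strip punctuation (including the curly apostrophe).
    translator = str.maketrans("", "", string.punctuation)
    text = text.lower().strip().replace("\n", " ").replace("’", "").translate(translator)
    return text

df['Description'] = df['Description'].map(pre_process)
df = df[df['Description'] != '']

# Iterating over the column yields one whole review string per row.
word_dist = nltk.FreqDist(df['Description'])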

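("Description" is the body/message of the reviews.)

For example, I want to get the most frequently used terms, such as "I like", "useful", or "very good app", but instead the most frequent term I get is "I really love this app because bablabla" (the whole review).

That is why plotting the FreqDist shows entire reviews as the items rather than individual words or phrases.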

Answer 1

TL;DR

Use ngrams or everygrams:

>>> from itertools import chain
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> from nltk import FreqDist
>>> from nltk.util import everygrams

>>> df = pd.read_csv('x')
>>> df['Description']
0            Here is a sentence.
1    This is a foo bar sentence.
Name: Description, dtype: object

>>> df['Description'].map(word_tokenize)
0              [Here, is, a, sentence, .]
1    [This, is, a, foo, bar, sentence, .]
Name: Description, dtype: object

>>> sents = df['Description'].map(word_tokenize).tolist()

>>> FreqDist(list(chain(*[everygrams(sent, 1, 3) for sent in sents])))
FreqDist({('sentence',): 2, ('is', 'a'): 2, ('sentence', '.'): 2, ('is',): 2, ('.',): 2, ('a',): 2, ('Here', 'is', 'a'): 1, ('a', 'foo'): 1, ('a', 'sentence'): 1, ('bar', 'sentence', '.'): 1, ...})
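
To make the difference concrete, here is a minimal sketch that uses only the standard library. It relies on the fact that nltk's FreqDist is a subclass of collections.Counter, so the counting behavior is the same. The `every_ngrams` helper and the three sample reviews are hypothetical, made up for illustration; `str.split()` is a simplification standing in for `word_tokenize`. With nltk installed, you would presumably swap in `FreqDist` and `nltk.util.everygrams` as shown above.

```python
from collections import Counter
from itertools import chain


def every_ngrams(tokens, min_len=1, max_len=3):
    """Yield every n-gram of length min_len..max_len as a tuple.

    Hypothetical stand-in for nltk.util.everygrams, written out so the
    example runs without nltk.
    """
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


# Made-up reviews, already lowercased and stripped of punctuation,
# roughly what pre_process in the question would produce.
reviews = [
    "i love this app",
    "i love the design",
    "very useful app",
]

# Counting the strings themselves: each whole review is a single item,
# which is the behavior described in the question.
whole_review_counts = Counter(reviews)

# Split each review into tokens first (assumption: whitespace splitting is
# good enough here), then count all 1- to 3-grams across all reviews.
tokenized = [review.split() for review in reviews]
phrase_counts = Counter(chain.from_iterable(every_ngrams(t) for t in tokenized))

# Optionally keep only multi-word phrases if single tokens are not of interest.
multiword = Counter({gram: c for gram, c in phrase_counts.items() if len(gram) > 1})

print(whole_review_counts.most_common(1))
print(phrase_counts.most_common(5))
print(multiword.most_common(1))
```

In this toy data every whole review occurs exactly once, while the phrase `('i', 'love')` is counted twice, which is the kind of result the question is after. Filtering by tuple length is one possible design choice; passing a larger minimum length to the n-gram function would achieve the same thing.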

Answer 1 posted 2019-05-24 22:10:52
