pandas and nltk: get the most common phrases

Question

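I'm fairly new to Python and am working with a pandas DataFrame that has a column full of text. I'm trying to take that column and use nltk to find common phrases (three or four words).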

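from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

dat["text_clean"] = dat["Description"].str.replace('[^\w\s]','').str.lower()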

dat["text_clean2"] = dat["text_clean"].apply(word_tokenize)

finder = BigramCollocationFinder.from_words(dat["text_clean2"])
finder
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 10 n-grams with the highest PMI
print finder.nbest(bigram_measures.pmi, 10)

The initial commands seem to work fine. However, when I try to use BigramCollocationFinder, it raises the following error.

In [437]: finder = BigramCollocationFinder.from_words(dat["text_clean2"])
finder

Traceback (most recent call last):

  File "<ipython-input-437-635c3b3afaf4>", line 1, in <module>
    finder = BigramCollocationFinder.from_words(dat["text_clean2"])

  File "/Users/abrahammathew/anaconda/lib/python2.7/site-packages/nltk/collocations.py", line 168, in from_words
    wfd[w1] += 1

TypeError: unhashable type: 'list'

Any idea what this error refers to, or how to fix it?

The following commands produce the same error.

gg = dat["text_clean2"].tolist()    
finder = BigramCollocationFinder.from_words(gg)
finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))

The following runs, but returns no common phrases.

gg = dat["Description"].str.replace('[^\w\s]','').str.lower()
finder = BigramCollocationFinder.from_words(gg)
finder
# only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return the 10 n-grams with the highest PMI
print finder.nbest(bigram_measures.pmi, 10)

Answer 1

It looks like BigramCollocationFinder expects a list of words, not a list of lists. Try this:

finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))
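If a flat list of words is what from_words needs, one way to get there is to flatten the per-row token lists first. This is only a minimal sketch, assuming Python 3; the small `docs` list is made up and stands in for dat["text_clean2"].tolist(). Note that flattening lets a bigram span the boundary between two rows, which may or may not be what you want.

```python
from itertools import chain

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical stand-in for dat["text_clean2"].tolist(): one token list per row.
docs = [
    ["new", "york", "city", "is", "big"],
    ["i", "love", "new", "york", "city"],
    ["new", "york", "city", "never", "sleeps"],
]

# Flatten the list of lists into a single flat list of words.
words = list(chain.from_iterable(docs))

finder = BigramCollocationFinder.from_words(words)
# Keep only bigrams that appear at least 3 times.
finder.apply_freq_filter(3)
# Rank the remaining bigrams by PMI and take up to 10 of them.
top_bigrams = finder.nbest(BigramAssocMeasures().pmi, 10)
print(top_bigrams)
```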

Answer 2

You may have to convert the list of lists into a list of tuples. Hope this works:

dat['text_clean2'] = [tuple(x) for x in dat['text_clean2']]
finder = BigramCollocationFinder.from_words(dat["text_clean2"])

Answer 3

CollocationFinder.from_words is meant for a single document. You want to use from_documents instead:

finder = BigramCollocationFinder.from_documents(gg)
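Here is a rough sketch of how that could look end to end, assuming Python 3 and a made-up `docs` list standing in for dat["text_clean2"].tolist() (one token list per row). Since the question asks about three-word phrases, it also shows TrigramCollocationFinder, which has the same from_documents constructor. Depending on the NLTK version, from_documents either pads each document so n-grams do not cross row boundaries or simply chains the documents together, so results near row boundaries can differ.

```python
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)

# Hypothetical stand-in for dat["text_clean2"].tolist(): one token list per row.
docs = [
    ["new", "york", "city", "is", "big"],
    ["i", "love", "new", "york", "city"],
    ["new", "york", "city", "never", "sleeps"],
]

# Two-word phrases: pass the list of token lists directly.
bigram_finder = BigramCollocationFinder.from_documents(docs)
bigram_finder.apply_freq_filter(3)  # keep bigrams seen at least 3 times
bigram_top = bigram_finder.nbest(BigramAssocMeasures().pmi, 10)
print(bigram_top)

# Three-word phrases: same pattern with the trigram classes.
trigram_finder = TrigramCollocationFinder.from_documents(docs)
trigram_finder.apply_freq_filter(3)  # keep trigrams seen at least 3 times
trigram_top = trigram_finder.nbest(TrigramAssocMeasures().pmi, 10)
print(trigram_top)
```

On a real column the frequency threshold would need tuning; with short descriptions, a filter of 3 can easily leave nothing.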

Answer 1: 1 2017-07-25 14:59:52
Answer 2: 1 2017-07-25 15:15:04
Answer 3: 0 2018-06-22 17:42:13
