pandas and nltk: get most common phrases

I'm fairly new to Python and I'm working with pandas data frames that have a column full of text. I'm trying to take that column and use nltk to find common phrases (three or four words long).

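from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# strip punctuation and lowercase (regex=True is needed on pandas 2.0+)
dat["text_clean"] = dat["Description"].str.replace(r'[^\w\s]', '', regex=True).str.lower()

# tokenize each description into a list of words
dat["text_clean2"] = dat["text_clean"].apply(word_tokenize)

# this is the line that raises the TypeError
finder = BigramCollocationFinder.from_words(dat["text_clean2"])
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))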

The initial commands seem to work fine. However, when I attempt to use BigramCollocationFinder, it throws the following error.

In [437]: finder = BigramCollocationFinder.from_words(dat["text_clean2"])
finder

Traceback (most recent call last):

  File "<ipython-input-437-635c3b3afaf4>", line 1, in <module>
    finder = BigramCollocationFinder.from_words(dat["text_clean2"])

  File "/Users/abrahammathew/anaconda/lib/python2.7/site-packages/nltk/collocations.py", line 168, in from_words
    wfd[w1] += 1

TypeError: unhashable type: 'list'

Any idea what this error refers to, or is there a workaround?

The same error occurs with the following commands.

gg = dat["text_clean2"].tolist()    
finder = BigramCollocationFinder.from_words(gg)
finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))

The following runs without an error, but reports that there are no common phrases.

gg = dat["Description"].str.replace(r'[^\w\s]', '', regex=True).str.lower()
finder = BigramCollocationFinder.from_words(gg)
# only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return the 10 n-grams with the highest PMI
print(finder.nbest(bigram_measures.pmi, 10))

It would seem that BigramCollocationFinder.from_words wants a list of words, not a list of lists. Try this:

finder = BigramCollocationFinder.from_words(dat["text_clean2"].values.reshape(-1, ))
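The "list of words, not a list of lists" point can be reproduced without NLTK. The traceback stops at `wfd[w1] += 1`, where each element of the input is used as a key in a frequency counter, and a list cannot be a dictionary key. Below is a minimal sketch, assuming `collections.Counter` as a stand-in for NLTK's frequency distribution and using made-up token lists. It also shows one way to flatten the nested lists into a single sequence of words.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized column: one list of tokens per row.
docs = [["new", "york", "city"], ["new", "york", "times"]]

# Counting the rows themselves fails, because each "word" is a list.
wfd = Counter()
try:
    for w1 in docs:
        wfd[w1] += 1
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Flattening gives one sequence of hashable strings, which can be counted.
flat = list(chain.from_iterable(docs))
word_counts = Counter(flat)
print(word_counts)
```

Note that a flattened list joins all rows into one document, so a finder built from it would also count bigrams that span the boundary between two rows.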

You might have to convert the list of lists into a list of tuples. Hope this works.

dat['text_clean2'] = [tuple(x) for x in dat['text_clean2']]
finder = BigramCollocationFinder.from_words(dat["text_clean2"])
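A small check of what the tuple conversion does, using made-up token tuples rather than the original data: tuples are hashable, so `from_words` no longer raises, but each whole row is then counted as a single "word", and the bigrams are pairs of adjacent rows rather than pairs of words.

```python
from nltk.collocations import BigramCollocationFinder

# Hypothetical rows, already converted to tuples of tokens.
docs = [
    ("new", "york", "city"),
    ("new", "york", "times"),
    ("i", "love", "new", "york"),
]

finder = BigramCollocationFinder.from_words(docs)

# Each "word" in the frequency distribution is an entire row.
print(list(finder.word_fd))
# Each "bigram" is a pair of adjacent rows, not a pair of words.
print(list(finder.ngram_fd))
```

With real descriptions, identical adjacent rows are rare, so a frequency filter is likely to leave nothing behind.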

BigramCollocationFinder.from_words is for a single document, that is, one flat sequence of words. For a collection of tokenized documents, you want to use from_documents:

finder = BigramCollocationFinder.from_documents(gg)
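A self-contained sketch of `from_documents`, using three made-up tokenized sentences in place of the data frame column. It also builds a `TrigramCollocationFinder` the same way, since the question asks about three-word phrases. The sample data and variable names are assumptions for illustration.

```python
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)

# Hypothetical documents: one list of tokens per row.
docs = [
    "new york city is big".split(),
    "i love new york city".split(),
    "new york city never sleeps".split(),
]

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# Bigrams are counted within each document, never across documents.
bigram_finder = BigramCollocationFinder.from_documents(docs)
bigram_finder.apply_freq_filter(3)  # keep bigrams seen 3+ times
top_bigrams = bigram_finder.nbest(bigram_measures.pmi, 10)
print(top_bigrams)

# The same pattern for three-word phrases.
trigram_finder = TrigramCollocationFinder.from_documents(docs)
trigram_finder.apply_freq_filter(3)  # keep trigrams seen 3+ times
top_trigrams = trigram_finder.nbest(trigram_measures.pmi, 10)
print(top_trigrams)
```

A pandas column of token lists, such as `dat["text_clean2"]` from the question, should be usable in place of `docs`, since `from_documents` only needs an iterable of token sequences.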
