
Python Pandas NLTK Extract Common Phrases (ngrams) From Text Field in Dataframe 'join() argument' Error

I have the following sample dataframe:

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

The 'problem_definition_stopwords' field has already been tokenized, with stop words removed.

I want to create n-grams from the 'problem_definition_stopwords' field. Specifically, I want to extract n-grams from my data and find the ones that have the highest pointwise mutual information (PMI).

Essentially I want to find the words that co-occur together much more than I would expect them to by chance.
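For reference, the PMI that BigramAssocMeasures.pmi ranks by compares a pair's joint probability against what independence would predict: PMI(x, y) = log2(p(x, y) / (p(x) * p(y))). A minimal sketch with made-up numbers (the probabilities below are purely illustrative, not taken from the sample data):

import math

# toy probabilities for a bigram ('coffee', 'maker') -- illustrative only
p_xy = 0.01            # joint probability of the pair
p_x, p_y = 0.02, 0.03  # marginal probabilities of each word

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # 4.06: the pair co-occurs ~17x more often than chance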

I tried the following code:

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# errored out here 
finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(df['problem_definition_stopwords']))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3) 

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10) 

The error I received was on the third chunk of code ... TypeError: join() argument must be str or bytes, not 'list'

Edit: a more portable format for the DataFrame:

>>> df.columns
Index(['No', 'category', 'problem_definition_stopwords'], dtype='object')
>>> df.to_dict()
{'No': {0: 175, 1: 211, 2: 912, 3: 572}, 'category': {0: 2521, 1: 1438, 2: 2698, 3: 2521}, 'problem_definition_stopwords': {0: ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'], 1: ['galley', 'work', 'table', 'stuck'], 2: ['cloth', 'stuck'], 3: ['stuck', 'coffee']}}
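If you want to rebuild the sample locally, that dict feeds straight back into pandas (a quick sketch):

import pandas as pd

# reconstruct the sample DataFrame from the to_dict() output above
df = pd.DataFrame({
    'No': {0: 175, 1: 211, 2: 912, 3: 572},
    'category': {0: 2521, 1: 1438, 2: 2698, 3: 2521},
    'problem_definition_stopwords': {
        0: ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        1: ['galley', 'work', 'table', 'stuck'],
        2: ['cloth', 'stuck'],
        3: ['stuck', 'coffee'],
    },
})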

It doesn't look like you're using the from_words call in the right way; take a look at help(nltk.corpus.genesis.words):

Help on method words in module nltk.corpus.reader.plaintext:

words(fileids=None) method of nltk.corpus.reader.plaintext.PlaintextCorpusReader instance
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
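In other words, words() takes fileids belonging to the corpus itself, not your own data. My reading of the traceback is that when you hand it a column of token lists, the corpus reader tries to join each entry into a file path, which is where TypeError: join() argument must be str or bytes, not 'list' comes from. The call it was designed for looks more like this (assuming the genesis corpus has been downloaded):

import nltk

nltk.download('genesis')  # no-op if the corpus is already present

# words() resolves corpus fileids, e.g. the English web translation
print(nltk.corpus.genesis.words('english-web.txt')[:5])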

Is this what you're looking for? Since you've already represented your documents as lists of strings, which in my experience plays nicely with NLTK, I think you can use the from_documents method:

finder = BigramCollocationFinder.from_documents(
    df['problem_definition_stopwords']
)

# only bigrams that appear 3+ times
# Note, I limited this to 1 since the corpus you provided
# is very small and it'll be tough to find repeat ngrams
finder.apply_freq_filter(1) 

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10) 

[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]
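As an aside, the trigram_measures you defined never gets used; if you also want trigrams, the same from_documents pattern should work (an untested sketch on the same tiny sample, reusing trigram_measures from your first block):

from nltk.collocations import TrigramCollocationFinder

tri_finder = TrigramCollocationFinder.from_documents(
    df['problem_definition_stopwords']
)

# again a frequency filter of 1, since the sample corpus is tiny
tri_finder.apply_freq_filter(1)

# the 10 trigrams with the highest PMI
tri_finder.nbest(trigram_measures.pmi, 10)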
