
Pandas NLTK tokenizing "unhashable type: 'list'"

Following this example: Twitter data mining with Python and Gephi: Case synthetic biology

CSV read into df with columns 'Country' and 'Responses':

'Country'
Italy
Italy
France
Germany

'Responses' 
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
The goal is to:

  1. tokenize the text in 'Responses'
  2. remove the 100 most common words (based on brown.corpus)
  3. identify the remaining 100 most frequent words

I can get through steps 1 and 2, but get an error on step 3:

TypeError: unhashable type: 'list'

I believe it's because I'm working in a DataFrame and have made this (likely erroneous) modification:

Original example:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(tweets)

My code:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

My full code:

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

words =  df['tokenized_sents']

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
 ',',
 '.',
 'of',
 'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

TypeError: unhashable type: 'list'

There are many questions on unhashable lists, but none that I understand to be quite the same. Any suggestions? Thanks.


TRACEBACK

TypeError                                 Traceback (most recent call last)
<ipython-input-164-a0d17b850b10> in <module>()
  1 #keep only most common words
----> 2 fdist = FreqDist(words)
  3 mostcommon = fdist.most_common(100)
  4 mclist = []
  5 for i in range(len(mostcommon)):

/home/*******/anaconda3/envs/*******/lib/python3.5/site-packages/nltk/probability.py in __init__(self, samples)
    104         :type samples: Sequence
    105         """
--> 106         Counter.__init__(self, samples)
    107 
    108     def N(self):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in __init__(*args, **kwds)
    521             raise TypeError('expected at most 1 arguments, got %d' % len(args))
    522         super(Counter, self).__init__()
--> 523         self.update(*args, **kwds)
    524 
    525     def __missing__(self, key):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in update(*args, **kwds)
    608                     super(Counter, self).update(iterable) # fast path when counter is empty
    609             else:
--> 610                 _count_elements(self, iterable)
    611         if kwds:
    612             self.update(kwds)

TypeError: unhashable type: 'list'

The FreqDist function takes in an iterable of hashable objects (meant to be strings, but it probably works with anything hashable). The error you're getting is because you pass in an iterable of lists. As you suggested, this is because of the change you made:

df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

If I understand the Pandas apply function documentation correctly, that line applies the nltk.word_tokenize function to the series. word_tokenize returns a list of words, so the resulting column holds one list per row.
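To see this concretely, here is a minimal sketch (using a made-up toy DataFrame, not your data) showing that the column holds lists, which FreqDist cannot count because lists are not hashable:

import pandas as pd
import nltk
from nltk import FreqDist

nltk.download('punkt', quiet=True)  # tokenizer models, if not already present

toy = pd.DataFrame({'Responses': ["Lorem ipsum dolor.", "Sit amet."]})
toy['tokenized_sents'] = toy['Responses'].apply(nltk.word_tokenize)

print(toy['tokenized_sents'].iloc[0])        # ['Lorem', 'ipsum', 'dolor', '.']
print(type(toy['tokenized_sents'].iloc[0]))  # <class 'list'>
# FreqDist(toy['tokenized_sents'])  # would raise TypeError: unhashable type: 'list'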

As a solution, simply add the lists together before trying to apply FreqDist, like so:

allWords = []
for wordList in words:
    allWords += wordList  # concatenate each row's token list into one flat list
FreqDist(allWords)
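Equivalently, the flattening can be done with itertools (a standard-library idiom, not part of the original answer), which avoids growing the list by repeated concatenation:

from itertools import chain

allWords = list(chain.from_iterable(words))  # flatten the list of token lists
FreqDist(allWords)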

Here is a more complete revision to do what you would like. If all you need is to identify the second set of 100 words, note that mclist will contain exactly that the second time through.

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

lists =  df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
 ',',
 '.',
 'of',
 'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist contains second-most common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
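One optional side note (a performance tweak, not required for correctness): w in mclist scans the whole list on every check, so for large corpora it is faster to test membership against a set:

mcset = set(mclist)                       # hash-based membership lookups
words = [w for w in words if w in mcset]  # same result, much faster on big inputs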
