
CountVectorizer throws error on fit_transform after adding stop words

I have two sections of code. One works, and one does not.

The following code runs as expected without error. (Note: postrain, negtrain, postest, and negtest are lists of strings defined earlier.)

from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer()
train_vector = vector.fit_transform(postrain+negtrain)
test_vector = vector.transform(postest+negtest)
print test_vector.shape

However, this code throws an error:

import re

stop = [re.split('\n|\t', open('stop_words.txt').read())]

vector2 = CountVectorizer(stop_words=stop)
train_vector = vector2.fit_transform(postrain+negtrain) # <-- Error occurs here
test_vector = vector2.transform(postest+negtest)
print test_vector.shape

The error:

TypeErrorTraceback (most recent call last)
<ipython-input-43-cf5f4754d58c> in <module>()
      7 
      8 vector2 = CountVectorizer(stop_words=stop)
----> 9 train_vector = vector2.fit_transform(postrain+negtrain)
     10 test_vector = vector2.transform(postest+negtest)
     11 

C:\Users\Nsth\Anaconda2\envs\cs489\lib\site-packages\sklearn\feature_extraction\text.pyc in fit_transform(self, raw_documents, y)
    815 
    816         vocabulary, X = self._count_vocab(raw_documents,
--> 817                                           self.fixed_vocabulary_)
    818 
    819         if self.binary:

C:\Users\Nsth\Anaconda2\envs\cs489\lib\site-packages\sklearn\feature_extraction\text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    745             vocabulary.default_factory = vocabulary.__len__
    746 
--> 747         analyze = self.build_analyzer()
    748         j_indices = _make_int_array()
    749         indptr = _make_int_array()

C:\Users\Nsth\Anaconda2\envs\cs489\lib\site-packages\sklearn\feature_extraction\text.pyc in build_analyzer(self)
    232 
    233         elif self.analyzer == 'word':
--> 234             stop_words = self.get_stop_words()
    235             tokenize = self.build_tokenizer()
    236 

C:\Users\Nsth\Anaconda2\envs\cs489\lib\site-packages\sklearn\feature_extraction\text.pyc in get_stop_words(self)
    215     def get_stop_words(self):
    216         """Build or fetch the effective stop words list"""
--> 217         return _check_stop_list(self.stop_words)
    218 
    219     def build_analyzer(self):

C:\Users\Nsth\Anaconda2\envs\cs489\lib\site-packages\sklearn\feature_extraction\text.pyc in _check_stop_list(stop)
     92         return None
     93     else:               # assume it's a collection
---> 94         return frozenset(stop)
     95 
     96 

TypeError: unhashable type: 'list'

How did adding stop words cause the error?
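The last frame of the traceback points at the cause: _check_stop_list falls through to frozenset(stop), and every element of a frozenset must be hashable. Because stop was wrapped in an extra pair of brackets, its single element was itself a list, and lists are unhashable. A minimal sketch of that failure, independent of scikit-learn:

```python
# frozenset() must hash each element; a nested list is unhashable.
words = [['the', 'a', 'an']]          # a list wrapped in another list

try:
    frozenset(words)                  # mimics sklearn's _check_stop_list
except TypeError as e:
    print(e)                          # unhashable type: 'list'

ok = frozenset(['the', 'a', 'an'])    # a flat list of strings works fine
print(sorted(ok))
```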

I'm dumb. It should have been:

stop = re.split('\n|\t', open('stop_words.txt').read())

without the brackets. I'm not sure why it threw the error on a later line rather than where the list was defined, though.
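The reason the error surfaced at fit_transform rather than at construction is that scikit-learn estimators deliberately store constructor arguments unmodified and defer validation until fit time; as the traceback shows, fit_transform calls build_analyzer, which calls get_stop_words, which is the first place the stop_words value is actually inspected. A toy analogue of that deferred-validation pattern, a sketch that does not use scikit-learn itself:

```python
# Sketch of deferred parameter validation: the bad value is accepted
# silently in __init__ and only blows up when it is first used at fit time.
class ToyVectorizer:
    def __init__(self, stop_words=None):
        self.stop_words = stop_words                # stored as-is, no check

    def fit_transform(self, docs):
        # Validation happens here, mirroring sklearn's _check_stop_list.
        stop = frozenset(self.stop_words or [])
        return [[w for w in d.split() if w not in stop] for d in docs]

tv = ToyVectorizer(stop_words=[['the']])            # bad value, no error yet
try:
    tv.fit_transform(['the cat sat'])               # TypeError surfaces here
except TypeError as e:
    print(e)                                        # unhashable type: 'list'
```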

