Unicode warning when using NLTK stopwords with scikit-learn's TfidfVectorizer

Question

I am trying to use the TfidfVectorizer from scikit-learn with the Spanish stopwords from NLTK:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))

The problem is that I get the following warning:

/home/---/.virtualenvs/thesis/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:122: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
tokens = [w for w in tokens if w not in stop_words]

Is there an easy way to solve this issue?

Answer 1

Actually, the problem was easier to solve than I thought. The issue is that NLTK returns str objects (byte strings under Python 2) rather than unicode objects, so I needed to decode them from UTF-8 before using them:

# Use a new name so the imported nltk `stopwords` module is not shadowed
spanish_stopwords = [word.decode('utf-8') for word in stopwords.words('spanish')]
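For anyone who wants the decoding step to tolerate lists that already contain text, here is a minimal sketch. The helper name to_unicode and the sample list are hypothetical (the sample stands in for the output of stopwords.words('spanish')); it only assumes the byte strings are UTF-8 encoded, as described above.

```python
def to_unicode(words, encoding="utf-8"):
    """Return the words as text, decoding any byte strings.

    Items that are already text are passed through unchanged.
    """
    return [w.decode(encoding) if isinstance(w, bytes) else w for w in words]


# Hypothetical sample standing in for stopwords.words('spanish'):
# two UTF-8 byte strings and one item that is already text.
raw_stopwords = [b"qu\xc3\xa9", b"de", u"tambi\u00e9n"]

spanish_stopwords = to_unicode(raw_stopwords)
```

The resulting list can then be passed as the stop_words argument of TfidfVectorizer in place of the raw NLTK list.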

Solution 1
5 2014-08-22 11:20:50
