I am trying to use the TfidfVectorizer from scikit-learn with the Spanish stopwords from NLTK:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))
The problem is that I get the following warning:
/home/---/.virtualenvs/thesis/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:122: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
tokens = [w for w in tokens if w not in stop_words]
Is there an easy way to solve this issue?
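Actually, the problem was easier to solve than I thought. Under Python 2, NLTK returns the stopwords as str (byte string) objects, not unicode objects, so comparing them with the unicode tokens fails for words that contain non-ASCII characters. I needed to decode them from UTF-8 before using them: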
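# Use a new name so the imported nltk `stopwords` module is not shadowed
spanish_stopwords = [word.decode('utf-8') for word in stopwords.words('spanish')]
vectorizer = TfidfVectorizer(stop_words=spanish_stopwords)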
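If the same code has to run where the stopwords may already be text (for example under Python 3, where calling .decode on a str raises an error), a small guard can help. This is only a minimal sketch: the helper name to_unicode and the sample list are made up for illustration, and the sample stands in for the output of stopwords.words('spanish').

```python
def to_unicode(words, encoding="utf-8"):
    """Return the words as text, decoding only the byte strings."""
    # `bytes` is an alias of `str` on Python 2, so this check works on both versions.
    return [w.decode(encoding) if isinstance(w, bytes) else w for w in words]


# Hypothetical sample mixing UTF-8 byte strings and text.
sample = [b"est\xc3\xa1", u"m\u00e1s", b"de"]
normalized = to_unicode(sample)
```

The result can then be passed as stop_words to TfidfVectorizer, exactly like the decoded list above.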