Text Classification Using spaCy
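I am trying to do some text classification with spaCy, but I get an error saying that my vocabulary is empty.
I tried a classic dataset and got the same error. I have seen suggestions to split the text into parts, but I have many rows and none of them is very long.
Here is the code: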
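import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# spacy_tokenizer, predictors, X_train and y_train are defined elsewhere (not shown here).
df_amazon = pd.read_csv("amazon_alexa.tsv", sep="\t")
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
classifier = LogisticRegression()
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifier", classifier)])
pipe.fit(X_train, y_train)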
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-91-b5a14e655d5a> in <module>
10
11 # Model generation
---> 12 pipe.fit(X_train, y_train)
~\anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
339 """
340 fit_params_steps = self._check_fit_params(**fit_params)
--> 341 Xt = self._fit(X, y, **fit_params_steps)
342 with _print_elapsed_time('Pipeline',
343 self._log_message(len(self.steps) - 1)):
~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
301 cloned_transformer = clone(transformer)
302 # Fit or load from cache the current transformer
--> 303 X, fitted_transformer = fit_transform_one_cached(
304 cloned_transformer, X, y, None,
305 message_clsname='Pipeline',
~\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752 with _print_elapsed_time(message_clsname, message):
753 if hasattr(transformer, 'fit_transform'):
--> 754 res = transformer.fit_transform(X, y, **fit_params)
755 else:
756 res = transformer.fit(X, y, **fit_params).transform(X)
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1200 max_features = self.max_features
1201
-> 1202 vocabulary, X = self._count_vocab(raw_documents,
1203 self.fixed_vocabulary_)
1204
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1131 vocabulary = dict(vocabulary)
1132 if not vocabulary:
-> 1133 raise ValueError("empty vocabulary; perhaps the documents only"
1134 " contain stop words")
1135
ValueError: empty vocabulary; perhaps the documents only contain stop words
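It looks like you are only using the spaCy tokenizer? I'm not sure what is going on, but you should check the output of the tokenizer on your documents.
Note that while I think you can use the tokenizer that way, it is more typical to use a blank pipeline, like this: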
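One way to act on that suggestion is to run the tokenizer by itself on a few texts and count how many come back empty. The helper below is only a minimal sketch: `report_tokenizer_output` and `strict_tokenizer` are hypothetical names, and `strict_tokenizer` is a made-up stand-in for the `spacy_tokenizer` from the question, whose definition is not shown.

```python
def report_tokenizer_output(tokenizer, documents, preview=3):
    """Run `tokenizer` on each document and summarize what it returns."""
    results = [(doc, list(tokenizer(doc))) for doc in documents]
    empty = [doc for doc, tokens in results if not tokens]
    vocabulary = {tok for _, tokens in results for tok in tokens}

    # Show a few raw examples so over-filtering is easy to spot
    for doc, tokens in results[:preview]:
        print(repr(doc), "->", tokens)
    print(f"{len(empty)} of {len(results)} documents produced no tokens")
    print(f"{len(vocabulary)} distinct tokens in total")
    return {"empty": len(empty), "total": len(results), "vocabulary": vocabulary}


# Hypothetical stand-in for a tokenizer that filters too aggressively
STOP_WORDS = {"i", "it", "is", "the", "my"}


def strict_tokenizer(text):
    # Drops stop words AND every word of 10 characters or fewer
    return [w for w in text.lower().split() if w not in STOP_WORDS and len(w) > 10]


sample = ["I love it", "The sound is great", "My echo is the best"]
summary = report_tokenizer_output(strict_tokenizer, sample)
```

If every document yields an empty list, as with this stand-in, the vectorizer has nothing to build a vocabulary from, which would be consistent with the `ValueError` above. In the real pipeline you would pass `spacy_tokenizer` and a slice of `X_train` instead of the sample data.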
import spacy
nlp = spacy.blank("en")
words = [tok.text for tok in nlp("this is my input text")]
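As a rough illustration of how a blank pipeline could be plugged into the vectorizer from the question, here is a minimal sketch. The function name `blank_tokenizer` and the two sample sentences are made up for this example and are not taken from the question's dataset. It assumes spaCy and scikit-learn are installed; `spacy.blank("en")` does not require downloading a trained model.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")


def blank_tokenizer(text):
    # Keep lowercased tokens, skipping punctuation and whitespace
    return [tok.lower_ for tok in nlp(text) if not tok.is_punct and not tok.is_space]


# Made-up sample documents for illustration
docs = ["I love my Echo!", "Love the sound quality."]

# lowercase=False because the tokenizer already lowercases its output
vectorizer = CountVectorizer(tokenizer=blank_tokenizer, lowercase=False)
X = vectorizer.fit_transform(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.shape)
```

scikit-learn may warn that `token_pattern` is ignored when a custom tokenizer is supplied; that warning is harmless here. If this sketch works on your data while the original pipeline does not, the filtering inside `spacy_tokenizer` or the output of the `cleaner` step would be the next places to look.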