Text classification using spaCy

Question

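I am trying to do some text classification with spaCy, but I get an error saying that my vocabulary is empty.

I tried a classic dataset and got the same error. I have seen suggestions to split the text into parts, but my data consists of many rows, none of which is very long.

Here is the code: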

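# Imports added for completeness. spacy_tokenizer, predictors, X_train and
# y_train are defined elsewhere in the asker's code and are not shown here.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df_amazon = pd.read_csv("amazon_alexa.tsv", sep="\t")
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
classifier = LogisticRegression()

# Clean the text, vectorize it, then classify
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifier", classifier)])
pipe.fit(X_train, y_train)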

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-b5a14e655d5a> in <module>
     10 
     11 # Model generation
---> 12 pipe.fit(X_train, y_train)

~\anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    339         """
    340         fit_params_steps = self._check_fit_params(**fit_params)
--> 341         Xt = self._fit(X, y, **fit_params_steps)
    342         with _print_elapsed_time('Pipeline',
    343                                  self._log_message(len(self.steps) - 1)):

~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
    301                 cloned_transformer = clone(transformer)
    302             # Fit or load from cache the current transformer
--> 303             X, fitted_transformer = fit_transform_one_cached(
    304                 cloned_transformer, X, y, None,
    305                 message_clsname='Pipeline',

~\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    752     with _print_elapsed_time(message_clsname, message):
    753         if hasattr(transformer, 'fit_transform'):
--> 754             res = transformer.fit_transform(X, y, **fit_params)
    755         else:
    756             res = transformer.fit(X, y, **fit_params).transform(X)

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1200         max_features = self.max_features
   1201 
-> 1202         vocabulary, X = self._count_vocab(raw_documents,
   1203                                           self.fixed_vocabulary_)
   1204 

~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1131             vocabulary = dict(vocabulary)
   1132             if not vocabulary:
-> 1133                 raise ValueError("empty vocabulary; perhaps the documents only"
   1134                                  " contain stop words")
   1135 

ValueError: empty vocabulary; perhaps the documents only contain stop words

Answer 1

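It looks like you are only using the spaCy tokenizer? I'm not sure what is going on, but you should check the tokenizer's output on your documents.

Note that while I think you can use the tokenizer this way, it is more typical to use a blank pipeline, like so: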
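One way to do that check is sketched below. The asker's spacy_tokenizer is not shown, so this uses a hypothetical stand-in (toy_tokenizer, with a made-up stop list) and made-up sample documents; the helper find_empty_docs is also just an illustrative name. It accepts any callable that maps a string to a list of tokens, so the real tokenizer could be passed in its place.

```python
def find_empty_docs(docs, tokenizer):
    """Return (index, text) pairs for documents the tokenizer reduces to no tokens."""
    return [(i, doc) for i, doc in enumerate(docs) if not tokenizer(doc)]


# Hypothetical stand-in for the asker's spacy_tokenizer: it drops a small,
# made-up stop list, which is enough to show over-filtering.
STOP_WORDS = {"it", "is", "the", "a"}


def toy_tokenizer(text):
    return [word for word in text.lower().split() if word not in STOP_WORDS]


# Made-up sample documents, not taken from the Alexa dataset.
docs = ["It is the", "Great speaker", ""]

# Look at what the tokenizer actually returns for a few documents.
for doc in docs:
    print(repr(doc), "->", toy_tokenizer(doc))

empty = find_empty_docs(docs, toy_tokenizer)
print(f"{len(empty)} of {len(docs)} documents produce no tokens")
```

If every document ends up in that list, the vectorizer has nothing to build a vocabulary from, which matches the "empty vocabulary" error in the traceback.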

import spacy
nlp = spacy.blank("en")
words = [tok.text for tok in nlp("this is my input text")]
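As a rough sketch of how the blank pipeline could be plugged into scikit-learn, the snippet below wraps it in a tokenizer function and fits a CountVectorizer on two made-up sentences. The function body and the filtering rules (dropping punctuation and whitespace) are assumptions for illustration, not the asker's actual spacy_tokenizer.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

# A blank English pipeline only tokenizes; no trained model is required.
nlp = spacy.blank("en")


def spacy_tokenizer(text):
    # Assumed filtering: keep lowercased tokens, drop punctuation and whitespace.
    return [tok.lower_ for tok in nlp(text)
            if not tok.is_punct and not tok.is_space]


# Made-up sample documents, not taken from the Alexa dataset.
docs = ["Love my Echo!", "Sound quality is great."]

# token_pattern=None signals that the custom tokenizer replaces the default regex.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, token_pattern=None)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

Printing the vocabulary right after fitting the vectorizer on its own, outside the Pipeline, makes it easier to see whether the tokenizer or an earlier step is what leaves the documents empty.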
