
spaCy and scikit-learn vectorizer

I wrote a lemma tokenizer using spaCy for scikit-learn based on their example; it works OK standalone:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}

However, using it in GridSearchCV gives errors; a self-contained example is below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)

### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'

The error does not appear when I load spaCy outside of the constructor of the tokenizer; then GridSearchCV runs:

spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

But this means that every one of my n_jobs from GridSearchCV will access and call the same spacynlp object; it is shared among these jobs, which leaves the questions:

  1. Is the spacynlp object from spacy.load('en') safe to be used by multiple jobs in GridSearchCV?
  2. Is this the correct way to implement calls to spacy inside a tokenizer for scikit-learn?

You are wasting time by running spaCy for each parameter setting in the grid. The memory overhead is also significant. You should run all data through spaCy once and save it to disk, then use a simplified vectoriser that reads in pre-lemmatised data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow that show how to build a custom vectoriser.
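For illustration, here is a minimal sketch of that idea (the document names and the 'en' model shortcut from the question are assumptions): lemmatise every document once, then hand the ready-made token lists to TfidfVectorizer with a pass-through analyzer, so the grid search never has to call spaCy again.

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en')  # one-off pass; in practice save the lemmatised output to disk

raw_docs = ['Apples and oranges are tasty.', 'Bananas are tasty too.']
lemmatised_docs = [[token.lemma_ for token in doc] for doc in nlp.pipe(raw_docs)]

# the "documents" are already lists of lemmas, so the analyzer just passes them through
vect = TfidfVectorizer(analyzer=lambda tokens: tokens)
print(vect.fit_transform(lemmatised_docs).shape)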

Based on the comments on mbatchkarov's post, I tried to run all my documents in a pandas Series through spaCy once for tokenization and lemmatization and save the result to disk first. Then I load the lemmatized spaCy Doc objects, extract a list of tokens for every document and supply it as input to a pipeline consisting of a simplified TfidfVectorizer and a DecisionTreeClassifier. I run the pipeline with GridSearchCV and extract the best estimator and its respective params.

See an example:

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import spacy
from spacy.tokens import DocBin
nlp = spacy.load("de_core_news_sm") # define your language model

# adjust attributes to your liking:
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)

# df is a pandas DataFrame holding the raw texts in an 'articleDocument' column
for doc in nlp.pipe(df['articleDocument'].str.lower()):
    doc_bin.add(doc)

# either save DocBin to a bytes object, or...
#bytes_data = doc_bin.to_bytes()

# save DocBin to a file on disc
file_name_spacy = 'output/preprocessed_documents.spacy'
doc_bin.to_disk(file_name_spacy)

#Load DocBin at later time or on different system from disc or bytes object
#doc_bin = DocBin().from_bytes(bytes_data)
doc_bin = DocBin().from_disk(file_name_spacy)

docs = list(doc_bin.get_docs(nlp.vocab))
print(len(docs))

tokenized_lemmatized_texts = [[token.lemma_ for token in doc 
                               if not token.is_stop and not token.is_punct and not token.is_space and not token.like_url and not token.like_email] 
                               for doc in docs]

# classifier to use
clf = tree.DecisionTreeClassifier()

# just some random target response
y = np.random.randint(2, size=len(docs))


# documents are already tokenised and lemmatised, so the tokenizer just passes the token lists through
vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=False, tokenizer=lambda x: x, max_features=3000)

pipeline = Pipeline([('vect', vectorizer), ('dectree', clf)])
parameters = {'dectree__max_depth':[4, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
gs_clf.fit(tokenized_lemmatized_texts, y)
print(gs_clf.best_estimator_.get_params()['dectree'])

