spaCy and scikit-learn vectorizer

Question

Following their example, I wrote a lemma tokenizer for scikit-learn using spaCy, and it works on its own:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}

However, using it in GridSearchCV raises an error. Here is a self-contained example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)

### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'

When I load spaCy outside the tokenizer's constructor, the error does not occur and GridSearchCV runs:

spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

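But this means that every one of GridSearchCV's n_jobs workers will access and call the same spacynlp object, which is shared among those jobs. That leaves two questions: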

Is the spacynlp object returned by spacy.load('en') safe to use from multiple jobs in GridSearchCV?
Is this the correct way to call spaCy from inside a tokenizer for scikit-learn?
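One possible way to keep the model inside the tokenizer class is to load it lazily and leave it out of the object's copied state, so that each copy loads its own model on first use. The sketch below is only an illustration of that idea: it assumes the AttributeError comes from the loaded spaCy object being copied or pickled together with the tokenizer, which the question does not confirm. `load_model` is a hypothetical stand-in for `spacy.load(...)`, and `LazyTokenizer` is an illustrative name.

```python
import copy


def load_model():
    """Hypothetical stand-in for spacy.load(...): returns a callable that tokenizes text."""
    return lambda text: text.lower().split()


class LazyTokenizer:
    """Tokenizer that loads its model on first call instead of in the constructor."""

    def __init__(self):
        self._model = None  # nothing heavy is created here

    def __getstate__(self):
        # Leave the loaded model out of the copied/pickled state;
        # every copy reloads the model the first time it is called.
        state = self.__dict__.copy()
        state["_model"] = None
        return state

    def __call__(self, doc):
        if self._model is None:
            self._model = load_model()
        return self._model(doc)


tokenizer = LazyTokenizer()
tokens = tokenizer("Apples and oranges")      # first call loads the model
duplicate = copy.deepcopy(tokenizer)          # deepcopy goes through __getstate__
model_was_copied = duplicate._model is not None
duplicate_tokens = duplicate("Apples and oranges")  # the copy loads its own model
```

`pickle` uses the same `__getstate__` hook as `copy.deepcopy`, so the model is left out in both cases. This sketch does not settle whether a single shared spaCy object is safe to call from several jobs.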

Answer 1

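You are wasting time by running spaCy for every parameter setting in the grid. The memory overhead is also significant. You should run all your data through spaCy once and save the result to disk, then use a simplified vectorizer that reads the pre-lemmatized data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow that show how to build a custom vectorizer.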
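A minimal sketch of that workflow, under stated assumptions: `slow_lemmatize` is a hypothetical placeholder for the expensive spaCy step (here it only lowercases and splits), the JSON cache file is just one convenient storage choice, and `identity` is an illustrative helper. The vectorizer is given a callable `analyzer`, so it takes the cached token lists as they are.

```python
import json
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer


def slow_lemmatize(text):
    """Hypothetical stand-in for the expensive spaCy call; it only lowercases and splits."""
    return text.lower().split()


def identity(tokens):
    """Return the pre-computed token list unchanged (module-level, so it can be pickled)."""
    return tokens


raw_docs = ["Apples and oranges are tasty", "Oranges are orange"]

# Step 1: tokenize every document once and cache the result on disk.
cache_path = os.path.join(tempfile.mkdtemp(), "lemmas.json")
with open(cache_path, "w", encoding="utf-8") as fh:
    json.dump([slow_lemmatize(doc) for doc in raw_docs], fh)

# Step 2: later (for example before a grid search), read the cached token lists back.
with open(cache_path, encoding="utf-8") as fh:
    token_lists = json.load(fh)

# A callable analyzer receives each input item and returns its features,
# so the vectorizer does no preprocessing or tokenization of its own.
vect = TfidfVectorizer(analyzer=identity)
matrix = vect.fit_transform(token_lists)
vocabulary = sorted(vect.vocabulary_)
```

Because the token lists are computed before the search starts, changing vectorizer or classifier parameters in the grid no longer triggers another spaCy pass.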

Answer 2

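Following the comment in mbatchkarov's post, I tried running all documents in a Pandas Series through spaCy once for tokenization and lemmatization, and saving the result to disk first. I then load the lemmatized spaCy Doc objects, extract a list of tokens for each document, and feed that as input to a pipeline consisting of a simplified TfidfVectorizer and a DecisionTreeClassifier. I run the pipeline with GridSearchCV and extract the best estimator and its parameters.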

Here is an example:

import numpy as np
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import spacy
from spacy.tokens import DocBin
nlp = spacy.load("de_core_news_sm") # define your language model

# adjust attributes to your liking:
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)

# df is assumed to be an existing pandas DataFrame with a text column 'articleDocument'
for doc in nlp.pipe(df['articleDocument'].str.lower()):
    doc_bin.add(doc)

# either save DocBin to a bytes object, or...
#bytes_data = doc_bin.to_bytes()

# save DocBin to a file on disc
file_name_spacy = 'output/preprocessed_documents.spacy'
doc_bin.to_disk(file_name_spacy)

#Load DocBin at later time or on different system from disc or bytes object
#doc_bin = DocBin().from_bytes(bytes_data)
doc_bin = DocBin().from_disk(file_name_spacy)

docs = list(doc_bin.get_docs(nlp.vocab))
print(len(docs))

tokenized_lemmatized_texts = [[token.lemma_ for token in doc 
                               if not token.is_stop and not token.is_punct and not token.is_space and not token.like_url and not token.like_email] 
                               for doc in docs]

# classifier to use
clf = tree.DecisionTreeClassifier()

# just some random target response
y = np.random.randint(2, size=len(docs))


vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=False, tokenizer=lambda x: x, max_features=3000)

pipeline = Pipeline([('vect', vectorizer), ('dectree', clf)])
parameters = {'dectree__max_depth':[4, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
gs_clf.fit(tokenized_lemmatized_texts, y)
print(gs_clf.best_estimator_.get_params()['dectree'])

Answer 1: 2, accepted, 2017-07-20 10:51:43
Answer 2: 0, 2021-12-24 13:04:56
