spaCy and scikit-learn vectorizer
I wrote a lemma tokenizer using spaCy for scikit-learn, based on their example. It works fine standalone:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')

    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}
However, using it in GridSearchCV gives errors. A self-contained example is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)
### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'
The error does not appear when I load spaCy outside of the tokenizer's constructor; then GridSearchCV runs:
spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc
But this means that every one of the n_jobs workers from GridSearchCV will access and call the same spacynlp object, which is shared among these jobs. That leaves the question:
Is the spacynlp object from spacy.load('en') safe to be used by multiple jobs in GridSearchCV?

You are wasting time by running spaCy for each parameter setting in the grid. The memory overhead is also significant. You should run all data through spaCy once and save it to disk, then use a simplified vectoriser that reads in pre-lemmatised data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow that show how to build a custom vectoriser.
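As a minimal sketch of what such a simplified vectoriser could look like, the block below assumes the documents have already been lemmatised into lists of strings. The `identity` helper and the tiny `lemmatised_docs` corpus are hypothetical names introduced only for illustration. Passing a callable as `analyzer` tells TfidfVectorizer to skip its own preprocessing and tokenization, so no spaCy work is repeated inside the grid search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Return the already-lemmatised token list unchanged."""
    return tokens


# Hypothetical pre-lemmatised corpus: one list of lemmas per document.
# In practice these lists would be loaded from disk after a single spaCy pass.
lemmatised_docs = [
    ["apple", "and", "orange", "be", "tasty"],
    ["orange", "juice", "be", "sweet"],
]

# A callable analyzer replaces preprocessing, tokenization and n-gram
# extraction, so each document is consumed as a ready-made list of terms.
vect = TfidfVectorizer(analyzer=identity)
X = vect.fit_transform(lemmatised_docs)

print(sorted(vect.vocabulary_))
print(X.shape)
```

A named module-level function is used here instead of a lambda so that the vectoriser can also be pickled with the standard pickle module.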
Based on the comments on mbatchkarov's post, I tried running all my documents, held in a pandas Series, through spaCy once for tokenization and lemmatization and saving the result to disk first. Then I load the lemmatized spaCy Doc objects, extract a list of tokens for every document, and supply it as input to a pipeline consisting of a simplified TfidfVectorizer and a DecisionTreeClassifier. I run the pipeline with GridSearchCV and extract the best estimator and its parameters.
See an example:
import numpy as np
import spacy
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from spacy.tokens import DocBin
nlp = spacy.load("de_core_news_sm") # define your language model
# adjust attributes to your liking:
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
# df is assumed to be a pandas DataFrame with a text column 'articleDocument'
for doc in nlp.pipe(df['articleDocument'].str.lower()):
    doc_bin.add(doc)
# either save DocBin to a bytes object, or...
#bytes_data = doc_bin.to_bytes()
# save DocBin to a file on disc
file_name_spacy = 'output/preprocessed_documents.spacy'
doc_bin.to_disk(file_name_spacy)
#Load DocBin at later time or on different system from disc or bytes object
#doc_bin = DocBin().from_bytes(bytes_data)
doc_bin = DocBin().from_disk(file_name_spacy)
docs = list(doc_bin.get_docs(nlp.vocab))
print(len(docs))
tokenized_lemmatized_texts = [
    [token.lemma_ for token in doc
     if not token.is_stop and not token.is_punct and not token.is_space
     and not token.like_url and not token.like_email]
    for doc in docs
]
# classifier to use
clf = tree.DecisionTreeClassifier()
# just some random target response
y = np.random.randint(2, size=len(docs))
vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=False, tokenizer=lambda x: x, max_features=3000)
pipeline = Pipeline([('vect', vectorizer), ('dectree', clf)])
parameters = {'dectree__max_depth':[4, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
gs_clf.fit(tokenized_lemmatized_texts, y)
print(gs_clf.best_estimator_.get_params()['dectree'])
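If lemmatization has to stay inside the vectorizer rather than being precomputed, one alternative that is not taken from this thread is to load the model lazily and drop it whenever the tokenizer is pickled, so that each worker process loads its own copy on first use. The sketch below is an assumption about how that could be done: the class name `LazyLemmaTokenizer` and the default model name `en_core_web_sm` are hypothetical, and the approach still re-runs spaCy for every parameter setting.

```python
class LazyLemmaTokenizer:
    """Tokenizer that loads spaCy on first call and never pickles the model."""

    def __init__(self, model_name="en_core_web_sm"):
        # Only the model name is stored up front; the name is an assumption.
        self.model_name = model_name
        self._nlp = None

    def _load(self):
        # Imported here so that constructing and pickling need no spaCy model.
        import spacy
        return spacy.load(self.model_name)

    def __getstate__(self):
        # Drop the loaded pipeline before pickling; workers reload it lazily.
        state = self.__dict__.copy()
        state["_nlp"] = None
        return state

    def __call__(self, doc):
        if self._nlp is None:
            self._nlp = self._load()
        # Same filter as the original tokenizer: keep lemmas that are longer
        # than one character or that are alphanumeric.
        return [
            token.lemma_
            for token in self._nlp(doc)
            if len(token.lemma_) > 1 or token.lemma_.isalnum()
        ]
```

It would be passed to the vectorizer in the same way as the original class, for example TfidfVectorizer(tokenizer=LazyLemmaTokenizer()).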