Custom Transformer and FeatureUnion for word2vec

I am trying to classify a set of text documents using multiple sets of features. I am using sklearn's FeatureUnion to combine the different features so they can be fitted in a single model. One of the feature sets consists of word embeddings from gensim's word2vec.

import numpy as np
from gensim.models.word2vec import Word2Vec
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data = fetch_20newsgroups(subset='train', categories=categories)  # dummy dataset

# Word2Vec expects tokenized sentences (lists of tokens), not raw strings
sentences = [doc.lower().split() for doc in data.data]
w2v_model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=2)
# dictionary of word embeddings: token -> vector
word2vec = {w: vec for w, vec in zip(w2v_model.wv.index2word, w2v_model.wv.syn0)}
feat_select = SelectKBest(score_func=chi2, k=10)  # other features
TSVD = TruncatedSVD(n_components=50, algorithm="randomized", n_iter=5)  # other features

To include transformers/estimators that are not already available in sklearn, I am attempting to wrap my word2vec results in a custom transformer class that returns the average of the word vectors for each document.

class w2vTransformer(TransformerMixin):
    """
    Wrapper class for running word2vec into pipelines and FeatureUnions
    """
    def __init__(self, word2vec, **kwargs):
        self.word2vec = word2vec
        self.kwargs = kwargs
        self.dim = len(next(iter(word2vec.values())))  # embedding length, not vocabulary size
    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

However, when it comes time to fit the model, I receive an error.

# join features into combined_features
combined_features = FeatureUnion([("w2v_class", w2vTransformer(word2vec)),
                                  ("feat", feat_select), ("TSVD", TSVD)])
# combined_features = FeatureUnion([("feat", feat_select), ("TSVD", TSVD)])  # runs when word embeddings are not included
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('feature_selection', combined_features),
                         ('clf-svm', SGDClassifier(loss="modified_huber")),
                         ])

text_clf_svm_1 = text_clf_svm.fit(data.data,data.target) # fits data

Traceback (most recent call last):

  File "<ipython-input-8-a085b7d40f8f>", line 1, in <module>
    text_clf_svm_1 = text_clf_svm.fit(data.data,data.target) # fits data

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
    **fit_params_steps[name])

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
    return self.func(*args, **kwargs)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 739, in fit_transform
    for name, trans, weight in self._iter())

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 332, in __init__
    self.results = batch()

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\base.py", line 520, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)

  File "<ipython-input-6-cbc52cd420cd>", line 16, in transform
    for words in X

  File "<ipython-input-6-cbc52cd420cd>", line 16, in <listcomp>
    for words in X

  File "<ipython-input-6-cbc52cd420cd>", line 14, in <listcomp>
    np.mean([self.word2vec[w] for w in words if w in self.word2vec]

TypeError: unhashable type: 'csr_matrix'

I understand that the error occurs because the variable "words" is a csr_matrix, whereas the transformer needs an iterable of tokens, such as a list. My question is: how do I modify the transformer class or the data so that I can use the word embeddings as features to feed into FeatureUnion? This is my first SO post, please be gentle.
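To make the cause concrete, here is a minimal sketch that reproduces the same TypeError outside the pipeline. It assumes SciPy is installed; the small matrix `X` stands in for the TF-IDF output that every transformer inside the FeatureUnion receives, and the one-entry `word2vec` dict is a hypothetical placeholder for the real embedding lookup.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for the TF-IDF matrix handed to the FeatureUnion (2 documents, 3 terms).
X = csr_matrix(np.array([[0.5, 0.0, 0.5],
                         [0.0, 1.0, 0.0]]))

# Hypothetical embedding lookup: token -> vector.
word2vec = {"space": np.ones(3)}

# Iterating over a csr_matrix yields one sparse row per document, not a list of tokens.
first_row = next(iter(X))
print(type(first_row).__name__, first_row.shape)

error_message = None
try:
    for w in first_row:   # iterating the row yields a sparse matrix again
        w in word2vec     # dict membership has to hash the key
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

So by the time the data reaches the custom transformer, the original words are gone: the transformer would need the raw documents, not the output of the vectorizer.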

Instead of writing a custom transformer, you can avoid the bug by using the scikit-learn API that Gensim provides directly (the gensim.sklearn_api module, available in Gensim 3.x and removed in Gensim 4.0): https://radimrehurek.com/gensim/sklearn_api/w2vmodel.html

Also, depending on your version of Gensim, the following may help: in my case I could solve the same bug by using the wv attribute of the word2vec object instead of indexing on the object itself.

In the transform method of your w2vTransformer class, use:

self.word2vec.wv[w]

instead of

self.word2vec[w]

Hope it helps!
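A different route, not taken from the answer above, is to restructure the pipeline so that the embedding transformer sees the raw text while the TF-IDF based features live in their own branch of the FeatureUnion. The following is only a sketch under assumptions: `MeanEmbeddingTransformer`, `toy_vectors`, `docs` and `labels` are hypothetical names and toy data, the whitespace tokenization is a guess that should match however the word2vec model was trained, and the values of `k` and `n_components` are shrunk to fit the tiny example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class MeanEmbeddingTransformer(TransformerMixin, BaseEstimator):
    """Average the word vectors of each raw document (hypothetical helper)."""

    def __init__(self, word2vec):
        self.word2vec = word2vec  # dict: token -> 1-D numpy vector

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Embedding length comes from a stored vector, not from the vocabulary size.
        dim = len(next(iter(self.word2vec.values())))
        rows = []
        for doc in X:
            # Assumed tokenization: lowercase and split on whitespace.
            vecs = [self.word2vec[w] for w in doc.lower().split() if w in self.word2vec]
            rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
        return np.array(rows)


# Toy data and toy embeddings, only to make the sketch runnable.
docs = [
    "the rocket reached orbit",
    "nasa launched a rocket",
    "the renderer draws a polygon",
    "shading a polygon on screen",
]
labels = [0, 0, 1, 1]
toy_vectors = {
    "rocket": np.array([1.0, 0.0]),
    "orbit": np.array([0.9, 0.1]),
    "polygon": np.array([0.0, 1.0]),
    "screen": np.array([0.1, 0.9]),
}

# Branch 1: raw text -> TF-IDF -> (chi2 selection + truncated SVD).
sparse_branch = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("union", FeatureUnion([
        ("feat", SelectKBest(score_func=chi2, k=3)),
        ("TSVD", TruncatedSVD(n_components=2)),
    ])),
])

# Branch 2: raw text -> averaged word vectors. Both branches receive the raw documents.
combined_features = FeatureUnion([
    ("w2v", MeanEmbeddingTransformer(toy_vectors)),
    ("tfidf_feats", sparse_branch),
])

text_clf = Pipeline([
    ("features", combined_features),
    ("clf", SGDClassifier(loss="modified_huber", random_state=0)),
])
text_clf.fit(docs, labels)

features = text_clf.named_steps["features"].transform(docs)
predictions = text_clf.predict(docs)
print(features.shape)
```

The key design choice is that the vectorizer moves inside one branch of the union instead of sitting in front of it, so nothing sparse ever reaches the embedding transformer.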
