
How to make use of pre-trained word embeddings when training a model in sklearn?

With things like neural networks (NNs) in Keras it is very clear how to use word embeddings within the training of the NN; you can simply do something like

embeddings = ...
model = Sequential([Embedding(...),
                    layer1,
                    layer2, ...])

But I'm unsure of how to do this with algorithms in sklearn such as SVMs, Naive Bayes, and logistic regression. I understand that there is a Pipeline class, which can be used simply ( http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html ) like

pip = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', Classifier())])
pip.fit(X_train, y_train)

But how can I include loaded word embeddings in this pipeline? Or should it somehow be included outside the pipeline? I can't find much documentation online about how to do this.

Thanks.

You can use the FunctionTransformer class. If your goal is to have a transformer that takes a matrix of indexes and outputs a 3d tensor with word vectors, then this should suffice:

# this assumes you're using numpy ndarrays
import numpy as np
from sklearn.preprocessing import FunctionTransformer

word_vecs_matrix = get_wv_matrix()  # pseudo-code: load your pre-trained vectors

def transform(x):
    # index into the embedding matrix:
    # (n_samples, seq_len) -> (n_samples, seq_len, embedding_dim)
    return word_vecs_matrix[x]

transformer = FunctionTransformer(transform)

Be aware that, unlike in Keras, the word vectors will not be fine-tuned by gradient descent during training.
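To show how this slots into a full sklearn pipeline, here is a minimal sketch that averages pre-trained word vectors per document and feeds the result to a classifier. The embedding table and the `average_embedding` function are illustrative stand-ins (a toy two-dimensional vocabulary rather than real GloVe/word2vec vectors):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Toy pre-trained embeddings; in practice you would load GloVe/word2vec here
word_vecs = {
    "good":  np.array([1.0, 0.5]),
    "great": np.array([0.9, 0.6]),
    "bad":   np.array([-1.0, -0.4]),
    "awful": np.array([-0.8, -0.7]),
}
dim = 2

def average_embedding(docs):
    """Map each document to the mean of its known word vectors."""
    out = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        vecs = [word_vecs[w] for w in doc.lower().split() if w in word_vecs]
        if vecs:
            out[i] = np.mean(vecs, axis=0)
    return out

pipe = Pipeline([
    # validate=False lets the transformer accept a list of strings
    ("embed", FunctionTransformer(average_embedding, validate=False)),
    ("clf", LogisticRegression()),
])

X_train = ["good great", "bad awful", "great good good", "awful bad"]
y_train = [1, 0, 1, 0]
pipe.fit(X_train, y_train)
```

Because the embedding step is just another transformer, the downstream classifier can be swapped for an SVM or Naive Bayes without touching the embedding code.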

There is an easy way to get word embedding transformers with the Zeugma package.

It handles the downloading of the pre-trained embeddings and returns a "Transformer interface" for the embeddings.

For example, if you want to use the average of the GloVe embeddings as sentence representations, you'd just have to write:

    from zeugma.embeddings import EmbeddingTransformer
    glove = EmbeddingTransformer('glove')

Here glove is a sklearn transformer with the standard transform method that takes a list of sentences as input and outputs a design matrix, just like TfidfTransformer. You can get the resulting embeddings with embeddings = glove.transform(['first sentence of the corpus', 'another sentence']), and embeddings would contain a 2 x N matrix, where N is the dimension of the chosen embedding. Note that you don't have to bother with downloading the embeddings, or loading them locally if you've already done so; Zeugma handles this transparently.
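If you'd rather not add a dependency, the same transformer interface can be rolled by hand. Below is a minimal sketch with a hypothetical `MeanEmbeddingTransformer` class and a toy two-word embedding table standing in for real GloVe vectors; `fit` is a no-op because the embeddings are pre-trained:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

class MeanEmbeddingTransformer(BaseEstimator, TransformerMixin):
    """Average pre-loaded word vectors per sentence; a minimal stand-in
    for an embedding transformer like Zeugma's."""

    def __init__(self, word_vecs, dim):
        self.word_vecs = word_vecs
        self.dim = dim

    def fit(self, X, y=None):
        return self  # nothing to learn: the embeddings are pre-trained

    def transform(self, X):
        rows = []
        for sentence in X:
            vecs = [self.word_vecs[w] for w in sentence.lower().split()
                    if w in self.word_vecs]
            rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(self.dim))
        return np.vstack(rows)

# Toy embedding table standing in for real pre-trained vectors
toy_vecs = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
embedder = MeanEmbeddingTransformer(toy_vecs, dim=2)

# Plugs into a pipeline like any other sklearn transformer
pipe = make_pipeline(embedder, SVC())
pipe.fit(["cat", "dog", "cat cat", "dog dog"], [0, 1, 0, 1])
```

Implementing `fit`/`transform` on a `TransformerMixin` subclass is all that's needed for the class to compose with `Pipeline`, `GridSearchCV`, and the rest of sklearn.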

Hope this helps
