
How to train a model with only an Embedding layer in Keras and no labels

I have some text without any labels, just a bunch of text files, and I want to train an Embedding layer that maps words to embedding vectors. Most of the examples I've seen so far look like this:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['acc'])
model.fit(x_train, y_train,
    epochs=10,
    batch_size=32,
    validation_data=(x_val, y_val))

They all assume that the Embedding layer is part of a bigger model that tries to predict a label. But in my case, I have no labels. I'm not trying to classify anything; I just want to train the mapping from words (more precisely, integers) to embedding vectors. Yet the model's fit method asks for both x_train and y_train (as in the example above).

How can I train a model only with an Embedding layer and no labels?

[UPDATE]

Based on the answer I got from @Daniel Möller, the Embedding layer in Keras implements a supervised algorithm and thus cannot be trained without labels. Initially I thought it was a variation of Word2Vec and therefore did not need labels to be trained. Apparently, that's not the case. Personally, I ended up using FastText, which has nothing to do with Keras or Python.

Does it make sense to do that without a label/target?

How will your model decide which values in the vectors are good for anything if there is no objective?

All embeddings are "trained" for a purpose. If there is no purpose, there is no target; if there is no target, there is no training.

If you really want to transform words in vectors without any purpose/target, you've got two options:

  • Make one-hot encoded vectors. You may use the Keras to_categorical function for that.
  • Use a pretrained embedding. There are some available, such as GloVe, embeddings from Google, etc. (All of them were trained at some point for some purpose.)
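The first option needs no training at all. A minimal sketch of one-hot encoding with to_categorical (the vocabulary size and the example word indices here are made up for illustration):

```python
import numpy as np
from keras.utils import to_categorical

vocab_size = 10                      # assumed vocabulary size
word_indices = np.array([3, 1, 7])   # example integer-encoded words

# Each word id becomes a vector of length vocab_size with a single 1
one_hot = to_categorical(word_indices, num_classes=vocab_size)
print(one_hot.shape)  # one row per word, one column per vocabulary entry
```

These vectors carry no learned meaning, of course; every word is equally distant from every other word.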

A very naive approach based on our chat, considering word distance

Warning: I don't really know anything about Word2Vec, but I'll try to show how to add the rules for your embedding using some naive kind of word distance and how to use dummy "labels" just to satisfy Keras' way of training.

from keras.layers import Input, Embedding, Subtract, Lambda
import keras.backend as K
from keras.models import Model

input1 = Input((1,)) #word1
input2 = Input((1,)) #word2

embeddingLayer = Embedding(...params...)

word1 = embeddingLayer(input1)
word2 = embeddingLayer(input2)

#naive distance rule, subtract, expect zero difference
word_distance = Subtract()([word1,word2])

#reduce all dimensions to a single dimension
word_distance = Lambda(lambda x: K.mean(x, axis=-1))(word_distance)

model = Model([input1,input2], word_distance)

Now that our model outputs a word distance directly, our labels will be zero. They're not really labels for supervised training, but they are the expected output of the model, which Keras needs in order to train.

We can use mae (mean absolute error) or mse (mean squared error) as the loss function, for instance.

model.compile(optimizer='adam', loss='mse')

And training with word2 being the word after word1:

import numpy as np

# entireText is the whole corpus as an array of integer word ids
xTrain1 = entireText[:-1]            # each word
xTrain2 = entireText[1:]             # the word that follows it
yTrain = np.zeros((len(xTrain1),))   # expected distance: zero

model.fit([xTrain1,xTrain2], yTrain, .... more params.... ) 
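Putting the snippets above together into a self-contained, runnable sketch (the vocabulary size, embedding dimension, the random "corpus", and all training parameters are placeholder assumptions):

```python
import numpy as np
from keras.layers import Input, Embedding, Subtract, Lambda
from keras.models import Model
import keras.backend as K

vocab_size, embedding_dim = 50, 8
# Dummy corpus: 1000 random integer word ids standing in for real text
entireText = np.random.randint(1, vocab_size, size=(1000,))

input1 = Input((1,))  # word1
input2 = Input((1,))  # word2
embeddingLayer = Embedding(vocab_size, embedding_dim)

# Naive distance rule: subtract the two embeddings, expect zero difference
distance = Subtract()([embeddingLayer(input1), embeddingLayer(input2)])
distance = Lambda(lambda x: K.mean(x, axis=-1))(distance)

model = Model([input1, input2], distance)
model.compile(optimizer='adam', loss='mse')

xTrain1, xTrain2 = entireText[:-1], entireText[1:]   # consecutive word pairs
yTrain = np.zeros((len(xTrain1),))                   # "target" is zero distance
model.fit([xTrain1, xTrain2], yTrain, epochs=1, batch_size=32, verbose=0)

# The trained lookup table itself: one row of embedding_dim values per word id
vectors = embeddingLayer.get_weights()[0]
```

Note that this objective is degenerate on its own: the loss is minimized by mapping every word to the same vector, which is exactly why real Word2Vec-style training also pushes unrelated words apart (e.g. via negative sampling).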

Although this may be completely wrong regarding what Word2Vec really does, it illustrates the main points:

  • Embedding layers don't have special properties, they're just trainable lookup tables
  • Rules for creating an embedding should be defined by the model and expected outputs
  • A Keras model will need "targets", even if those targets are not "labels" but a mathematical trick for an expected result.
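The first bullet can be verified directly: an Embedding layer's output for word id i is exactly row i of its weight matrix. A small check (sizes arbitrary):

```python
import numpy as np
from keras.layers import Input, Embedding
from keras.models import Model

inp = Input((1,))
emb = Embedding(5, 3)          # 5 words, 3-dimensional vectors
model = Model(inp, emb(inp))

table = emb.get_weights()[0]   # the lookup table, shape (5, 3)
out = model.predict(np.array([[2]]))  # "look up" word id 2

# The layer did nothing but index into its weight matrix
assert np.allclose(out[0, 0], table[2])
```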
