How to use keras embedding layer with 3D tensor input?

I am facing difficulty using the Keras embedding layer with one-hot encoded input data.

Following is the toy code.

Import packages

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
import matplotlib.pyplot as plt
import numpy as np
import openpyxl
import pandas as pd
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau

The input data is text-based, as follows.

Train and test data

X_train_orignal= np.array(['OC(=O)C1=C(Cl)C=CC=C1Cl', 'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
       'OC(=O)C1=CC=CC(=C1Cl)Cl', 'OC(=O)C1=CC(=CC=C1Cl)Cl',
       'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O'])

X_test_orignal=np.array(['OC(=O)C1=CC=C(Cl)C=C1Cl', 'CCOC(N)=O',
       'OC1=C(Cl)C(=C(Cl)C=C1Cl)Cl'])

Y_train=np.array(([[2.33],
       [2.59],
       [2.59],
       [2.54],
       [4.06]]))

Y_test=np.array([[2.20],
   [2.81],
   [2.00]])

Creating dictionaries

Now I create two dictionaries, mapping characters to indices and vice versa. The number of unique characters is stored in len(charset), and the maximum string length plus 5 additional characters is stored in embed. The start of each string will be padded with ! and the end with E.

charset = set("".join(list(X_train_orignal))+"!E")
char_to_int = dict((c,i) for i,c in enumerate(charset))
int_to_char = dict((i,c) for i,c in enumerate(charset))
embed = max([len(smile) for smile in X_train_orignal]) + 5
print(charset)
print(len(charset), embed)
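With the five training strings above, charset contains 14 unique characters (12 SMILES characters plus ! and E) and the longest string is 40 characters, so embed is 45; the printed ordering of charset will vary because Python sets are unordered. A quick sanity check on the toy data:

assert len(charset) == 14  # 12 SMILES characters plus '!' and 'E'
assert embed == 45         # longest training string (40 chars) + 5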

One-hot encoding

I convert all the training data into one-hot encoding as follows.

def vectorize(smiles):
    one_hot = np.zeros((smiles.shape[0], embed, len(charset)), dtype=np.int8)
    for i, smile in enumerate(smiles):
        # encode the start char
        one_hot[i, 0, char_to_int["!"]] = 1
        # encode the rest of the chars
        for j, c in enumerate(smile):
            one_hot[i, j+1, char_to_int[c]] = 1
        # encode the end char (pad the remainder with 'E')
        one_hot[i, len(smile)+1:, char_to_int["E"]] = 1

    return one_hot[:, 0:-1, :]

X_train = vectorize(X_train_orignal)
print(X_train.shape)
X_test = vectorize(X_test_orignal)
print(X_test.shape)

When the input training data is converted to one-hot encoding, the shape of the encoded data becomes (5, 44, 14) for train and (3, 44, 14) for test. For train, there are 5 examples, 44 is the maximum length, and 14 is the number of unique characters. Examples with fewer characters are padded with E up to the maximum length.

Verifying the correct padding

Following is the code to verify that the padding was done correctly.

mol_str_train = []
mol_str_test = []
for x in range(5):
    mol_str_train.append("".join([int_to_char[idx] for idx in np.argmax(X_train[x,:,:], axis=1)]))

for x in range(3):
    mol_str_test.append("".join([int_to_char[idx] for idx in np.argmax(X_test[x,:,:], axis=1)]))

And let's see what the training set looks like.

mol_str_train

['!OC(=O)C1=C(Cl)C=CC=C1ClEEEEEEEEEEEEEEEEEEEE',
 '!OC(=O)C1=C(Cl)C=C(Cl)C=C1ClEEEEEEEEEEEEEEEE',
 '!OC(=O)C1=CC=CC(=C1Cl)ClEEEEEEEEEEEEEEEEEEEE',
 '!OC(=O)C1=CC(=CC=C1Cl)ClEEEEEEEEEEEEEEEEEEEE',
 '!OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=OEEE']

Each padded string is 44 characters long, as expected. Now it is time to build the model.

Model

model = Sequential()
model.add(Embedding(len(charset), 10, input_length=embed))
model.add(Flatten())
model.add(Dense(1, activation='linear'))

def coeff_determination(y_true, y_pred):
    # R^2 (coefficient of determination) metric
    from keras import backend as K
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return (1 - SS_res / (SS_tot + K.epsilon()))

def get_lr_metric(optimizer):
    # report the optimizer's current learning rate as a metric
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr


optimizer = Adam(lr=0.00025)
lr_metric = get_lr_metric(optimizer)
model.compile(loss="mse", optimizer=optimizer, metrics=[coeff_determination, lr_metric])



callbacks_list = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-15, verbose=1, mode='auto',cooldown=0),
    ModelCheckpoint(filepath="weights.best.hdf5", monitor='val_loss', save_best_only=True, verbose=1, mode='auto')]


history = model.fit(x=X_train, y=Y_train,
                    batch_size=1,
                    epochs=10,
                    validation_data=(X_test, Y_test),
                    callbacks=callbacks_list)

Error

ValueError: Error when checking input: expected embedding_3_input to have 2 dimensions, but got array with shape (5, 44, 14)

The embedding layer expects a two-dimensional array. How can I deal with this issue so that it can accept the one-hot encoded data?

All the above code can be run. 以上所有代码都可以运行。

The Keras embedding layer works with indices, not directly with one-hot encodings. So you don't need shape (5, 44, 14); just (5, 44) works fine.

E.g. get the indices with argmax:

X_test = np.argmax(X_test, axis=2)
X_train = np.argmax(X_train, axis=2)

Although it's probably better not to one-hot encode it in the first place =)
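For example, here is a minimal sketch of a vectorizer that skips the one-hot step and emits integer indices directly (vectorize_indices is a hypothetical helper, built on the char_to_int and embed variables defined in the question):

def vectorize_indices(smiles):
    # same layout as vectorize(), but stores the index itself instead of a one-hot row
    idx = np.full((smiles.shape[0], embed), char_to_int["E"], dtype=np.int64)  # pre-pad with 'E'
    for i, smile in enumerate(smiles):
        idx[i, 0] = char_to_int["!"]        # start character
        for j, c in enumerate(smile):
            idx[i, j + 1] = char_to_int[c]  # the string itself
    return idx[:, 0:-1]                     # drop the last step, as in vectorize()

X_train_idx = vectorize_indices(X_train_orignal)  # shape (5, 44), ready for the Embedding layer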

Besides that, your embed variable says size 45, while your data is size 44.

If you change those, your model runs fine:

model = Sequential()
model.add(Embedding(len(charset), 10, input_length=44))
model.add(Flatten())
model.add(Dense(1, activation='linear'))

def coeff_determination(y_true, y_pred):
    from keras import backend as K
    SS_res =  K.sum(K.square( y_true-y_pred ))
    SS_tot = K.sum(K.square( y_true - K.mean(y_true) ) )
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr


optimizer = Adam(lr=0.00025)
lr_metric = get_lr_metric(optimizer)
model.compile(loss="mse", optimizer=optimizer, metrics=[coeff_determination, lr_metric])



callbacks_list = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-15, verbose=1, mode='auto', cooldown=0),
    ModelCheckpoint(filepath="weights.best.hdf5", monitor='val_loss', save_best_only=True, verbose=1, mode='auto')]


history = model.fit(x=np.argmax(X_train, axis=2), y=Y_train,
                    batch_size=1,
                    epochs=10,
                    validation_data=(np.argmax(X_test, axis=2), Y_test),
                    callbacks=callbacks_list)
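As a quick check under the shapes from the question, the argmax conversion turns the one-hot tensors into plain integer sequences:

print(np.argmax(X_train, axis=2).shape)  # (5, 44)
print(np.argmax(X_test, axis=2).shape)   # (3, 44)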

Our input shape was not defined properly in the embedding layer. The following code works for me by removing the step of converting your data to 2D: you can pass the 3D input directly to the embedding layer.

# THE MISSING STUFF
# _________________________________________
Y_train = Y_train.reshape(5)  # the Dense layer has a single unit, so the targets need to be a 1-D array
max_len = len(charset)        # 14 (number of unique characters)
max_features = embed - 1      # 44 (padded sequence length)
inputshape = (max_features, max_len)  # the input shape was never defined; with input_shape the Embedding layer accepts 3D input
# __________________________________________

model = Sequential()
#model.add(Embedding(len(charset), 10, input_length=14))

model.add(Embedding(max_features, 10, input_shape=inputshape))  #input_length=max_len))
model.add(Flatten())
model.add(Dense(1, activation='linear'))
print(model.summary())

optimizer = Adam(lr=0.00025)
lr_metric = get_lr_metric(optimizer)
model.compile(loss="mse", optimizer=optimizer, metrics=[coeff_determination, lr_metric])


callbacks_list = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-15, verbose=1, mode='auto',cooldown=0),
    ModelCheckpoint(filepath="weights.best.hdf5", monitor='val_loss', save_best_only=True, verbose=1, mode='auto')]

history = model.fit(x=X_train, y=Y_train,
                    batch_size=10,
                    epochs=10,
                    validation_data=(X_test, Y_test),
                    callbacks=callbacks_list)
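For what it's worth, here is a sketch of the shapes this setup should produce (assuming the model above): with a 3D input of shape (44, 14), the Embedding layer looks up an embedding for every one of the 44×14 entries, so Flatten sees 44*14*10 = 6160 features.

print(model.layers[0].output_shape)  # expected: (None, 44, 14, 10)
print(model.layers[1].output_shape)  # expected: (None, 6160)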
