Is there a way to speed up Embedding layer in tf.keras?

I'm trying to implement an LSTM model for DNA sequence classification, but at the moment it is unusable because of how long it takes to train (25 seconds per epoch over 6.5K sequences, about 4 ms per sample, and we need to train several versions of the model over hundreds of thousands of sequences).

A DNA sequence can be represented as a string of A, C, G, and T; for example, "ACGGGTGACAT" could be a single DNA sequence. Each sequence is 1000 characters long and belongs to one of two categories that I am trying to predict.

Initially, my model did not include an Embedding layer; instead, I manually converted each sequence into a one-hot encoded matrix (4 rows by 1000 columns). The model didn't work great, but it was incredibly fast. By that point, though, I had seen online that using an embedding layer has clear advantages. So I added an embedding layer, and instead of using the one-hot encoded matrix I converted the sequences into integers, with each character represented by a different integer.
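
The integer encoding looks roughly like this (the exact character-to-integer mapping and the helper name encode_sequences are illustrative; 0 is reserved for padding so it lines up with mask_zero=True in the model below):

import numpy as np

# Illustrative mapping; 0 is kept free for padding/masking.
CHAR_TO_INT = {'A': 1, 'C': 2, 'G': 3, 'T': 4}

def encode_sequences(sequences):
    # Convert a list of equal-length DNA strings into an integer matrix
    # of shape (n_sequences, sequence_length).
    return np.array(
        [[CHAR_TO_INT[base] for base in seq] for seq in sequences],
        dtype=np.int32,
    )

X = encode_sequences(["ACGT", "TTGA"])  # toy example; real sequences are 1000 long
print(X)  # [[1 2 3 4] [4 4 3 1]]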

Indeed the model works much better now, but it is about 30 times slower and impossible to work with. Is there something I can do here to speed up the embedding layer?

Here are the functions for constructing and fitting my model:

from tensorflow.keras.layers import Embedding, Dense, LSTM, Activation
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

def build_model():
    # initialize a sequential model
    model = Sequential()

    # add embedding layer
    model.add(Embedding(5, 1, input_length=1000, mask_zero=True))

    # Add LSTM layer
    model.add(LSTM(5))

    # Add Dense NN layer
    model.add(Dense(units=2))

    model.add(Activation('softmax'))

    optimizer = Adam(clipnorm=1.)

    model.compile(
        loss="categorical_crossentropy", optimizer=optimizer, metrics=['accuracy']
    )

    return model

def train_model(X_train, y_train, epochs, batch_size):
    model = build_model()

    # y_train is initially a list of zeroes and ones, needs to be converted to categorical
    y_train = to_categorical(y_train)  

    history = model.fit(
        X_train, y_train, epochs=epochs, batch_size=batch_size
    )

    return model, history
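
For context, this is roughly how the training function gets called (the data below is a random stand-in with the shapes described above; the epoch and batch-size values are illustrative, not the ones I actually use):

import numpy as np

# Toy stand-in data; the real set is ~6.5K integer-encoded sequences of length 1000.
X_train = np.random.randint(1, 5, size=(64, 1000))  # bases encoded as 1-4
y_train = np.random.randint(0, 2, size=(64,))       # binary labels (0 or 1)

model, history = train_model(X_train, y_train, epochs=2, batch_size=32)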

Any help will be greatly appreciated - after much googling and trial-and-error, I can't seem to speed this up.

A possible suggestion is to use a "cheaper" RNN, such as SimpleRNN instead of LSTM. It has fewer parameters to train. In some simple testing, I got a ~3x speed-up over LSTM with the same Embedding processing you currently have. Not sure if you can reduce the sequence length from 1000 to a lower number, but that might be a direction to explore as well. I hope this helps.
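
A sketch of that swap, keeping everything else from the question's build_model unchanged (the function name build_model_simple_rnn is just for illustration):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Activation
from tensorflow.keras.optimizers import Adam

def build_model_simple_rnn():
    model = Sequential()
    model.add(Embedding(5, 1, input_length=1000, mask_zero=True))

    # SimpleRNN replaces LSTM(5): a single recurrent transformation instead
    # of the LSTM's four gated ones, so roughly a quarter of the recurrent
    # parameters and less work per timestep.
    model.add(SimpleRNN(5))

    model.add(Dense(units=2))
    model.add(Activation('softmax'))

    model.compile(
        loss="categorical_crossentropy",
        optimizer=Adam(clipnorm=1.),
        metrics=['accuracy'],
    )
    return model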
