tf.keras: How to improve performance with a large embedding layer

Question

I am training an LSTM model with an embedding input layer and a vocabulary size of approximately 100,000. While profiling the training with TensorBoard, I discovered that most of the training time is spent on "Kernel Launch" (58%), followed by "All Others" (36%). In other words, the GPU is idle most of the time due to overhead. The high kernel launch time seems to be driven by the size of the embedding layer.

My question is: how can I improve the training speed? Is it inevitable that most of the training time is spent on kernel launch when working with a large-ish embedding? Increasing the batch size (currently 128) would help, since the kernel launch time doesn't depend on the batch size, but 128 is already on the high side.
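To make the batch-size argument concrete, here is a minimal arithmetic sketch. It assumes, as stated above, that each training step pays a fixed launch overhead that does not grow with the batch size. The function name and all numbers are hypothetical and chosen only for illustration; they are not measurements from the T4.

```python
def time_per_example(batch_size, launch_overhead_ms, compute_ms_per_example):
    """Average time per example when every step pays a fixed overhead."""
    step_ms = launch_overhead_ms + batch_size * compute_ms_per_example
    return step_ms / batch_size


# Hypothetical values, not profiler output.
OVERHEAD_MS = 58.0
COMPUTE_MS_PER_EXAMPLE = 0.05

for batch_size in (128, 256, 512):
    per_example = time_per_example(batch_size, OVERHEAD_MS, COMPUTE_MS_PER_EXAMPLE)
    print(f"batch={batch_size:4d}  ms/example={per_example:.3f}")
```

Under this simple model the fixed overhead is spread over more examples as the batch grows, so the per-example cost falls even though the overhead itself is unchanged.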

I am also not sure what exactly falls under "All Others".

I am working on a Tesla T4 GPU with TensorFlow 2.2.0, but I see the same behavior with the nightly build.

Following the RNN tutorial on tensorflow.org ( https://www.tensorflow.org/tutorials/text/text_classification_rnn ), here is an example that highlights the performance issues:

import tensorflow_datasets as tfds
import tensorflow as tf

from datetime import datetime
from tqdm.auto import tqdm

### retrieve data ###
# use imdb_reviews dataset from TFDS
dataset = tfds.load('imdb_reviews',as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

### get encoder ###
# initialize tokenizer
tokenizer = tfds.features.text.Tokenizer()

# build vocabulary
def addOrUpdate(d,token):
    d[token] = d.get(token,0)+1

vocab = dict()

dataset_iter = iter(train_dataset)
for el in tqdm(dataset_iter):
    text = el[0].numpy().decode("utf-8")
    for token in tokenizer.tokenize(text):
        addOrUpdate(vocab, token)

# shrink vocabulary (MIN_COUNT>1 significantly reduces model dimension)
MIN_COUNT = 1

vocab_subset = set([k for k,v in vocab.items() if v >= MIN_COUNT])
print("Using vocabulary subset with min_count={:}: {:,} words, ".format(MIN_COUNT,len(vocab_subset)))

# create encoder
encoder = tfds.features.text.TokenTextEncoder(vocab_subset)

### Prepare the data for training ###
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text,label):
    # encode
    encoded_text, label = tf.py_function(encode, 
                                         inp=[text, label], 
                                         Tout=(tf.int64, tf.int64))
    # set shapes
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label

train_dataset = train_dataset.map(encode_map_fn)
test_dataset = test_dataset.map(encode_map_fn)

BUFFER_SIZE = 25000
BATCH_SIZE = 128

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)

### create the model ###
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 256, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

### Train the model ###
# create tensorboard callback
log_path = 'logs_'+datetime.now().strftime("%Y%m%d_%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_path,
                                                      profile_batch = '10,20')

history = model.fit(train_dataset, epochs=1, steps_per_epoch=30,
                    callbacks=[tensorboard_callback])

Same code in a Colab Notebook: https://colab.research.google.com/drive/1WoAShXR2cGOYWPQoKdh4IGlhZh4FAK7o?usp=sharing
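The MIN_COUNT comment in the script notes that raising the threshold shrinks the model. As a rough, self-contained illustration of that effect, the sketch below prunes a toy vocabulary with the same `v >= MIN_COUNT` rule. The corpus and the helper name `prune_vocab` are made up for this example and are not the imdb_reviews data.

```python
from collections import Counter


def prune_vocab(token_counts, min_count):
    """Keep only tokens that occur at least min_count times."""
    return {tok for tok, n in token_counts.items() if n >= min_count}


# Toy corpus, purely illustrative.
tokens = "the movie was great the plot was thin the end".split()
counts = Counter(tokens)

for min_count in (1, 2, 3):
    kept = prune_vocab(counts, min_count)
    print(f"min_count={min_count}: {len(kept)} tokens kept")
```

Since the embedding matrix has roughly one row per vocabulary entry, dropping rare tokens reduces its size directly. Whether that also reduces the kernel launch share would need to be checked in the profiler.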

Answer 1

I haven't tried your code, but from looking at it, I suspect the following issue is related:

If a GPU is present but eager execution is enabled, Embedding layers are still placed on the CPU.

See https://github.com/tensorflow/tensorflow/issues/44194 (it includes a workaround).
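One way to check whether this applies to your setup is to run a tiny embedding lookup eagerly and look at the device of the result. This is only a diagnostic sketch, not the workaround from the linked issue. The helper name `embedding_output_device` and the layer sizes are hypothetical, and the outcome may differ between TensorFlow versions.

```python
import tensorflow as tf


def embedding_output_device(vocab_size=1000, dim=8):
    """Run a small Embedding lookup eagerly and return the output's device string."""
    layer = tf.keras.layers.Embedding(vocab_size, dim)
    token_ids = tf.constant([[1, 2, 3]])
    output = layer(token_ids)  # eager call, so the result is a concrete tensor
    return output.device


gpus = tf.config.list_physical_devices("GPU")
device = embedding_output_device()
print("GPUs visible:", len(gpus))
print("Embedding lookup ran on:", device)
```

If a GPU is visible but the reported device is a CPU, the lookup is probably running on the host, which would be consistent with the issue described above.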

Answer 1
0 2021-03-11 16:26:19
