
tf.keras: how to improve performance with large-ish embedding layer

I am training an LSTM model with an embedding input layer whose vocabulary size is approximately 100,000. While profiling the training with the TensorBoard profiler, I discovered that most of the step time is spent on "Kernel Launch" (58%), followed by "All Others" (36%). In other words, the GPU is idle most of the time due to overhead. The high kernel launch time seems to be driven by the size of the embedding layer.

My question is: how can I improve the training speed? Is it inevitable that most of the training time goes to kernel launch when working with a large-ish embedding? Increasing the batch size (currently 128) should help, since the kernel launch time does not depend on the batch size, but 128 is already on the high side.
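For reference, this is the kind of input-pipeline change I have in mind (a rough sketch, not benchmarked: the larger BATCH_SIZE is just an illustrative value, and it reuses train_dataset and BUFFER_SIZE from the full example below):

# illustrative only: a larger batch amortizes the fixed per-step launch overhead,
# and prefetch overlaps host-side input preparation with GPU execution
BATCH_SIZE = 256  # up from 128, purely for illustration

train_dataset = (train_dataset
                 .shuffle(BUFFER_SIZE)
                 .padded_batch(BATCH_SIZE)
                 .prefetch(tf.data.experimental.AUTOTUNE))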

I am also not sure what exactly falls under "All Others".

I am working on a Tesla T4 GPU with TensorFlow 2.2.0, but I see the same behavior with the nightly build.

Following the RNN tutorial on tensorflow.org (https://www.tensorflow.org/tutorials/text/text_classification_rnn), here is an example that reproduces the performance issue:

import tensorflow_datasets as tfds
import tensorflow as tf

from datetime import datetime
from tqdm.auto import tqdm

### retrieve data ###
# use imdb_reviews dataset from TFDS
dataset = tfds.load('imdb_reviews',as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

### get encoder ###
# initialize tokenizer
tokenizer = tfds.features.text.Tokenizer()

# build vocabulary: count how often each token appears in the training set
def addOrUpdate(d, token):
    d[token] = d.get(token, 0) + 1

vocab = dict()

dataset_iter = iter(train_dataset)
for el in tqdm(dataset_iter):    
  text = el[0].numpy().decode("utf-8") 
  for token in tokenizer.tokenize(text):
      addOrUpdate(vocab,token)

# shrink vocabulary (MIN_COUNT>1 significantly reduces model dimension)
MIN_COUNT = 1

vocab_subset = set([k for k,v in vocab.items() if v >= MIN_COUNT])
print("Using vocabulary subset with min_count={:}: {:,} words, ".format(MIN_COUNT,len(vocab_subset)))

# create encoder
encoder = tfds.features.text.TokenTextEncoder(vocab_subset)

### Prepare the data for training ###
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text,label):
    # encode
    encoded_text, label = tf.py_function(encode, 
                                         inp=[text, label], 
                                         Tout=(tf.int64, tf.int64))
    # set shapes
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label

train_dataset = train_dataset.map(encode_map_fn)
test_dataset = test_dataset.map(encode_map_fn)

BUFFER_SIZE = 25000
BATCH_SIZE = 128

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)

### create the model ###
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 256, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

### Train the model ###
# create tensorboard callback
log_path = 'logs_'+datetime.now().strftime("%Y%m%d_%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_path,
                                                      profile_batch = '10,20')

history = model.fit(train_dataset, epochs=1, steps_per_epoch=30,
                    callbacks=[tensorboard_callback])

Same code in a Colab Notebook: https://colab.research.google.com/drive/1WoAShXR2cGOYWPQoKdh4IGlhZh4FAK7o?usp=sharing

I haven't tried your code, but from looking at it, I suspect the following issue might be related:

If a GPU is present but eager execution is enabled, Embedding layers are still placed on the CPU.

See https://github.com/tensorflow/tensorflow/issues/44194 (it includes a workaround).
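I'm not sure this is exactly the workaround described in that issue, but a common way to check and work around this kind of placement problem is to turn on device-placement logging and build the model inside an explicit GPU device scope. An untested sketch (reusing the encoder from your example):

import tensorflow as tf

# log where each op is placed, to confirm whether the embedding lookup runs on the CPU
tf.debugging.set_log_device_placement(True)

# force the whole model, including the Embedding layer, onto the GPU
with tf.device('/GPU:0'):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(encoder.vocab_size, 256, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(1)
    ])

If the embedding lookup still shows up on the CPU in the logs, the discussion in the issue above should have more details.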
