
Training a BERT-based model causes an OutOfMemory error. How do I fix this?

My setup has an NVIDIA P100 GPU. I am working on a Google BERT model to answer questions. I am using the SQuAD question-answering dataset, which gives me questions and the paragraphs from which the answers should be drawn, and my research indicates this architecture should be fine, but I keep getting OutOfMemory errors during training:

ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Below, please find a full program that uses someone else's implementation of Google's BERT algorithm inside my own model. Please let me know what I can do to fix my error. Thank you!

import json
import numpy as np
import pandas as pd
import os
assert os.path.isfile("train-v1.1.json"),"Non-existent file"
from tensorflow.python.client import device_lib
import tensorflow.compat.v1 as tf
#import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import re
regex = re.compile(r'\W+')
#Reading the files.
def readFile(filename):
  with open(filename) as file:
    fields = []
    JSON = json.loads(file.read())
    articles = []
    for article in JSON["data"]:
      articleTitle = article["title"]
      article_body = []
      for paragraph in article["paragraphs"]:
        paragraphContext = paragraph["context"]
        article_body.append(paragraphContext)
        for qas in paragraph["qas"]:
          question = qas["question"]
          answer = qas["answers"][0]
          fields.append({"question":question,"answer_text":answer["text"],"answer_start":answer["answer_start"],"paragraph_context":paragraphContext,"article_title":articleTitle})
      article_body = "\n".join(article_body)
      article = {"title":articleTitle,"body":article_body}
      articles.append(article)
  fields = pd.DataFrame(fields)
  fields["question"] = fields["question"].str.replace(regex," ")
  assert not (fields["question"].str.contains("catalanswhat").any())
  fields["paragraph_context"] = fields["paragraph_context"].str.replace(regex," ")
  fields["answer_text"] = fields["answer_text"].str.replace(regex," ")
  assert not (fields["paragraph_context"].str.contains("catalanswhat").any())
  fields["article_title"] = fields["article_title"].str.replace("_"," ")
  assert not (fields["article_title"].str.contains("catalanswhat").any())
  return fields,JSON["data"]
trainingData,training_JSON = readFile("train-v1.1.json")
print("JSON dataset read.")
#Text preprocessing
## Converting text to skipgrams
print("Tokenizing sentences.")
strings = trainingData.drop("answer_start",axis=1)
strings = strings.values.flatten()

answer_start_train_one_hot = pd.get_dummies(trainingData["answer_start"])

# @title Keras-BERT Environment
import os
pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
# Use TF_Keras
os.environ["TF_KERAS"] = "1"

# @title Load Basic Model
import codecs
from keras_bert import load_trained_model_from_checkpoint
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

model = load_trained_model_from_checkpoint(config_path, checkpoint_path)

#@title Model Summary
model.summary()

#@title Create tokenization stuff.
from keras_bert import Tokenizer

tokenizer = Tokenizer(token_dict)
def tokenize(text,max_len):
  tokenizer.tokenize(text)
  return tokenizer.encode(first=text,max_len=max_len)
def tokenize_array(texts,max_len=512):
  indices = np.zeros((texts.shape[0],max_len))
  segments = np.zeros((texts.shape[0],max_len))
  for i in range(texts.shape[0]):
    tokens = tokenize(texts[i],max_len)
    indices[i] = tokens[0]
    segments[i] = tokens[1]
  #print(indices.shape)
  #print(segments.shape)
  return np.stack([segments,indices],axis=1)

#@ Tokenize inputs.
def X_Y(dataset,answer_start_one_hot,batch_size=10):
    questions = dataset["question"]
    contexts = dataset["paragraph_context"]
    questions_tokenized = tokenize_array(questions.values)
    contexts_tokenized = tokenize_array(contexts.values)
    X = np.stack([questions_tokenized,contexts_tokenized],axis=1)
    Y = answer_start_one_hot
    return X,Y
def X_Y_generator(dataset,answer_start_one_hot,batch_size=10):
    while True:
        try:
            batch_indices = np.random.choice(np.arange(0,dataset.shape[0]),size=batch_size)
            dataset_batch = dataset.iloc[batch_indices]
            X,Y = X_Y(dataset_batch,answer_start_one_hot.iloc[batch_indices])
            max_int = pd.concat((trainingData["answer_start"],devData["answer_start"])).max()
            yield (X,Y)
        except Exception as e:
            print("Unhandled exception in X_Y_generator: ",e)
            raise

model.trainable = True

# These imports are needed for the layers and callback used below.
from tensorflow.keras.layers import Input, Lambda, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

answers_network_checkpoint = ModelCheckpoint('answers_network-best.h5', verbose=1, monitor='val_loss', save_best_only=True, mode='auto')

input_layer = Input(shape=(2,2,512,))
print("input layer: ",input_layer.shape)
questions_input_layer = Lambda(lambda x: x[:,0])(input_layer)
context_input_layer = Lambda(lambda x: x[:,1])(input_layer)
print("questions input layer: ",questions_input_layer.shape)
print("context input layer: ",context_input_layer.shape)
questions_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(questions_input_layer)
print("questions indices layer: ",questions_indices_layer.shape)
questions_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(questions_input_layer)
print("questions segments layer: ",questions_segments_layer.shape)
context_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(context_input_layer)
context_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(context_input_layer)
questions_bert_layer = model([questions_indices_layer,questions_segments_layer])
print("Questions bert layer loaded.")
context_bert_layer = model([context_indices_layer,context_segments_layer])
print("Context bert layer loaded.")
questions_flattened = Flatten()(questions_bert_layer)
context_flattened = Flatten()(context_bert_layer)
combined = Concatenate()([questions_flattened,context_flattened])
#bert_dense_questions = Dense(256,activation="sigmoid")(questions_flattened)
#bert_dense_context = Dense(256,activation="sigmoid")(context_flattened)
answers_network_output = Dense(1604,activation="softmax")(combined)
#answers_network = Model(inputs=[input_layer],outputs=[questions_bert_layer,context_bert_layer])
answers_network = Model(inputs=[input_layer],outputs=[answers_network_output])
answers_network.summary()

answers_network.compile("adam","categorical_crossentropy",metrics=["accuracy"])

answers_network.fit_generator(
    X_Y_generator(
        trainingData,
        answer_start_train_one_hot,
        batch_size=10),
    steps_per_epoch=100,
    epochs=100,
    callbacks=[answers_network_checkpoint])

My vocabulary size is about 83,000 words. Any model with a "good" accuracy/F1 score is preferred, but I am also on a hard deadline 5 days from now.

EDIT:

Unfortunately, there was one thing I didn't mention: I am actually using CyberZHG's keras-bert module both for preprocessing and for the actual BERT model, so some optimizations may break the code. For example, I tried setting the default float type to float16, but this caused a compatibility error.

EDIT #2:

By request, here's the code for my full program:

Jupyter notebook

Edit: I have edited my answer in place rather than making an already long answer even longer.

After looking into it, the issue arises from the final layer in your model. I was able to get it to work with the following fixes/changes.

ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

So, looking at the error, the problem is that TensorFlow cannot allocate an array of shape [786432,1604]. A simple calculation shows that this array alone is about 5 GB (assuming float32), or about 10 GB in float64. Add the parameters coming from BERT and the other layers in the model, and voilà, you run out of memory.
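
For reference, a quick back-of-the-envelope check of just that one tensor (the shape is taken from the error message above):

rows, cols = 786432, 1604                        # shape from the OOM message
print(rows * cols * 4 / 1e9, "GB in float32")    # ~5.0 GB
print(rows * cols * 8 / 1e9, "GB in float64")    # ~10.1 GB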

The issues

Data type

Looking at the code, all the layers in your answer network produce float64 because you are casting to float64 in all your Lambda layers. So my first suggestion is:

  • Set it globally with tf.keras.backend.set_floatx('float16'), which should fix the problem.

And as a precaution,

question_indices_layer = Input(shape=(256,), dtype='float16')
question_segments_layer = Input(shape=(256,), dtype='float16')
context_indices_layer = Input(shape=(256,), dtype='float16')
context_segments_layer = Input(shape=(256,), dtype='float16')
questions_bert_layer = model([question_indices_layer,question_segments_layer])
context_bert_layer = model([context_indices_layer,context_segments_layer])
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])
  • Now all your layers will be in float16.
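
If you do go the global route, here is a minimal sketch (assuming tf.keras; note that per the question's edit, keras-bert did not tolerate float16, so treat this as an experiment rather than a guaranteed fix):

import tensorflow as tf

# Set the default float type before any layers are built; layers created
# afterwards will default to float16.
tf.keras.backend.set_floatx('float16')

# Quick sanity check that new layers pick up the policy.
layer = tf.keras.layers.Dense(64)
print(layer.dtype)  # expected: 'float16'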

Squashing the output before that last softmax layer

Another thing you can do is avoid passing a massive [batch size, 512, 768] output to your dense layer; instead, squash it with a smaller layer or some transformation first. A few things you can try:

  • Add smaller dense layers that reduce the dimensionality before feeding it to the 1604-way softmax layer. This reduces the model's parameters significantly.
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])
  • Sum/average over the time dimension of the question output. You only care about understanding what the question is, so it is fine to lose positional information from that output. You can do this as follows:

questions_flattened = Lambda(lambda x: K.sum(x, axis=1))(questions_bert_layer)

  • Instead of Concatenate, try Add() so that you don't increase the dimensionality.

  • You can try any of these, alone or in combination with others in the list. But make sure you match the dimensions of questions_flattened and contexts_flattened when combining them, otherwise you'll get errors. A combined sketch follows this list.
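
Here is a minimal combined sketch of the pooling and Add() ideas above. The tensors questions_bert_layer and context_bert_layer are assumed to exist as in the question's code, and the 64-unit width is just an illustrative choice:

import tensorflow as tf
from tensorflow.keras.layers import Lambda, Dense, Add

def squash_branch(bert_output, width=64):
    # [batch, seq_len, 768] -> [batch, 768] by averaging over time,
    # then project down to `width`.
    pooled = Lambda(lambda x: tf.reduce_mean(x, axis=1))(bert_output)
    return Dense(width, activation='relu')(pooled)

questions_squashed = squash_branch(questions_bert_layer)
contexts_squashed = squash_branch(context_bert_layer)
combined = Add()([questions_squashed, contexts_squashed])   # same width, so Add() works
answers_network_output = Dense(1604, activation='softmax')(combined)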

Length of the sequence

The next problem is that your input length is 512. I'm not sure how you arrived at that number, but I think you can do well below it. For example, you get the following statistics for questions and paragraphs:

count    175198.000000
mean         11.217582
std           3.597345
min           1.000000
25%           9.000000
50%          11.000000
75%          13.000000
max          41.000000
Name: question, dtype: float64

count    175198.000000
mean        123.791653
std          50.541241
min          21.000000
25%          92.000000
50%         114.000000
75%         147.000000
max         678.000000
Name: paragraph_context, dtype: float64

You can get this information as,

pd.Series(trainingData["question"]).str.split(' ').str.len().describe()

As an example, when you pad your sequences using pad_sequences you don't specify a maxlen, which leads to padding sentences to the maximum length found in the corpus. So you'd have a 678-element-long paragraph context, even though 75% of the data is under 150 words long.

I'm not exactly sure how these values play into the length 512, but I hope you get my point. From the looks of it, you can do fine with a length of 150.
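
For example, assuming the tokenizer and tokenize_array helpers from the question are in scope, the only change needed at tokenization time is the max_len argument (150 is just an illustrative value, roughly matching the 75th percentile of paragraph word counts above); the Input shape of the downstream model, currently (2, 2, 512), would have to change to match:

MAX_LEN = 150   # illustrative cut-off based on the describe() output above

questions_tokenized = tokenize_array(trainingData["question"].values, max_len=MAX_LEN)
contexts_tokenized = tokenize_array(trainingData["paragraph_context"].values, max_len=MAX_LEN)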

Vocabulary sizes

You can also reduce the vocabulary.

A good way of deciding this number is to count the unique words that appear more than n times in your corpus (n can be 10-25, or better, do some further analysis and find an optimal value).

For example you can get vocabulary stats as follows.

counts = sorted([(k, v) for k, v in list(textTokenizer.word_counts.items())], key=lambda x: x[1])

This gives you (word, frequency) pairs. You will see that around 37,000 words appear fewer than (or roughly) 10 times. So you can set the vocabulary size of the tokenizer to something smaller.

textTokenizer = Tokenizer(num_words=50000, oov_token='unk')

But keep in mind that word_index still contains all the words, so you need to make sure you remove these rare words when you pass it as token_dict.
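
A sketch of that filtering, assuming a fitted Keras Tokenizer named textTokenizer as above and an illustrative cut-off n=10:

n = 10   # cut-off frequency; tune this for your corpus
kept_words = {w for w, c in textTokenizer.word_counts.items() if c >= n}
print(len(kept_words), "of", len(textTokenizer.word_counts), "words kept")

# Rebuild the word -> index mapping from the surviving words only
# (index 0 is reserved by Keras; 1 is used for the OOV token here).
filtered_index = {w: i + 2 for i, w in enumerate(sorted(kept_words))}
filtered_index['unk'] = 1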

Batch size

You seem to be setting batch_size=10, which should be fine. But to get better results (and hopefully with more memory available once you apply the above suggestions), go for a higher batch size like 32 or 64, which will improve performance.

Check out this Out-of-memory issues section on their github page.

Often it's because the batch size or sequence length is too large to fit in GPU memory. The following are the maximum batch configurations for a GPU with 12 GB of memory, as listed in the above link:

System       | Seq Length | Max Batch Size
------------ | ---------- | --------------
`BERT-Base`  | 64         | 64
...          | 128        | 32
...          | 256        | 16
...          | 320        | 14
...          | 384        | 12
...          | 512        | 6
`BERT-Large` | 64         | 12
...          | 128        | 6
...          | 256        | 2
...          | 320        | 1
...          | 384        | 0
...          | 512        | 0

Update

I see what you're doing here. The tensor with shape [786432,1604] that causes the error comes from the last layer, Dense(1604,activation="softmax")(combined): the first dimension, 786432 = 768*1024, comes from concatenating the flattened 768-d BERT features of two 512-token sequences, and the second dimension, 1604, I suppose is for all the possible locations or intervals of the predicted answer.

However, for sequence-labeling tasks like SQuAD, people usually don't use such a big fully connected layer. Instead, you can try applying the same weights at each position and then normalizing the sequence outputs with a softmax. This way you reduce the number of parameters in the final layer from 768*1024*1604 to something like 768*2, where the output dimension 2 is for predicting the start and end positions of the answer.

There's an example in the BERT GitHub repo that shows how to do SQuAD with BERT-like models, and there's a section in the BERT paper describing this.
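
Here is a minimal sketch of that kind of head (not the official SQuAD code, just the idea). sequence_output stands for a single BERT sequence output of shape [batch, seq_len, 768], e.g. the context branch before it is flattened:

from tensorflow.keras.layers import Dense, Lambda, Softmax

# Dense(2) is applied at every position with shared weights, giving per-position
# logits for "answer starts here" / "answer ends here".
logits = Dense(2)(sequence_output)                  # [batch, seq_len, 2]
start_logits = Lambda(lambda t: t[..., 0])(logits)  # [batch, seq_len]
end_logits = Lambda(lambda t: t[..., 1])(logits)    # [batch, seq_len]
start_probs = Softmax(axis=-1)(start_logits)        # softmax over positions
end_probs = Softmax(axis=-1)(end_logits)
# Parameters in this head: 768*2 weights (+2 biases), versus 786432*1604 before.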

Your problem is when you create this Dense() layer:

combined = Concatenate()([questions_flattened,context_flattened])
answers_network_output = Dense(1604,activation="softmax")(combined)

Concatenate() gives you a huge layer, and when you connect that to Dense(1604, ...) you get a (786432, 1604) weight tensor, which is about 1.26 billion values (weights plus biases, all floats). That will easily overflow your GPU memory.

To check if my assumption is correct, try to create layer:

answers_network_output = Dense(1604,activation="softmax")(something_smaller)

where something_smaller is a layer of smaller size than the concatenated one. Once you confirm that this is your problem, you'll find a way to use less memory than you do now.
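
For instance (a hedged sketch: something_smaller is just a placeholder built with a small Dense projection, and combined is the concatenated tensor from the question):

from tensorflow.keras.layers import Dense

# If the model now builds and trains without the OOM, the Dense(1604) on the
# full concatenation was indeed the allocation that blew up.
something_smaller = Dense(64, activation='relu')(combined)
answers_network_output = Dense(1604, activation='softmax')(something_smaller)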
