Training a BERT-based model causes an OutOfMemory error. How do I fix this?

Question

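My setup has an NVIDIA P100 GPU. I am using the Google BERT model for question answering on the SQuAD question-answering dataset, which gives me the questions and the paragraphs from which the answers should be drawn. My research suggests this architecture should be fine, but I keep getting OutOfMemory errors during training:

ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Below is a complete program that uses someone else's implementation of Google's BERT algorithm inside my own model. Please let me know what I can do to fix my error. Thank you!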

import json
import numpy as np
import pandas as pd
import os
assert os.path.isfile("train-v1.1.json"),"Non-existent file"
from tensorflow.python.client import device_lib
import tensorflow.compat.v1 as tf
#import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import re
regex = re.compile(r'\W+')
#Reading the files.
def readFile(filename):
  with open(filename) as file:
    fields = []
    JSON = json.loads(file.read())
    articles = []
    for article in JSON["data"]:
      articleTitle = article["title"]
      article_body = []
      for paragraph in article["paragraphs"]:
        paragraphContext = paragraph["context"]
        article_body.append(paragraphContext)
        for qas in paragraph["qas"]:
          question = qas["question"]
          answer = qas["answers"][0]
          fields.append({"question":question,"answer_text":answer["text"],"answer_start":answer["answer_start"],"paragraph_context":paragraphContext,"article_title":articleTitle})
      article_body = "\\n".join(article_body)
      article = {"title":articleTitle,"body":article_body}
      articles.append(article)
  fields = pd.DataFrame(fields)
  fields["question"] = fields["question"].str.replace(regex," ")
  assert not (fields["question"].str.contains("catalanswhat").any())
  fields["paragraph_context"] = fields["paragraph_context"].str.replace(regex," ")
  fields["answer_text"] = fields["answer_text"].str.replace(regex," ")
  assert not (fields["paragraph_context"].str.contains("catalanswhat").any())
  fields["article_title"] = fields["article_title"].str.replace("_"," ")
  assert not (fields["article_title"].str.contains("catalanswhat").any())
  return fields,JSON["data"]
trainingData,training_JSON = readFile("train-v1.1.json")
print("JSON dataset read.")
#Text preprocessing
## Converting text to skipgrams
print("Tokenizing sentences.")
strings = trainingData.drop("answer_start",axis=1)
strings = strings.values.flatten()

answer_start_train_one_hot = pd.get_dummies(trainingData["answer_start"])

# @title Keras-BERT Environment
import os
pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
# Use TF_Keras
os.environ["TF_KERAS"] = "1"

# @title Load Basic Model
import codecs
from keras_bert import load_trained_model_from_checkpoint
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

model = load_trained_model_from_checkpoint(config_path, checkpoint_path)

#@title Model Summary
model.summary()

#@title Create tokenization stuff.
from keras_bert import Tokenizer

tokenizer = Tokenizer(token_dict)
def tokenize(text,max_len):
  tokenizer.tokenize(text)
  return tokenizer.encode(first=text,max_len=max_len)
def tokenize_array(texts,max_len=512):
  indices = np.zeros((texts.shape[0],max_len))
  segments = np.zeros((texts.shape[0],max_len))
  for i in range(texts.shape[0]):
    tokens = tokenize(texts[i],max_len)
    indices[i] = tokens[0]
    segments[i] = tokens[1]
  #print(indices.shape)
  #print(segments.shape)
  return np.stack([segments,indices],axis=1)

#@ Tokenize inputs.
def X_Y(dataset,answer_start_one_hot,batch_size=10):
    questions = dataset["question"]
    contexts = dataset["paragraph_context"]
    questions_tokenized = tokenize_array(questions.values)
    contexts_tokenized = tokenize_array(contexts.values)
    X = np.stack([questions_tokenized,contexts_tokenized],axis=1)
    Y = answer_start_one_hot
    return X,Y
def X_Y_generator(dataset,answer_start_one_hot,batch_size=10):
    while True:
        try:
            batch_indices = np.random.choice(np.arange(0,dataset.shape[0]),size=batch_size)
            dataset_batch = dataset.iloc[batch_indices]
            X,Y = X_Y(dataset_batch,answer_start_one_hot.iloc[batch_indices])
            max_int = pd.concat((trainingData["answer_start"],devData["answer_start"])).max()
            yield (X,Y)
        except Exception as e:
            print("Unhandled exception in X_Y_generator: ",e)
            raise

model.trainable = True

answers_network_checkpoint = ModelCheckpoint('answers_network-best.h5', verbose=1, monitor='val_loss',save_best_only=True, mode='auto')

input_layer = Input(shape=(2,2,512,))
print("input layer: ",input_layer.shape)
questions_input_layer = Lambda(lambda x: x[:,0])(input_layer)
context_input_layer = Lambda(lambda x: x[:,1])(input_layer)
print("questions input layer: ",questions_input_layer.shape)
print("context input layer: ",context_input_layer.shape)
questions_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(questions_input_layer)
print("questions indices layer: ",questions_indices_layer.shape)
questions_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(questions_input_layer)
print("questions segments layer: ",questions_segments_layer.shape)
context_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(context_input_layer)
context_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(context_input_layer)
questions_bert_layer = model([questions_indices_layer,questions_segments_layer])
print("Questions bert layer loaded.")
context_bert_layer = model([context_indices_layer,context_segments_layer])
print("Context bert layer loaded.")
questions_flattened = Flatten()(questions_bert_layer)
context_flattened = Flatten()(context_bert_layer)
combined = Concatenate()([questions_flattened,context_flattened])
#bert_dense_questions = Dense(256,activation="sigmoid")(questions_flattened)
#bert_dense_context = Dense(256,activation="sigmoid")(context_flattened)
answers_network_output = Dense(1604,activation="softmax")(combined)
#answers_network = Model(inputs=[input_layer],outputs=[questions_bert_layer,context_bert_layer])
answers_network = Model(inputs=[input_layer],outputs=[answers_network_output])
answers_network.summary()

answers_network.compile("adam","categorical_crossentropy",metrics=["accuracy"])

answers_network.fit_generator(
    X_Y_generator(
        trainingData,
        answer_start_train_one_hot,
        batch_size=10),
    steps_per_epoch=100,
    epochs=100,
    callbacks=[answers_network_checkpoint])

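My vocabulary size is about 83,000 words. Any model with a "good" accuracy/F1 score is preferred, but I also have a non-extendable deadline of 5 days.

EDIT:

Unfortunately, there is one thing I did not mention: I am actually using CyberZHG's keras-bert module for preprocessing as well as for the actual BERT model, so some optimizations may break the code. For example, I tried setting the default float type to float16, but that caused a compatibility error.

EDIT #2:

As requested, here is the code for my full program:

Jupyter notebook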

Answer 1

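EDIT: I have edited my answer in place rather than adding to the length of an already long answer.

After looking into the issue, the problem comes from the last layer in your model. I was able to get it to work with the following fixes/changes.

ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

So, looking at the error, the problem is that it cannot allocate an array of shape [786432,1604]. If you do a simple calculation, that is a 5GB array (assuming float32). With float64 it goes up to 10GB. Add the parameters coming from BERT and the other layers in the model and, voilà, you run out of memory.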
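The "simple calculation" can be spelled out as a few lines of arithmetic. This is only a sketch of the estimate: the shape is taken from the error message, and the helper name `kernel_size_gb` is made up for illustration.

```python
# Back-of-the-envelope size of the Dense kernel named in the OOM message.
rows, cols = 786432, 1604          # shape reported by the error
elements = rows * cols             # number of weights in the kernel

BYTES_PER_DTYPE = {"float16": 2, "float32": 4, "float64": 8}

def kernel_size_gb(dtype: str) -> float:
    """Size of the kernel in gigabytes (10**9 bytes) for the given dtype."""
    return elements * BYTES_PER_DTYPE[dtype] / 1e9

for dtype in BYTES_PER_DTYPE:
    print(f"{dtype}: {kernel_size_gb(dtype):.2f} GB")
```

This covers the kernel alone. The Adam optimizer used in the question keeps two additional values per weight, so the training-time footprint of this one layer would be roughly three times larger still.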

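Issues

Data type

Looking at the code, all the layers in your answers network produce float64, because you specify float64 for all the Lambda layers. So my first suggestion is,

Setting it globally should fix the issue: tf.keras.backend.set_floatx('float16')

And as a precaution,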

question_indices_layer = Input(shape=(256,), dtype='float16')
question_segments_layer = Input(shape=(256,), dtype='float16')
context_indices_layer = Input(shape=(256,), dtype='float16')
context_segments_layer = Input(shape=(256,), dtype='float16')
questions_bert_layer = model([question_indices_layer,question_segments_layer])
context_bert_layer = model([context_indices_layer,context_segments_layer])
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])

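Now all of your layers will be float16.

Squash the output before the last `softmax` layer

Another thing you can do is, instead of passing the huge [batch size, 512, 768] output to your dense layer, squash it with a smaller layer or some transformation. A few things you can try:

Add a smaller dense layer that reduces the dimensionality before feeding it to the 1604-way softmax layer. This reduces the model parameters significantly.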

questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])

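Sum/average the question output over the time dimension. Since you only care about understanding what the question is, losing the positional information from that output is OK. You can do this in the following way,

# K is the Keras backend, e.g. from tensorflow.keras import backend as K
questions_flattened = Lambda(lambda x: K.sum(x, axis=1))(questions_bert_layer)

Instead of Concatenate, try Add() so that you don't increase the dimensionality.
You can try any of these (optionally combined with the others in the list). But make sure the dimensions of questions_flattened and contexts_flattened match when you combine them, otherwise you will get errors.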
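To get a feel for how much each option saves, the parameter counts of the output head can be compared directly. The following is a rough sketch, not the answer's own code. `dense_params` is a hypothetical helper, and NumPy arrays stand in for the BERT outputs. The third variant also sums the context branch over time, which is an assumption made here so that Add() has matching shapes. It would discard the positional information of the paragraph, so treat it as a size comparison only.

```python
import numpy as np

def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

seq_len, hidden, n_classes = 512, 768, 1604
flat = seq_len * hidden                      # 393216 features per flattened BERT output

# Original head: concatenate both flattened outputs, then Dense(1604).
original = dense_params(2 * flat, n_classes)

# Option 1: Dense(64) on each branch, concatenate, then Dense(1604).
bottleneck = 2 * dense_params(flat, 64) + dense_params(2 * 64, n_classes)

# Options 2 and 3 together: sum over the time axis on both branches
# (assumption: the context branch is pooled too), Add(), then Dense(1604).
pooled_add = dense_params(hidden, n_classes)

# Shape check of the pooling and Add() steps with NumPy stand-ins.
batch = 2
q = np.ones((batch, seq_len, hidden), dtype=np.float32)
c = np.ones((batch, seq_len, hidden), dtype=np.float32)
q_pooled = q.sum(axis=1)                     # (batch, 768)
c_pooled = c.sum(axis=1)                     # (batch, 768)
added = q_pooled + c_pooled                  # same width as each input
concatenated = np.concatenate([q_pooled, c_pooled], axis=1)  # twice the width

print(original, bottleneck, pooled_add)
print(added.shape, concatenated.shape)
```

With these sizes the head shrinks from about 1.26 billion parameters to about 50 million with the bottleneck layers, and to about 1.2 million with pooling and Add().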

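Length of sequences

The next problem is that your input length is 512. I'm not sure how you arrived at this number, but I think you can do well with something below it. For example, you get the following statistics for the questions and paragraphs.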

count    175198.000000
mean         11.217582
std           3.597345
min           1.000000
25%           9.000000
50%          11.000000
75%          13.000000
max          41.000000
Name: question, dtype: float64

count    175198.000000
mean        123.791653
std          50.541241
min          21.000000
25%          92.000000
50%         114.000000
75%         147.000000
max         678.000000
Name: paragraph_context, dtype: float64

You can get this information with,

pd.Series(trainingData["question"]).str.split(' ').str.len().describe()

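For example, when you pad your sequences with pad_sequences, you do not specify maxlen, so the sentences are padded to the maximum length found in the corpus. Here, the longest paragraph context has 678 elements, whereas 75% of the data is under 150 words long.

I'm not quite sure how these values play into the length of 512, but I hope you get my point. From the looks of it, a length of 150 would seem to do well.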
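One way to turn such statistics into a concrete maximum length is to pick the smallest length that covers a chosen share of the samples. The sketch below assumes a hypothetical helper `length_covering` and toy word counts, since the real values would come from the dataset.

```python
import math

def length_covering(lengths, fraction):
    """Smallest length L such that at least `fraction` of the samples have length <= L."""
    ordered = sorted(lengths)
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]

# Hypothetical word counts standing in for trainingData["paragraph_context"].
toy_lengths = [21, 60, 92, 100, 114, 120, 147, 150, 200, 678]

max_len_90 = length_covering(toy_lengths, 0.90)   # covers 9 of the 10 samples
max_len_all = length_covering(toy_lengths, 1.0)   # the corpus maximum

print(max_len_90, max_len_all)
```

Keep in mind that these are word counts. BERT's WordPiece tokenizer can split one word into several tokens and adds special tokens such as [CLS] and [SEP], so the chosen length probably needs some headroom.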

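Vocabulary size

You can also reduce the vocabulary.

A good way to decide this number is to set it to the number of unique words that appear more than n times in your corpus (n can be 10-25, or better, do some further analysis and find an optimal value).

For example, you can get the vocabulary statistics as follows.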

counts = sorted([(k, v) for k, v in list(textTokenizer.word_counts.items())], key=lambda x: x[1])

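This gives you (word, frequency) pairs. You will see that around 37,000 words appear fewer than (or approximately) 10 times. So you can set the tokenizer's vocabulary size to something smaller.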

textTokenizer = Tokenizer(num_words=50000, oov_token='unk')

But remember that word_index still contains all the words. So when you pass it as token_dict, you need to make sure you remove these rare words.
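Dropping the rare words can be sketched with the standard library alone. The helper `frequent_vocab` and the toy corpus below are illustrative assumptions, not part of the original program.

```python
from collections import Counter

def frequent_vocab(texts, min_count):
    """Return the words that occur at least `min_count` times, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, count in counts.most_common() if count >= min_count]

# Hypothetical three-sentence corpus; the real input would be the SQuAD text columns.
toy_corpus = [
    "what is the capital of france",
    "what is the largest city in france",
    "the capital of france is paris",
]
kept = frequent_vocab(toy_corpus, min_count=2)

print(kept)
```

Note that a pretrained BERT checkpoint expects the token ids from its own vocab.txt, so this kind of pruning is only safe for a tokenizer and embedding that you train yourself.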

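Batch size

You seem to be setting batch_size=10, which should be fine. But to get better results (and hopefully with more memory available after you implement the suggestions above), go for a higher batch size such as 32 or 64, which will improve performance.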

Answer 2

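Check out the out-of-memory issues section on their GitHub page.

It is often because the batch size or the sequence length is too large to fit in GPU memory. The following are the maximum batch size configurations for a GPU with 12GB of memory, as listed in the link above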

System       | Seq Length | Max Batch Size
------------ | ---------- | --------------
`BERT-Base`  | 64         | 64
...          | 128        | 32
...          | 256        | 16
...          | 320        | 14
...          | 384        | 12
...          | 512        | 6
`BERT-Large` | 64         | 12
...          | 128        | 6
...          | 256        | 2
...          | 320        | 1
...          | 384        | 0
...          | 512        | 0
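The table can be turned into a quick sanity check. The sketch below copies the values above into a dictionary. The helper names `max_batch` and `fits` are made up, and the numbers are only a rough guide, since they were measured for the reference implementation on a 12GB GPU rather than for this model.

```python
# Maximum batch sizes for a 12GB GPU, copied from the table above.
MAX_BATCH_12GB = {
    "BERT-Base":  {64: 64, 128: 32, 256: 16, 320: 14, 384: 12, 512: 6},
    "BERT-Large": {64: 12, 128: 6, 256: 2, 320: 1, 384: 0, 512: 0},
}

def max_batch(model: str, seq_len: int) -> int:
    """Listed batch size for the smallest tabulated length >= seq_len."""
    table = MAX_BATCH_12GB[model]
    for length in sorted(table):
        if seq_len <= length:
            return table[length]
    raise ValueError(f"seq_len {seq_len} exceeds the longest tabulated length")

def fits(model: str, seq_len: int, sequences_per_step: int) -> bool:
    """True if the number of sequences per step is within the tabulated limit."""
    return sequences_per_step <= max_batch(model, seq_len)

# The question's model runs BERT on a question and a context for each of the
# 10 examples in a batch, i.e. 20 sequences of length 512 per step.
print(fits("BERT-Base", 512, 20))
print(fits("BERT-Base", 128, 20))
```

Under these assumptions, 20 sequences of length 512 are well above the listed limit of 6, while the same 20 sequences at length 128 would be within it.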

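Update

I see what you are doing here. The tensor with shape[786432,1604] that causes the error comes from the last layer, Dense(1604,activation="softmax")(combined), where the first dimension 786432 = 768*1024 comes from concatenating the 768-d BERT features of two length-512 sequences, and I assume the second dimension, 1604, covers all the possible positions or intervals of the predicted answer.

However, for a sequence labeling task like SQuAD, people usually don't use such a large fully connected layer. Instead, you can apply the same weights at each position and then normalize the sequence output with a softmax. This way you can reduce the number of parameters in the last layer from 768*1024*1604 to something like 768*2, where the output dimension of 2 is for predicting the start and end positions of the answer.

There is an example in the BERT GitHub repo that shows how to do SQuAD with BERT-like models. There is also a section in the BERT paper describing this.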
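The shared-weights idea can be illustrated with plain NumPy. This is a minimal sketch of such an output head under assumed sizes, with random numbers standing in for the encoder output. It is not the code from the BERT repository, and `span_head` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def span_head(sequence_output, weights, bias):
    """Apply one shared (hidden, 2) projection to every position.

    sequence_output: (batch, seq_len, hidden) encoder output
    returns: start and end probabilities, each of shape (batch, seq_len)
    """
    logits = sequence_output @ weights + bias      # (batch, seq_len, 2)
    probs = softmax(logits, axis=1)                # normalise over positions
    return probs[..., 0], probs[..., 1]

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 384, 768               # illustrative sizes
encoder_output = rng.standard_normal((batch, seq_len, hidden)).astype(np.float32)
weights = (rng.standard_normal((hidden, 2)) * 0.02).astype(np.float32)
bias = np.zeros(2, dtype=np.float32)

start_probs, end_probs = span_head(encoder_output, weights, bias)
n_params = weights.size + bias.size                # 768 * 2 + 2

print(start_probs.shape, end_probs.shape, n_params)
```

The head has 1,538 parameters regardless of the sequence length. In the setup described in the BERT paper, the question and the paragraph are packed into a single input sequence, so one encoder pass produces the output that this head consumes.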

Answer 3

Your problem is when you create this Dense() layer:

combined = Concatenate()([questions_flattened,context_flattened])
answers_network_output = Dense(1604,activation="softmax")(combined)

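Concatenate() gives you a huge layer, and when you connect it to Dense(1604, ...) you get a (786432,1604) tensor, that is, about 1.2G values (weights + biases, both floats), which easily overflows your GPU memory.

To check whether my hypothesis is right, try creating the layer: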

answers_network_output = Dense(1604,activation="softmax")(something_smaller)

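where something_smaller is a layer smaller than the concatenated one. Once you confirm that this is your problem, you can look for an approach that uses less memory than the current one.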

3 solutions

Solution 1
6 Accepted 2020-01-09 02:46:58

Data type

Squash the output before the last `softmax` layer

Length of sequences

Vocabulary size

Batch size

Solution 2
5 2020-01-09 01:59:44

Solution 3
0 2020-01-09 02:43:34
