Training a BERT-based model causes an OutOfMemory error. How do I fix this?

Question

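My setup has an NVIDIA P100 GPU. I am using Google's BERT model for question answering on the SQuAD dataset, which gives me questions together with the paragraphs the answers should be drawn from. My research says this architecture should be fine, but I keep getting OutOfMemory errors during training:

ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Below is a full program that uses someone else's implementation of Google's BERT algorithm inside my own model. Please let me know what I can do to fix my error. Thank you!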

import json
import numpy as np
import pandas as pd
import os
assert os.path.isfile("train-v1.1.json"),"Non-existent file"
from tensorflow.python.client import device_lib
import tensorflow.compat.v1 as tf
#import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import re
regex = re.compile(r'\W+')
#Reading the files.
def readFile(filename):
  with open(filename) as file:
    fields = []
    JSON = json.loads(file.read())
    articles = []
    for article in JSON["data"]:
      articleTitle = article["title"]
      article_body = []
      for paragraph in article["paragraphs"]:
        paragraphContext = paragraph["context"]
        article_body.append(paragraphContext)
        for qas in paragraph["qas"]:
          question = qas["question"]
          answer = qas["answers"][0]
          fields.append({"question":question,"answer_text":answer["text"],"answer_start":answer["answer_start"],"paragraph_context":paragraphContext,"article_title":articleTitle})
      article_body = "\\n".join(article_body)
      article = {"title":articleTitle,"body":article_body}
      articles.append(article)
  fields = pd.DataFrame(fields)
  fields["question"] = fields["question"].str.replace(regex," ")
  assert not (fields["question"].str.contains("catalanswhat").any())
  fields["paragraph_context"] = fields["paragraph_context"].str.replace(regex," ")
  fields["answer_text"] = fields["answer_text"].str.replace(regex," ")
  assert not (fields["paragraph_context"].str.contains("catalanswhat").any())
  fields["article_title"] = fields["article_title"].str.replace("_"," ")
  assert not (fields["article_title"].str.contains("catalanswhat").any())
  return fields,JSON["data"]
trainingData,training_JSON = readFile("train-v1.1.json")
print("JSON dataset read.")
#Text preprocessing
## Converting text to skipgrams
print("Tokenizing sentences.")
strings = trainingData.drop("answer_start",axis=1)
strings = strings.values.flatten()

answer_start_train_one_hot = pd.get_dummies(trainingData["answer_start"])

# @title Keras-BERT Environment
import os
pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
# Use TF_Keras
os.environ["TF_KERAS"] = "1"

# @title Load Basic Model
import codecs
from keras_bert import load_trained_model_from_checkpoint
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

model = load_trained_model_from_checkpoint(config_path, checkpoint_path)

#@title Model Summary
model.summary()

#@title Create tokenization stuff.
from keras_bert import Tokenizer

tokenizer = Tokenizer(token_dict)
def tokenize(text,max_len):
  tokenizer.tokenize(text)
  return tokenizer.encode(first=text,max_len=max_len)
def tokenize_array(texts,max_len=512):
  indices = np.zeros((texts.shape[0],max_len))
  segments = np.zeros((texts.shape[0],max_len))
  for i in range(texts.shape[0]):
    tokens = tokenize(texts[i],max_len)
    indices[i] = tokens[0]
    segments[i] = tokens[1]
  #print(indices.shape)
  #print(segments.shape)
  return np.stack([segments,indices],axis=1)

#@ Tokenize inputs.
def X_Y(dataset,answer_start_one_hot,batch_size=10):
    questions = dataset["question"]
    contexts = dataset["paragraph_context"]
    questions_tokenized = tokenize_array(questions.values)
    contexts_tokenized = tokenize_array(contexts.values)
    X = np.stack([questions_tokenized,contexts_tokenized],axis=1)
    Y = answer_start_one_hot
    return X,Y
def X_Y_generator(dataset,answer_start_one_hot,batch_size=10):
    while True:
        try:
            batch_indices = np.random.choice(np.arange(0,dataset.shape[0]),size=batch_size)
            dataset_batch = dataset.iloc[batch_indices]
            X,Y = X_Y(dataset_batch,answer_start_one_hot.iloc[batch_indices])
            max_int = pd.concat((trainingData["answer_start"],devData["answer_start"])).max()
            yield (X,Y)
        except Exception as e:
            print("Unhandled exception in X_Y_generator: ",e)
            raise

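# Imports needed by the layers, model, and callback used below (tf.keras, since TF_KERAS=1).
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Concatenate, Dense, Flatten, Input, Lambda
from tensorflow.keras.models import Model

model.trainable = True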

answers_network_checkpoint = ModelCheckpoint('answers_network-best.h5', verbose=1, monitor='val_loss',save_best_only=True, mode='auto')

input_layer = Input(shape=(2,2,512,))
print("input layer: ",input_layer.shape)
questions_input_layer = Lambda(lambda x: x[:,0])(input_layer)
context_input_layer = Lambda(lambda x: x[:,1])(input_layer)
print("questions input layer: ",questions_input_layer.shape)
print("context input layer: ",context_input_layer.shape)
questions_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(questions_input_layer)
print("questions indices layer: ",questions_indices_layer.shape)
questions_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(questions_input_layer)
print("questions segments layer: ",questions_segments_layer.shape)
context_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(context_input_layer)
context_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(context_input_layer)
questions_bert_layer = model([questions_indices_layer,questions_segments_layer])
print("Questions bert layer loaded.")
context_bert_layer = model([context_indices_layer,context_segments_layer])
print("Context bert layer loaded.")
questions_flattened = Flatten()(questions_bert_layer)
context_flattened = Flatten()(context_bert_layer)
combined = Concatenate()([questions_flattened,context_flattened])
#bert_dense_questions = Dense(256,activation="sigmoid")(questions_flattened)
#bert_dense_context = Dense(256,activation="sigmoid")(context_flattened)
answers_network_output = Dense(1604,activation="softmax")(combined)
#answers_network = Model(inputs=[input_layer],outputs=[questions_bert_layer,context_bert_layer])
answers_network = Model(inputs=[input_layer],outputs=[answers_network_output])
answers_network.summary()

answers_network.compile("adam","categorical_crossentropy",metrics=["accuracy"])

answers_network.fit_generator(
    X_Y_generator(
        trainingData,
        answer_start_train_one_hot,
        batch_size=10),
    steps_per_epoch=100,
    epochs=100,
    callbacks=[answers_network_checkpoint])

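My vocabulary size is roughly 83,000 words. Any model with "good" accuracy/F1 score is preferred, but I also have a non-extendable deadline of 5 days.

EDIT:

Unfortunately, there is one thing I didn't mention: I am using CyberZHG's keras-bert module both for preprocessing and for the actual BERT model, so some optimizations may break the code. For example, I tried setting the default float type to float16, but that caused a compatibility error.

EDIT #2:

As requested, here is the code for my full program:

Jupyter notebook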

Answer 1

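Edit: I have edited my answer in place rather than adding to the length of an already long answer.

After looking at the problem, the issue comes from the final layer in your model. I was able to get it working with the following fixes/changes.

ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

So, looking at the error, the problem is that TensorFlow cannot allocate an array of shape [786432,1604]. A simple calculation shows that this single array takes about 5GB (assuming float32). With float64 it reaches about 10GB. Add the parameters coming from BERT and the other layers in the model and, voilà, you run out of memory.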
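The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `tensor_bytes` is made up for illustration, and it only counts the single dense kernel from the error message, not BERT's own weights or any optimizer state.

```python
# Size of the [786432, 1604] dense kernel from the error message.
# tensor_bytes is a hypothetical helper, not part of TensorFlow or Keras.

BYTES_PER_ELEMENT = {"float16": 2, "float32": 4, "float64": 8}


def tensor_bytes(shape, dtype):
    """Return the number of bytes needed to store a dense tensor."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * BYTES_PER_ELEMENT[dtype]


kernel_shape = (786432, 1604)
for dtype in ("float16", "float32", "float64"):
    size_gb = tensor_bytes(kernel_shape, dtype) / 1e9
    print(f"{dtype}: {size_gb:.2f} GB")
```

Even at float16 the kernel alone needs roughly 2.5 GB, which is why the sections below focus on shrinking the layer and not only on changing the data type.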

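Issues

Data type

Looking at the code, all the layers in your answers network produce float64, because you specified float64 for every Lambda layer. So my first suggestion is to set the float type globally, which should solve the issue:

tf.keras.backend.set_floatx('float16')

And as a precaution,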

question_indices_layer = Input(shape=(256,), dtype='float16')
question_segments_layer = Input(shape=(256,), dtype='float16')
context_indices_layer = Input(shape=(256,), dtype='float16')
context_segments_layer = Input(shape=(256,), dtype='float16')
questions_bert_layer = model([question_indices_layer,question_segments_layer])
context_bert_layer = model([context_indices_layer,context_segments_layer])
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])

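Now all your layers will be float16.

Squash the output before the final `softmax` layer

Another thing you can do is, instead of passing the massive [batch size, 512, 768] output straight to the dense layer, squash it first with a smaller layer or some transformation. A few things you can try:

Add a smaller dense layer that reduces the dimensionality before feeding it to the 1604-way softmax layer. This reduces the number of model parameters significantly.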

questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])

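Sum or average the question output over the time dimension. Since you only care about understanding what the question is, it is fine to lose positional information from that output. You can do this as follows,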

questions_flattened = Lambda(lambda x: K.sum(x, axis=1))(questions_bert_layer)

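Instead of Concatenate, try Add() so that you don't increase the dimensionality.
You can try any of these (optionally in combination with the others in the list). But make sure the dimensions of questions_flattened and contexts_flattened match when you combine them, otherwise you will get errors.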
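To see how these options change the width of what reaches the softmax layer, here is a NumPy-only sketch. Random arrays stand in for the BERT outputs (assumed shape [batch, 512, 768], as stated above), and plain array operations stand in for the Keras layers, so this illustrates shapes only, not training.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 512, 768

# Stand-ins for the outputs of questions_bert_layer / context_bert_layer.
questions_out = rng.standard_normal((batch, seq_len, hidden), dtype=np.float32)
contexts_out = rng.standard_normal((batch, seq_len, hidden), dtype=np.float32)

# Original approach: flatten each output, then concatenate.
flat_concat = np.concatenate(
    [questions_out.reshape(batch, -1), contexts_out.reshape(batch, -1)], axis=1
)

# Option: sum (or average) over the time axis, like K.sum(x, axis=1).
questions_pooled = questions_out.sum(axis=1)
contexts_pooled = contexts_out.mean(axis=1)

# Option: element-wise add instead of concatenate; the widths must match.
added = questions_pooled + contexts_pooled

print(flat_concat.shape)  # (2, 786432)
print(added.shape)        # (2, 768)
```

The width of the input to the final Dense layer drops from 786432 to 768, and its kernel shrinks by the same factor.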

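Length of sequences

The next issue is that your input length is 512. I'm not sure how you arrived at this number, but I think you can do well with something lower. For example, these are the statistics for the questions and the paragraphs.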

count    175198.000000
mean         11.217582
std           3.597345
min           1.000000
25%           9.000000
50%          11.000000
75%          13.000000
max          41.000000
Name: question, dtype: float64

count    175198.000000
mean        123.791653
std          50.541241
min          21.000000
25%          92.000000
50%         114.000000
75%         147.000000
max         678.000000
Name: paragraph_context, dtype: float64

You can get this information with,

pd.Series(trainingData["question"]).str.split(' ').str.len().describe()

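For example, when you pad your sequences with pad_sequences, you don't specify maxlen, which results in padding every sentence to the maximum length found in the corpus. You have one paragraph context that is 678 elements long, whereas 75% of the data is under 150 words long.

I'm not too sure how these values relate to the length of 512, but I hope you get my point. From the looks of it, a length of 150 seems like it would do just fine.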
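One way to turn such statistics into a concrete sequence length is to pick a percentile of the length distribution. The sketch below is an illustration under assumptions: `pick_max_len` and the toy `lengths` list are hypothetical, and the counts are whitespace-split words as in the describe() output above. BERT's WordPiece tokenizer usually produces more tokens than words, and the special [CLS] and [SEP] tokens take additional positions, so leave some headroom.

```python
import math


def pick_max_len(lengths, coverage=0.95):
    """Smallest length that covers at least `coverage` of the samples."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Nearest-rank percentile: position of the sample at the requested coverage.
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]


# Toy word counts, for illustration only (not taken from SQuAD).
lengths = [21, 92, 92, 100, 114, 114, 120, 147, 150, 678]

print(pick_max_len(lengths, coverage=0.5))   # 114
print(pick_max_len(lengths, coverage=0.75))  # 147
print(pick_max_len(lengths, coverage=1.0))   # 678
```

With the real data, `lengths` could come from something like trainingData["paragraph_context"].str.split(' ').str.len().tolist(). A single outlier (678 here) forces a much longer padded length than the bulk of the data needs.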

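Vocabulary size

You can reduce the vocabulary size too.

A good way to decide this number is to use the number of unique words that appear more than n times in your corpus (n can be 10-25, or better, do some further analysis and find an optimal value).

For example, you can get the vocabulary statistics as follows.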

counts = sorted([(k, v) for k, v in list(textTokenizer.word_counts.items())], key=lambda x: x[1])

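This gives you (word, frequency) pairs. You will see that around 37,000 words appear fewer than (or around) 10 times. So you can set the tokenizer's vocabulary size to something smaller.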

textTokenizer = Tokenizer(num_words=50000, oov_token='unk')

But remember that word_index still contains all the words. So when you pass it as token_dict, you need to make sure you remove those rare words.
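A sketch of that filtering step, using only the standard library. `build_vocab`, `encode`, the `min_count` threshold, and the toy corpus are assumptions for illustration; with the Keras Tokenizer you would start from its word_counts instead of counting words yourself. This also assumes the vocabulary is one you control, since a pretrained BERT checkpoint is tied to its own vocab.txt.

```python
from collections import Counter


def build_vocab(texts, min_count=2, oov_token="unk"):
    """Map each word seen at least `min_count` times to an integer id."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {oov_token: 0}  # id 0 is reserved for out-of-vocabulary words
    for word, count in counts.most_common():
        if count >= min_count and word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, oov_token="unk"):
    """Convert text to ids, sending rare or unseen words to the OOV id."""
    return [vocab.get(word, vocab[oov_token]) for word in text.lower().split()]


corpus = ["the cat sat", "the cat ran", "a dog ran"]
vocab = build_vocab(corpus, min_count=2)
print(vocab)                          # {'unk': 0, 'the': 1, 'cat': 2, 'ran': 3}
print(encode("the dog sat", vocab))   # [1, 0, 0]
```

Words below the threshold never receive an id of their own, so the dictionary handed to the model contains only the frequent words plus the OOV token.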

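Batch size

You seem to be setting batch_size=10, which should be fine. But to get better results (and hopefully to use the memory freed up once you apply the suggestions above), go for a larger batch size such as 32 or 64, which will improve performance.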

Answer 2

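Check out the out-of-memory issues section on their GitHub page.

Often this happens because the batch size or sequence length is too large to fit in GPU memory. The following are the maximum batch configurations for a GPU with 12GB of memory, as listed in the link above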

System       | Seq Length | Max Batch Size
------------ | ---------- | --------------
`BERT-Base`  | 64         | 64
...          | 128        | 32
...          | 256        | 16
...          | 320        | 14
...          | 384        | 12
...          | 512        | 6
`BERT-Large` | 64         | 12
...          | 128        | 6
...          | 256        | 2
...          | 320        | 1
...          | 384        | 0
...          | 512        | 0
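The table can be turned into a small lookup to sanity-check a planned configuration. The numbers are copied from the table above (12GB GPU, as listed there); the function name `max_batch_size` and the choice to round a sequence length up to the next listed row are assumptions for illustration.

```python
import bisect

# (sequence length, max batch size) pairs copied from the table above.
MAX_BATCH = {
    "BERT-Base": [(64, 64), (128, 32), (256, 16), (320, 14), (384, 12), (512, 6)],
    "BERT-Large": [(64, 12), (128, 6), (256, 2), (320, 1), (384, 0), (512, 0)],
}


def max_batch_size(model_name, seq_length):
    """Look up the max batch size, rounding seq_length up to the next row."""
    rows = MAX_BATCH[model_name]
    lengths = [length for length, _ in rows]
    idx = bisect.bisect_left(lengths, seq_length)
    if idx == len(rows):
        raise ValueError(f"no entry for sequence length {seq_length}")
    return rows[idx][1]


print(max_batch_size("BERT-Base", 512))  # 6
print(max_batch_size("BERT-Base", 150))  # 16
```

By these numbers, the question's batch_size=10 at sequence length 512 is already above the listed maximum of 6 for BERT-Base on a 12GB card.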

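Update

I see what you're doing here. The tensor with shape[786432,1604] that causes the error comes from the last layer, Dense(1604,activation="softmax")(combined), where the first dimension 786432 = 768*1024 comes from concatenating the 768-d BERT features of two 512-length sequences, and I assume the second dimension, 1604, covers all possible positions or spans of the predicted answer.

However, for a sequence-labeling task like SQuAD, people usually don't use such a big fully connected layer. Instead, you can try applying the same weights at each position and then normalizing the sequence output with a softmax. This way you can reduce the number of parameters in the last layer from 768*1024*1604 to something like 768*2, where the output dimension 2 is for predicting the start and end positions of the answer.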
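A NumPy sketch of such a per-position head, to make the parameter saving concrete. Random values stand in for the BERT sequence output and for the trained weights (both are assumptions for illustration); this mirrors the general idea described above and is not the exact code from the BERT repo.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 512, 768

# Stand-in for BERT's per-token output: [batch, seq_len, hidden].
sequence_output = rng.standard_normal((batch, seq_len, hidden))

# One shared weight matrix applied at every position: hidden -> 2 logits.
weights = rng.standard_normal((hidden, 2)) * 0.02
bias = np.zeros(2)

logits = sequence_output @ weights + bias      # [batch, seq_len, 2]
start_probs = softmax(logits[..., 0], axis=1)  # normalized over positions
end_probs = softmax(logits[..., 1], axis=1)

print(weights.size + bias.size)                # 1538 parameters
print(start_probs.shape, end_probs.shape)      # (2, 512) (2, 512)
print(start_probs.argmax(axis=1), end_probs.argmax(axis=1))
```

Because the same 768x2 matrix is reused at every position, the head has 1538 parameters regardless of the sequence length, and the predicted start and end are simply the positions with the highest probability.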

There is an example in the BERT GitHub repo showing how to do SQuAD with BERT-like models. There is also a section in the BERT paper describing this.

Answer 3

Your problem is when you create this Dense() layer:

combined = Concatenate()([questions_flattened,context_flattened])
answers_network_output = Dense(1604,activation="softmax")(combined)

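Concatenate() gives you a huge layer, and when you connect it to Dense(1604, ...) you get a (786432, 1604) tensor, that is, about 1.26 billion values (weights + biases, all floats), which easily overflows your GPU memory.

To check whether my hypothesis is right, try creating the layer: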

answers_network_output = Dense(1604,activation="softmax")(something_smaller)

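where something_smaller is a layer smaller than the concatenated one. Once you have confirmed that this is your problem, you can look for an approach that uses less memory than the current one.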
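To get a feel for how much smaller something_smaller needs to be, the parameter count of a dense layer can be computed by hand. `dense_params` is a hypothetical helper, and the candidate widths are illustrative assumptions, not measurements from the question's notebook.

```python
def dense_params(input_width, units):
    """Weights plus biases of a fully connected layer."""
    return input_width * units + units


candidates = {
    "flattened + concatenated (2 * 512 * 768)": 2 * 512 * 768,
    "pooled over positions (768)": 768,
    "two 64-unit projections concatenated (128)": 128,
}

for name, width in candidates.items():
    print(f"{name}: {dense_params(width, 1604):,} parameters")
```

The count scales linearly with the input width, so any reduction applied before the 1604-unit layer carries over directly to its memory use.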

3 answers

Answer 1: score 6, accepted, 2020-01-09 02:46:58
Answer 2: score 5, 2020-01-09 01:59:44
Answer 3: score 0, 2020-01-09 02:43:34
