
Out of Memory When Performing Gradient Accumulation in Tensorflow

I am trying to implement gradient accumulation for a twitter sentiment analysis model using HuggingFace's BERT model. However, when I go to implement gradient accumulation with a batch size of 64, I get the dreaded "OOM" error. Oddly enough, when I run the same model with a batch size of 64 and without gradient accumulation, it trains right through. Does anyone know why this is and/or if my code is wrong?


import pandas as pd
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

batch_size = 32
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
vocabulary = tokenizer.get_vocab()
# NOTE: the original post never shows the model definition; a TF BERT
# sequence-classification head is assumed here so the snippet runs end to end
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = tf.keras.optimizers.Adam() 
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# data preprocessing
tweets_pos = pd.read_csv('C:/1_Tweets.csv', sep=',', names = ['Tweet', 'Sentiment'])
tweets_neg = pd.read_csv('C:/0_Tweets.csv', sep=',', names = ['Tweet',  'Sentiment'])
data = pd.concat([tweets_pos, tweets_neg], axis=0)
data = data.sample(frac=1) 
all_tweets = data['Tweet'].to_list()
all_sentiment = data['Sentiment'].to_list()
training_tweets = all_tweets[0:512]
training_labels = all_sentiment[0:512]

# create dataset
def create_dataset(tweets, labels):
  
  inputs_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []
  
  for i in range(len(tweets)):
    encoded = tokenizer.encode_plus(tweets[i], max_length = 512, pad_to_max_length=True, return_attention_mask=True, add_special_tokens=True)
    inputs_ids_list.append(encoded['input_ids'])
    token_type_ids_list.append(encoded['token_type_ids'])
    attention_mask_list.append(encoded['attention_mask'])
    label_list.append([labels[i]])

  ids_and_mask = {'input_ids':inputs_ids_list, 'token_type_ids':token_type_ids_list,'attention_mask':attention_mask_list}
  
  return tf.data.Dataset.from_tensor_slices((ids_and_mask, label_list))

# create dataset of batch_size = 32
train_dataset = create_dataset(training_tweets, training_labels).batch(batch_size)

# Accumulate Gradients
num_epochs = 1
for i in range(num_epochs):
  print(f'Epoch: {i + 1}')
  total_loss = 0

  # get trainable variables
  train_vars = model.trainable_variables
  accum_gradient = [tf.zeros_like(this_var) for this_var in train_vars]

  for (batch, (tweets, labels)) in enumerate(train_dataset):
      labels = tf.dtypes.cast(labels, tf.float32)
      with tf.GradientTape() as tape:
          prediction = model(tweets, training=True)
          prediction = tf.dtypes.cast(prediction, tf.float32)
          loss_value = loss(y_true=labels, y_pred=prediction)
      total_loss += loss_value

      # get gradients of this tape
      gradients = tape.gradient(loss_value, train_vars)
      # Accumulate the gradients
      accum_gradient = [(acum_grad+grad) for acum_grad, grad in zip(accum_gradient, gradients)]


  # average gradients and apply the optimization step
  accum_gradient = [this_grad/batch_size for this_grad in accum_gradient]
  optimizer.apply_gradients(zip(accum_gradient,train_vars))
      
  epoch_loss = total_loss / batch_size
  print(f'Epoch loss: {epoch_loss}')

I know I'm a bit late to this, but you have already answered your own question.

When accumulating gradients, TensorFlow holds the graph in memory so that it can compute the gradients correctly. In other words, while you are accumulating, every forward pass you have already run is still sitting in memory. When you are not accumulating and use a batch size of 64, TensorFlow flushes the graph after backpropagating through it.

That is why you can train with a batch size of 64 but cannot accumulate with 64. I don't know why you are trying to accumulate it, but if you do need accumulation, consider downsizing your batch size a little, as in the sketch below.
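As a minimal sketch of that suggestion (not part of the original answer): accumulate over small micro-batches and apply the update every few steps, so that only the running gradient sum is kept between steps. It reuses model, loss, optimizer, and create_dataset from the question; micro_batch_size = 8 and accum_steps = 8 are illustrative values chosen so that 8 x 8 gives the effective batch of 64.

# illustrative sketch: accumulate gradients over small micro-batches and
# apply them every accum_steps steps instead of once per epoch
micro_batch_size = 8   # assumption: small enough to fit in memory
accum_steps = 8        # 8 micro-batches of 8 -> effective batch size of 64

micro_dataset = create_dataset(training_tweets, training_labels).batch(micro_batch_size)

train_vars = model.trainable_variables
accum_gradient = [tf.zeros_like(v) for v in train_vars]

for step, (tweets, labels) in enumerate(micro_dataset):
    with tf.GradientTape() as tape:
        logits = model(tweets, training=True)[0]   # index 0 = classification logits
        loss_value = loss(y_true=labels, y_pred=logits)

    gradients = tape.gradient(loss_value, train_vars)
    accum_gradient = [a + g for a, g in zip(accum_gradient, gradients)]

    # apply the averaged gradients once per effective batch,
    # then reset the accumulator so old gradients are not kept around
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip([g / accum_steps for g in accum_gradient], train_vars))
        accum_gradient = [tf.zeros_like(v) for v in train_vars]

The only state this sketch carries from one step to the next is the list of accumulated gradient tensors, which has the same size as the model's variables.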
