What is tape-based autograd in Pytorch?

I understand that autograd refers to automatic differentiation. But what exactly is tape-based autograd in PyTorch, and why are there so many discussions that affirm or deny it?

For example:

this

In pytorch, there is no traditional sense of tape

and this

We don't really build gradient tapes per se. But graphs.

but not this

Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation.

And for further reference, please compare it with GradientTape in TensorFlow.

There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, and hybrids (more explanation). The tape-based autograd in PyTorch simply refers to the use of reverse-mode automatic differentiation (source). Reverse-mode auto diff is simply a technique used to compute gradients efficiently, and it happens to be the technique used by backpropagation (source).
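
For intuition, here is a minimal sketch (the tensor x and the function are made up for illustration) of what reverse-mode AD buys you in PyTorch: a single backward pass yields the partial derivatives of a scalar output with respect to all inputs at once, which is exactly the access pattern backpropagation needs.

import torch

x = torch.randn(3, requires_grad=True)   # three inputs
y = (x ** 2).sum()                        # scalar output f(x) = sum(x_i ** 2)

y.backward()                              # one reverse-mode sweep
print(x.grad)                             # all three partials at once, equal to 2 * x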


Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system for automatic differentiation: in the forward phase, the autograd tape remembers all the operations it executed, and in the backward phase it replays those operations.
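
As a minimal sketch of this remember-then-replay behaviour (the tensors a and b are made up for illustration): every intermediate result carries a grad_fn node recorded during the forward phase, and backward() walks those nodes in reverse.

import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b              # recorded during the forward phase
d = c + a              # recorded during the forward phase
print(d.grad_fn)       # e.g. <AddBackward0 ...>, the last recorded node

d.backward()           # replay the recorded operations in reverse
print(a.grad, b.grad)  # dd/da = b + 1 = 4.0, dd/db = a = 2.0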

It is the same in TensorFlow: to differentiate automatically, it also needs to remember what operations happened, and in what order, during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients. TensorFlow provides the tf.GradientTape API for automatic differentiation, that is, for computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records the relevant operations executed inside the context of a tf.GradientTape onto a tape, and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.
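
And the same computation on the TensorFlow side, as a minimal sketch (the variables a and b are made up for illustration): the operations are recorded only inside the tf.GradientTape context, and tape.gradient then walks that record in reverse.

import tensorflow as tf

a = tf.Variable(2.0)
b = tf.Variable(3.0)

with tf.GradientTape() as tape:
    c = a * b          # recorded onto the tape
    d = c + a          # recorded onto the tape

da, db = tape.gradient(d, [a, b])   # reverse-mode sweep over the tape
print(da.numpy(), db.numpy())       # 4.0, 2.0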

So, as we can see from this high-level viewpoint, both are doing the same thing. However, in a custom training loop, the forward pass and the loss calculation are more explicit in TensorFlow, because they happen inside the tf.GradientTape scope, whereas in PyTorch these operations are implicit; PyTorch instead requires gradient tracking to be disabled temporarily while the training parameters (weights and biases) are updated, which is done explicitly with the torch.no_grad context manager. In other words, TensorFlow's tf.GradientTape() plays a role comparable to PyTorch's loss.backward(). Below is a simplified form of the above statements in code.

# TensorFlow
import tensorflow as tf

# tf_model, squared_error, x, y, epochs and learning_rate are assumed defined elsewhere
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
  with tf.GradientTape() as tape:
    # forward passing and loss calculations 
    # within explicit tape scope 
    predictions = tf_model(x)
    loss = squared_error(predictions, y)

  # compute gradients (grad)
  w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)

  # update training variables 
  w.assign(w - w_grad * learning_rate)
  b.assign(b - b_grad * learning_rate)


# PyTorch
import torch

# torch_model, squared_error, inputs, labels, epochs and learning_rate are assumed defined elsewhere
[w, b] = torch_model.parameters()
for epoch in range(epochs):
  # forward pass and loss calculation 
  # implicit tape-based AD 
  y_pred = torch_model(inputs)
  loss = squared_error(y_pred, labels)

  # compute gradients (grad)
  loss.backward()
  
  # update training variables / parameters  
  with torch.no_grad():
    w -= w.grad * learning_rate
    b -= b.grad * learning_rate
    w.grad.zero_()
    b.grad.zero_()

FYI, in the above, the trainable variables (w, b) are updated manually in both frameworks, but we would generally use an optimizer (e.g. Adam) to do the job, as in the snippets below.

# TensorFlow 
# ....
# update training variables 
optimizer.apply_gradients(zip([w_grad, b_grad], tf_model.trainable_variables))

# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
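
For completeness, here is a hedged sketch of how the PyTorch loop above looks once an optimizer takes over the manual updates (torch_model, squared_error, inputs, labels, epochs and learning_rate are the same placeholders as in the earlier snippet):

import torch

optimizer = torch.optim.SGD(torch_model.parameters(), lr=learning_rate)

for epoch in range(epochs):
    y_pred = torch_model(inputs)
    loss = squared_error(y_pred, labels)

    optimizer.zero_grad()   # clear gradients accumulated in the previous step
    loss.backward()         # reverse-mode AD over the recorded graph
    optimizer.step()        # update parameters from the .grad buffers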

I suspect this comes from two different uses of the word 'tape' in the context of automatic differentiation.

When people say that it is not tape-based, they mean that it uses Operator Overloading, as opposed to [tape-based] Source Transformation, for automatic differentiation (a toy sketch of such an operator-overloading tape follows the quoted passage below).

[Operator overloading] relies on a language's ability to redefine the meaning of functions and operators. All primitives are overloaded so that they additionally perform a tracing operation: The primitive is logged onto a 'tape', along with its inputs to ensure that those intermediate variables are kept alive. At the end of the function's execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. [...]
OO is the technique used by PyTorch, Autograd, and Chainer [37]. OO 是 PyTorch、Autograd 和 Chainer [37] 使用的技术。

...

Tape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global stack also called a 'tape'² to ensure that intermediate variables are kept alive. The original (primal) function is augmented so that it writes intermediate variables to the tape during the forward pass, and the adjoint program will read intermediate variables from the tape during the backward pass. More recently, tape-based ST was implemented for Python in the ML framework Tangent [38].

...

² The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.
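
To make the operator-overloading meaning of 'tape' concrete, here is a toy sketch (all class and function names are invented for illustration and are not PyTorch internals): each overloaded primitive logs an adjoint rule onto a shared tape during the forward pass, and the derivative is obtained by walking that tape in reverse.

# Toy operator-overloading tape; not how PyTorch is actually implemented.
class Var:
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape               # shared list of recorded adjoint rules

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)

        def backprop():                # adjoint rule for multiplication
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        self.tape.append(backprop)     # log the primitive onto the tape
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)

        def backprop():                # adjoint rule for addition
            self.grad += out.grad
            other.grad += out.grad

        self.tape.append(backprop)
        return out

    def backward(self):
        self.grad = 1.0
        for rule in reversed(self.tape):   # walk the tape in reverse
            rule()


tape = []
a, b = Var(2.0, tape), Var(3.0, tape)
d = a * b + a          # forward pass: primitives are logged as they execute
d.backward()
print(a.grad, b.grad)  # 4.0, 2.0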
