Computing gradients for every individual sample in a batch in PyTorch
I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, then clip each of the L gradients separately, average them together, and finally perform a (noisy) gradient descent step.
What is the best way to do this in PyTorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x # inputs with batch size L
y # true labels
y_output = model(x)
loss = loss_func(y_output, y) # vector of length L
loss.backward() # stores L distinct gradients in each param.grad, magically
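For concreteness, here is how the setup I have in mind would look with a hypothetical linear model and a per-sample (reduction='none') loss. As written, backward() on a non-scalar loss needs an explicit gradient argument, and autograd then sums the per-sample contributions into each param.grad rather than keeping L separate copies, which is exactly the behaviour I'm trying to avoid:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                            # hypothetical model
loss_func = nn.CrossEntropyLoss(reduction='none')   # keep per-sample losses
x = torch.randn(4, 10)                              # batch of L = 4 inputs
y = torch.randint(0, 2, (4,))                       # true labels

loss = loss_func(model(x), y)                       # vector of length L
loss.backward(torch.ones_like(loss))                # same as loss.sum().backward()
print(model.weight.grad.shape)                      # one summed gradient, not L of them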
Failing that, a way to compute each gradient separately and then clip its norm before accumulating would also work. But the following code
x # inputs with batch size L
y # true labels
y_output = model(x)
loss = loss_func(y_output, y) # vector of length L
for i in range(loss.size()[0]):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm(model.parameters(), clip_size)
accumulates the i-th gradient into param.grad and only then clips, rather than clipping each per-sample gradient before accumulating it. What's the best way to get around this issue?
I don't think you can do much better than the second method in terms of computational efficiency; you're losing the benefits of batching in your backward pass, and that's a fact. Regarding the order of clipping, autograd stores the gradients in the .grad of the parameter tensors. A crude solution would be to add a dictionary like
clipped_grads = {name: torch.zeros_like(param) for name, param in net.named_parameters()}
Run your for loop like
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    # clip this sample's gradients, then add the clipped copy to the running average
    torch.nn.utils.clip_grad_norm_(net.parameters(), clip_size)
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad / loss.size(0)
    net.zero_grad()

for name, param in net.named_parameters():
    param.grad = clipped_grads[name]

optimizer.step()
where I omitted much of the detach, requires_grad=False and similar business which may be necessary to make it behave as expected.
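The question also asks for a noisy step at the end. A hedged sketch of that last part, assuming a Gaussian mechanism with a placeholder noise scale (the real value depends on clip_size, the batch size, and whatever privacy accounting you use), would add the noise in the final loop above before calling optimizer.step():

noise_std = 0.5  # placeholder noise scale, not a calibrated DP value
for name, param in net.named_parameters():
    # averaged clipped gradient plus Gaussian noise, scaled down by the batch size
    param.grad = clipped_grads[name] + noise_std * torch.randn_like(param) / loss.size(0)
optimizer.step()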
The disadvantage of this approach is that you end up storing 2x the memory for your parameter gradients. In principle you could take the "raw" per-sample gradient, clip it, add it to clipped_grads, and then discard it as soon as no downstream operation needs it, whereas here you retain the raw values in .grad until the end of the backward pass. It may be that register_backward_hook allows you to do that if you go against the guidelines and actually modify the grad_input, but you would have to verify with someone more intimately acquainted with autograd.
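As a rough alternative to register_backward_hook, per-parameter hooks via torch.Tensor.register_hook can also clip a gradient before it is accumulated into .grad, which avoids keeping a second copy. One caveat: each hook only sees a single parameter's gradient, so the sketch below clips every parameter's norm independently, which is not the same as the global-norm clipping performed by clip_grad_norm_. The model and clip bound are placeholders.

import torch
import torch.nn as nn

net = nn.Linear(10, 2)   # hypothetical model
clip_size = 1.0          # hypothetical per-parameter clip bound

def make_clip_hook(max_norm):
    def hook(grad):
        # scale this parameter's gradient down if its norm exceeds max_norm
        norm = grad.norm()
        if norm > max_norm:
            return grad * (max_norm / norm)
        return grad
    return hook

for p in net.parameters():
    p.register_hook(make_clip_hook(clip_size))

x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
loss = nn.CrossEntropyLoss(reduction='none')(net(x), y)

# each backward call now accumulates an already-clipped per-sample gradient
# directly into param.grad, so no clipped_grads dictionary is needed
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)

for p in net.parameters():
    p.grad /= loss.size(0)   # average the clipped per-sample gradients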