Computing gradients for every individual sample in a batch in PyTorch
I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, then clip each of the L gradients separately, average them together, and finally perform a (noisy) gradient descent step.
What is the best way to do this in PyTorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x # inputs with batch size L
y # true labels
y_output = model(x)
loss = loss_func(y_output, y) # vector of length L
loss.backward() # stores L distinct gradients in each param.grad, magically
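For concreteness, here is how the setup I have in mind would look with a hypothetical linear model and a per-sample (reduction='none') loss. As written, backward() on a non-scalar loss needs an explicit gradient argument, and autograd then sums the per-sample contributions into each param.grad rather than keeping L separate copies, which is exactly the behaviour I'm trying to avoid:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                            # hypothetical model
loss_func = nn.CrossEntropyLoss(reduction='none')   # keep per-sample losses
x = torch.randn(4, 10)                              # batch of L = 4 inputs
y = torch.randint(0, 2, (4,))                       # true labels

loss = loss_func(model(x), y)                       # vector of length L
loss.backward(torch.ones_like(loss))                # same as loss.sum().backward()
print(model.weight.grad.shape)                      # one summed gradient, not L of them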
Failing that, a way to compute each gradient separately and then clip its norm before accumulating would also work. But the following code
x # inputs with batch size L
y # true labels
y_output = model(x)
loss = loss_func(y_output, y) # vector of length L
for i in range(loss.size()[0]):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm(model.parameters(), clip_size)
accumulates the i-th gradient into param.grad and only then clips, rather than clipping each per-sample gradient before accumulating it. What's the best way to get around this issue?
I don't think you can do much better than the second method in terms of computational efficiency; you're losing the benefits of batching in your backward pass, and that's a fact. Regarding the order of clipping, autograd stores the gradients in the .grad of the parameter tensors. A crude solution would be to add a dictionary like
clipped_grads = {name: torch.zeros_like(param) for name, param in net.named_parameters()}
Run your for loop like
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    # clip this sample's gradients, then add the clipped copy to the running average
    torch.nn.utils.clip_grad_norm_(net.parameters(), clip_size)
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad / loss.size(0)
    net.zero_grad()

for name, param in net.named_parameters():
    param.grad = clipped_grads[name]

optimizer.step()
where I omitted much of the detach, requires_grad=False and similar business which may be necessary to make it behave as expected.
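The question also asks for a noisy step at the end. A hedged sketch of that last part, assuming a Gaussian mechanism with a placeholder noise scale (the real value depends on clip_size, the batch size, and whatever privacy accounting you use), would add the noise in the final loop above before calling optimizer.step():

noise_std = 0.5  # placeholder noise scale, not a calibrated DP value
for name, param in net.named_parameters():
    # averaged clipped gradient plus Gaussian noise, scaled down by the batch size
    param.grad = clipped_grads[name] + noise_std * torch.randn_like(param) / loss.size(0)
optimizer.step()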
The disadvantage of this approach is that you end up storing 2x the memory for your parameter gradients. In principle you could take the "raw" per-sample gradient, clip it, add it to clipped_grads, and then discard it as soon as no downstream operation needs it, whereas here you retain the raw values in .grad until the end of the backward pass. It may be that register_backward_hook allows you to do that if you go against the guidelines and actually modify the grad_input, but you would have to verify with someone more intimately acquainted with autograd.
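As a rough alternative to register_backward_hook, per-parameter hooks via torch.Tensor.register_hook can also clip a gradient before it is accumulated into .grad, which avoids keeping a second copy. One caveat: each hook only sees a single parameter's gradient, so the sketch below clips every parameter's norm independently, which is not the same as the global-norm clipping performed by clip_grad_norm_. The model and clip bound are placeholders.

import torch
import torch.nn as nn

net = nn.Linear(10, 2)   # hypothetical model
clip_size = 1.0          # hypothetical per-parameter clip bound

def make_clip_hook(max_norm):
    def hook(grad):
        # scale this parameter's gradient down if its norm exceeds max_norm
        norm = grad.norm()
        if norm > max_norm:
            return grad * (max_norm / norm)
        return grad
    return hook

for p in net.parameters():
    p.register_hook(make_clip_hook(clip_size))

x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
loss = nn.CrossEntropyLoss(reduction='none')(net(x), y)

# each backward call now accumulates an already-clipped per-sample gradient
# directly into param.grad, so no clipped_grads dictionary is needed
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)

for p in net.parameters():
    p.grad /= loss.size(0)   # average the clipped per-sample gradients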