
Computing gradients for every individual sample in a batch in PyTorch

I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:

Compute the gradient with respect to each point in the batch of size L, then clip each of the L gradients separately, average them together, and finally perform a (noisy) gradient descent step.
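For concreteness, here is a minimal sketch of that clip/average/noise step, assuming each per-sample gradient has already been flattened into a single 1-D tensor; clip_norm and noise_std are hypothetical hyperparameters, not values taken from the linked paper:

import torch

def dp_average(per_sample_grads, clip_norm, noise_std):
    # per_sample_grads: list of L flattened per-sample gradient vectors
    clipped = []
    for g in per_sample_grads:
        # scale each gradient down so its norm is at most clip_norm
        scale = min(1.0, clip_norm / (g.norm().item() + 1e-6))
        clipped.append(g * scale)
    avg = torch.stack(clipped).mean(dim=0)
    # add Gaussian noise before taking the descent step
    return avg + noise_std * torch.randn_like(avg)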

What is the best way to do this in PyTorch?

Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:

x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L (loss_func must use reduction='none')
loss.backward()  # stores L distinct gradients in each param.grad, magically

Failing that, I could compute each gradient separately and clip its norm before accumulating it, but the following

x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_size)

accumulates the i-th gradient and only then clips, rather than clipping before accumulating it into the gradient. What's the best way to get around this issue?
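To make the issue concrete, this is what the loop above does for a hypothetical two-sample batch, writing g0 and g1 for the per-sample gradients of some parameter:

loss[0].backward(retain_graph=True)                            # param.grad == g0
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_size)  # param.grad == clip(g0)
loss[1].backward(retain_graph=True)                            # param.grad == clip(g0) + g1 (gradients accumulate)
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_size)  # clips the accumulated sum, not g1 alone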

I don't think you can do much better than the second method in terms of computational efficiency; you're losing the benefits of batching in your backward pass, and that's a fact. Regarding the order of clipping: autograd stores the gradients in the .grad attribute of parameter tensors. A crude solution would be to add a dictionary like

clipped_grads = {name: torch.zeros_like(param) for name, param in net.named_parameters()}

then run your for loop like this:

for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    # at this point .grad holds only the i-th sample's gradient, so clip it now
    torch.nn.utils.clip_grad_norm_(net.parameters(), clip_size)
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad / loss.size(0)
    net.zero_grad()

for name, param in net.named_parameters():
    param.grad = clipped_grads[name]

optimizer.step()

where I have omitted much of the detach, requires_grad=False and similar bookkeeping which may be necessary to make it behave as expected.
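For example, one guess at that kind of bookkeeping is to accumulate detached copies under no_grad, so the running sums can never end up in an autograd graph:

with torch.no_grad():
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad.detach() / loss.size(0)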

The disadvantage of the above is that you end up storing 2x the memory for your parameter gradients. In principle you could take the "raw" gradient, clip it, add it to clipped_gradient, and then discard it as soon as no downstream operation needs it, whereas here you retain the raw values in grad until the end of the backward pass. It may be that register_backward_hook allows you to do that if you go against the guidelines and actually modify grad_input, but you would have to verify with someone more intimately acquainted with autograd.
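As a rough illustration of that idea, and only under the assumption that clipping each parameter's gradient norm individually is acceptable (which is not equivalent to clip_grad_norm_ over all parameters jointly), a per-parameter Tensor.register_hook could clip and accumulate each per-sample gradient the moment it is produced, replacing the clipping and accumulation inside the loop above:

def make_hook(name, batch_size):
    def hook(grad):
        # clip this parameter's per-sample gradient and accumulate it right away
        scale = min(1.0, clip_size / (grad.norm().item() + 1e-6))
        clipped_grads[name] += grad * scale / batch_size
        # return zeros so param.grad never stores the raw gradient
        return torch.zeros_like(grad)
    return hook

for name, param in net.named_parameters():
    param.register_hook(make_hook(name, loss.size(0)))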

This package calculates per-sample gradients in parallel. The memory needed is still batch_size times that of standard stochastic gradient descent, but due to parallelization it can run much faster.
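For reference, recent PyTorch releases also ship torch.func, which can compute per-sample gradients in a single vectorized call. A minimal sketch (not necessarily the package referred to above), reusing model, loss_func, x and y from the question:

import torch
from torch.func import functional_call, grad, vmap

params = {name: p.detach() for name, p in model.named_parameters()}
buffers = {name: b.detach() for name, b in model.named_buffers()}

def sample_loss(params, buffers, sample, target):
    # evaluate the model on a single sample (add a batch dimension of 1)
    out = functional_call(model, (params, buffers), (sample.unsqueeze(0),))
    return loss_func(out, target.unsqueeze(0)).sum()

# vmap over the batch dimension of x and y; the result is a dict mapping
# parameter names to tensors whose leading dimension indexes the samples
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, None, 0, 0))(params, buffers, x, y)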
