[英]L1/L2 regularization in PyTorch
How do I add L1/L2 regularization in PyTorch without manually computing it?如何在不手动计算的情况下在 PyTorch 中添加 L1/L2 正则化?
使用weight_decay > 0
进行 L2 正则化:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
See the documentation .请参阅文档。 Add a
weight_decay
parameter to the optimizer for L2 regularization.为 L2 正则化的优化器添加
weight_decay
参数。
Previous answers, while technically correct, are inefficient performance wise and are not too modular (hard to apply on a per-layer basis, as provided by, say, keras
layers).以前的答案虽然在技术上是正确的,但在性能方面效率低下并且不是太模块化(很难在每层的基础上应用,例如
keras
层提供的)。
Why PyTorch implemented L2
inside torch.optim.Optimizer
instances?为什么 PyTorch 在
torch.optim.Optimizer
实例中实现L2
?
Let's take a look at torch.optim.SGD
source code (currently as functional optimization procedure), especially this part:我们看一下
torch.optim.SGD
源码(目前为功能优化程序),特别是这部分:
for i, param in enumerate(params):
d_p = d_p_list[i]
# L2 weight decay specified HERE!
if weight_decay != 0:
d_p = d_p.add(param, alpha=weight_decay)
d_p
(derivative of parameter, gradient) is modified and re-assigned for faster computation (not saving the temporary variables)d_p
(参数的导数,梯度)被修改并重新分配以加快计算速度(不保存临时变量)O(N)
complexity without any complicated math like pow
O(N)
复杂度,没有任何复杂的数学,例如pow
autograd
extending the graph without any needautograd
扩展图形而无需任何需要Compare that to O(n)
**2
operations, addition and also taking part in backpropagation.将其与
O(n)
**2
操作、加法以及参与反向传播进行比较。
Let's see L2
equation with alpha
regularization factor (same could be done for L1 ofc):让我们看看具有
alpha
正则化因子的L2
方程(L1 ofc 也可以这样做):
If we take derivative of any loss with L2
regularization wrt parameters w
(it is independent of loss), we get:如果我们用
L2
正则化 wrt 参数w
对任何损失求导(它与损失无关),我们得到:
So it is simply an addition of alpha * weight
for gradient of every weight!所以它只是为每个权重的梯度添加了
alpha * weight
weight! And this is exactly what PyTorch does above!这正是 PyTorch 在上面所做的!
Using this (and some PyTorch magic), we can come up with quite generic L1 regularization layer, but let's look at first derivative of L1
first ( sgn
is signum function, returning 1
for positive input and -1
for negative, 0
for 0
):使用这个(和一些 PyTorch 魔法),我们可以提出非常通用的 L1 正则化层,但让我们先看看
L1
的一阶导数( sgn
是符号函数,返回1
表示正输入,返回-1
表示负, 0
表示0
) :
Full code with WeightDecay
interface located in torchlayers third party library providing stuff like regularizing only weights/biases/specifically named paramters ( disclaimer: I'm the author ), but the essence of the idea outlined below (see comments):带有
WeightDecay
接口的完整代码位于torchlayers 第三方库中,提供了诸如仅对权重/偏差/特定命名的参数进行正则化等内容(免责声明:我是作者),但下面概述的想法的本质(见评论):
class L1(torch.nn.Module):
def __init__(self, module, weight_decay):
super().__init__()
self.module = module
self.weight_decay = weight_decay
# Backward hook is registered on the specified module
self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)
# Not dependent on backprop incoming values, placeholder
def _weight_decay_hook(self, *_):
for param in self.module.parameters():
# If there is no gradient or it was zeroed out
# Zeroed out using optimizer.zero_grad() usually
# Turn on if needed with grad accumulation/more safer way
# if param.grad is None or torch.all(param.grad == 0.0):
# Apply regularization on it
param.grad = self.regularize(param)
def regularize(self, parameter):
# L1 regularization formula
return self.weight_decay * torch.sign(parameter.data)
def forward(self, *args, **kwargs):
# Simply forward and args and kwargs to module
return self.module(*args, **kwargs)
Read more about hooks in this answer or respective PyTorch docs if needed.如果需要,请在此答案或相应的 PyTorch 文档中阅读有关钩子的更多信息。
And usage is also pretty simple (should work with gradient accumulation and and PyTorch layers):而且用法也很简单(应该使用梯度累积和 PyTorch 层):
layer = L1(torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3))
Also, as a side note, L1
regularization is not implemented as it does not actually induce sparsity (lost citation, it was some GitHub issue on PyTorch repo I think, if anyone has it, please edit) as understood by weights being equal to zero.此外,作为旁注,
L1
正则化没有实现,因为它实际上并没有引起稀疏性(丢失引用,我认为这是 PyTorch 存储库上的一些 GitHub 问题,如果有人有它,请编辑),正如权重等于零所理解的那样.
More often, weight values are thresholded (simply assigning zero value to them) if they reach some small predefined magnitude (say 0.001
)更常见的是,如果权重值达到某个小的预定义量级(例如
0.001
),则会对其进行阈值处理(只需为其分配零值)
For L2 regularization,对于 L2 正则化,
l2_lambda = 0.01
l2_reg = torch.tensor(0.)
for param in model.parameters():
l2_reg += torch.norm(param)
loss += l2_lambda * l2_reg
References:参考:
Yes, pytorch optimizers have a parameter called weight_decay
which corresponds to the L2 regularization factor:是的,pytorch优化器有一个名为
weight_decay
的参数,它对应于 L2 正则化因子:
sgd = torch.optim.SGD(model.parameters(), weight_decay=weight_decay)
There is no analogous argument for L1, however this is straightforward to implement manually: L1 没有类似的论点,但是这很容易手动实现:
loss = loss_fn(outputs, labels)
l1_lambda = 0.001
l1_norm = sum(torch.linalg.norm(p, 1) for p in model.parameters())
loss = loss + l1_lambda * l1_norm
The equivalent manual implementation of L2 would be: L2 的等效手动实现将是:
l2_norm = sum(torch.linalg.norm(p, 2) for p in model.parameters())
Source: Deep Learning with PyTorch (8.5.2)资料来源: 使用 PyTorch 进行深度学习(8.5.2)
for L1 regularization and inclulde weight
only:对于 L1 正则化和仅包括
weight
:
L1_reg = torch.tensor(0., requires_grad=True)
for name, param in model.named_parameters():
if 'weight' in name:
L1_reg = L1_reg + torch.norm(param, 1)
total_loss = total_loss + 10e-4 * L1_reg
Interesting torch.norm
is slower on CPU and faster on GPU vs. direct approach.有趣的
torch.norm
在 CPU 上较慢,在 GPU 上较直接方法更快。
import torch
x = torch.randn(1024,100)
y = torch.randn(1024,100)
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
Out:出去:
1000 loops, best of 3: 910 µs per loop
1000 loops, best of 3: 1.76 ms per loop
On the other hand:另一方面:
import torch
x = torch.randn(1024,100).cuda()
y = torch.randn(1024,100).cuda()
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
Out:出去:
10000 loops, best of 3: 50 µs per loop
10000 loops, best of 3: 26 µs per loop
To extend on the good answers: As it was said, L2 norm added to the loss is equivalent to weight decay iff you use plain SGD without momentum.扩展好的答案:正如所说,如果您使用没有动量的普通 SGD,则添加到损失中的 L2 范数相当于权重衰减。 Otherwise, eg with Adam, it is not exactly the same.
否则,例如对于 Adam,它并不完全相同。 The AdamW paper [1] pointed out that weight decay is actually more stable.
AdamW 论文 [1] 指出权重衰减实际上更稳定。 That is why you should use weight decay, which is an option to the optimizer.
这就是为什么你应该使用权重衰减,这是优化器的一个选项。 And consider using
AdamW
instead of Adam
.并考虑使用
AdamW
而不是Adam
。
Also note, you probably don't want weight decay on all parameters ( model.parameters()
), but only on a subset.另请注意,您可能不希望所有参数 (
model.parameters()
) 的权重衰减,而只是在一个子集上。 See here for examples:有关示例,请参见此处:
[1] Decoupled Weight Decay Regularization (AdamW), 2017 [1]解耦权重衰减正则化 (AdamW),2017
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.