
Weight Normalization in PyTorch

An important weight normalization technique was introduced in this paper and has long been available in PyTorch as follows:

    from torch.nn.utils import weight_norm
    weight_norm(nn.Conv2d(in_channels, out_channels, kernel_size))

From the docs, I understand that weight_norm re-parametrizes the weight before each forward() pass. But I am not sure whether this re-parameterization also happens during inference, when everything is running inside with torch.no_grad() and the model is set to eval() mode.
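For reference, here is a rough sketch (a toy Conv2d, assuming the default name='weight') of what weight_norm attaches to a module; the weight is replaced by weight_g and weight_v plus a forward pre-hook that rebuilds it:

    import torch
    import torch.nn as nn
    from torch.nn.utils import weight_norm

    m = weight_norm(nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3))

    # 'weight' is replaced by the parameters 'weight_g' and 'weight_v';
    # a forward pre-hook recomputes 'weight' from them before each forward call.
    print([name for name, _ in m.named_parameters()])  # e.g. ['bias', 'weight_g', 'weight_v']
    print(m._forward_pre_hooks)                         # contains a WeightNorm hook object

    with torch.no_grad():
        m.eval()
        _ = m(torch.rand(1, 3, 7, 7))  # the pre-hook runs here as well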

Can someone please help me understand whether weight_norm is active only during training, or also during inference as described above?

Thank you

I tested the "no_gard", it works!我测试了“no_gard”,它有效!

For the "remove_weight_norm", I am still confused.对于“remove_weight_norm”,我仍然感到困惑。 I use WeightNorm(conv1d) a lot in my model.我在 model 中经常使用 WeightNorm(conv1d)。 To export the model, I use the following code, with or without "remove_weight_norm" funciton which call the function "nn.utils.remove_weight_norm" to all related.要导出 model,我使用以下代码,有或没有“remove_weight_norm”功能,它调用 function“nn.utils.remove_weight_norm”到所有相关的。

model.load_state_dict(checkpoint)
model = model.eval()
model.remove_weight_norm()  # with and without this line
remove_hooks(model)
scripted_module = torch.jit.script(model)
torch.jit.save(scripted_module, 'model.pt')
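For what it's worth, here is a small sketch (using a toy Conv1d rather than the actual model) of what remove_weight_norm changes: the weight_g/weight_v parameters are folded back into a single weight, and the outputs stay the same:

    import torch
    import torch.nn as nn

    conv = nn.utils.weight_norm(nn.Conv1d(4, 4, kernel_size=3))
    print(list(conv.state_dict().keys()))  # e.g. ['bias', 'weight_g', 'weight_v']

    x = torch.rand(1, 4, 10)
    with torch.no_grad():
        before = conv(x)
        nn.utils.remove_weight_norm(conv)  # folds g and v back into a plain 'weight'
        after = conv(x)

    print(list(conv.state_dict().keys()))  # e.g. ['bias', 'weight']
    print(torch.allclose(before, after))   # True: the outputs are unchanged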

Then I tested the two models using C++ code with libtorch. But the results are not the same.

I am wondering what weight_norm does during inference. Is it useful?

I have finally figured out the problem.

Batch normalization tracks running statistics (the mean and variance of the activations) during training and uses those stored statistics for inference. Thus it is necessary to change its behaviour with eval() to tell it not to update them any further and to use them instead of the batch statistics.
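As a quick illustration (a toy sketch, not the model from the question), BatchNorm really does produce different outputs depending on the mode, which is exactly why eval() matters there:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(3)
    x = torch.rand(2, 3, 4, 4)

    bn.train()
    out_train = bn(x)       # normalized with batch statistics; running stats are updated

    bn.eval()
    with torch.no_grad():
        out_eval = bn(x)    # normalized with the stored running statistics

    print(torch.allclose(out_train, out_eval))  # False in general: the mode matters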

I then carefully checked the weight normalization paper and found it to be 'inherently deterministic'. It simply decouples the original weight vector into the product of two quantities, as shown below.

w = g * v / ||v||

Obviously, it does not matter whether you use the LHS or the RHS to compute the output. However, by decoupling the weight into these two quantities, passing them to the optimizer, and deleting the original w parameter, better training is achieved. For the reasons, refer to the paper, where this is nicely described.
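As a rough numeric check of this decomposition (my own sketch, assuming the default dim=0 used for conv layers), the stored weight_g and weight_v reproduce the effective weight exactly:

    import torch
    import torch.nn as nn
    from torch.nn.utils import weight_norm

    conv = weight_norm(nn.Conv2d(2, 4, kernel_size=3))

    g = conv.weight_g  # magnitude, shape (out_channels, 1, 1, 1)
    v = conv.weight_v  # direction, same shape as the original weight
    v_norm = v.flatten(1).norm(dim=1).view(-1, 1, 1, 1)  # per-output-channel norm of v

    print(torch.allclose(g * v / v_norm, conv.weight))  # True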

Thus it does not matter whether weight normalization is removed or not during testing. To validate this, I tried the following small piece of code.

import torch
import torch.nn as nn
from torch.nn.utils import weight_norm as wn
from torch.nn.utils import remove_weight_norm as wnr

# define the model 'm'
m = wn(nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=True))

ip = torch.rand(1,1,5,5)
target = torch.rand(1,1,5,5)
l1 = torch.nn.L1Loss()
optimizer = torch.optim.Adam(m.parameters())



# begin training
for _ in range(5):
    optimizer.zero_grad()  # clear accumulated gradients from the previous step
    out = m(ip)
    loss = l1(out, target)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    m.eval()
    print('\no/p after training with wn: {}'.format(m(ip)))
    wnr(m)
    print('\no/p after training without wn: {}'.format(m(ip)))

# begin testing
m2 = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3,padding=1, bias=True)
m2.load_state_dict(m.state_dict())

with torch.no_grad():
    m2.eval()
    out = m2(ip)
    print('\nOutput during testing and without weight_norm: {}'.format(out))

And the output is below:

o/p after training with wn: 
tensor([[[[0.0509, 0.3286, 0.4612, 0.1795, 0.0307],
          [0.1846, 0.3931, 0.5713, 0.2909, 0.4026],
          [0.1716, 0.5971, 0.4297, 0.0845, 0.6172],
          [0.2938, 0.2389, 0.4478, 0.5828, 0.6276],
          [0.1423, 0.2065, 0.5024, 0.3979, 0.3127]]]])

o/p after training without wn: 
tensor([[[[0.0509, 0.3286, 0.4612, 0.1795, 0.0307],
          [0.1846, 0.3931, 0.5713, 0.2909, 0.4026],
          [0.1716, 0.5971, 0.4297, 0.0845, 0.6172],
          [0.2938, 0.2389, 0.4478, 0.5828, 0.6276],
          [0.1423, 0.2065, 0.5024, 0.3979, 0.3127]]]])

Output during testing and without weight_norm: 
tensor([[[[0.0509, 0.3286, 0.4612, 0.1795, 0.0307],
          [0.1846, 0.3931, 0.5713, 0.2909, 0.4026],
          [0.1716, 0.5971, 0.4297, 0.0845, 0.6172],
          [0.2938, 0.2389, 0.4478, 0.5828, 0.6276],
          [0.1423, 0.2065, 0.5024, 0.3979, 0.3127]]]])

Please note that all the values are exactly the same, since only a re-parameterization is happening.

Regarding,

Then I tested two models using C++ code with libtorch. But the results are not the same.

See https://github.com/pytorch/pytorch/issues/21275, which reports a bug with TorchScript.

And regarding,

I am wondering what weight_norm does during inference. Is it useful?

The answer is that it does nothing. Whether you compute x * 2 or x * (1+1) does not matter. It is not useful, but it is not harmful either. So it is better to remove it.
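If the model wraps many layers in weight_norm, a sketch along these lines (a hypothetical helper, assuming the default name='weight') removes it everywhere before export; remove_weight_norm raises a ValueError for modules that were never wrapped, so those are simply skipped:

    import torch.nn as nn
    from torch.nn.utils import remove_weight_norm

    def strip_weight_norm(model: nn.Module) -> nn.Module:
        # Remove weight_norm from every submodule that has it (hypothetical helper).
        for module in model.modules():
            try:
                remove_weight_norm(module)
            except ValueError:
                pass  # this submodule was never wrapped in weight_norm
        return model

After that, the state_dict contains plain weight tensors only, and the exported model no longer depends on the weight_norm pre-hook.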

It should be active. .eval() affects certain network layers (e.g. Dropout and BatchNorm layers); see the eval documentation.

.no_grad() reduces memory usage and speeds up computation during inference; see the no_grad documentation. I think weight_norm isn't affected by either of these.
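For example, a minimal sketch of the no_grad() effect (a toy Linear layer, not related to weight_norm itself):

    import torch
    import torch.nn as nn

    m = nn.Linear(4, 2)
    x = torch.rand(1, 4)

    y = m(x)
    print(y.requires_grad)      # True: autograd builds a graph for this output

    with torch.no_grad():
        y = m(x)
    print(y.requires_grad)      # False: no graph is built, which saves memory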

Greetings
