Reparametrization in tensorflow-probability: tf.GradientTape() doesn't calculate the gradient with respect to a distribution's mean
In tensorflow version 2.0.0-beta1, I am trying to implement a keras layer which has weights sampled from a normal random distribution. I would like to have the mean of the distribution as a trainable parameter.
Thanks to the "reparametrization trick" already implemented in tensorflow-probability, the calculation of the gradient with respect to the mean of the distribution should be possible in principle, if I am not mistaken.
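A sketch of what the trick amounts to for a unit-scale normal, as used in the layer in this question. The symbols $\mu$ (the trainable mean), $\epsilon_j$ (the noise), $w_j$ (the sampled weights), and $y$ (one output row) are notation introduced here for illustration, not names from the library:

```latex
\begin{aligned}
\epsilon_j &\sim \mathcal{N}(0, 1), \qquad w_j = \mu + \epsilon_j, \\
y &= \sum_j x_j \, w_j = \mu \sum_j x_j + \sum_j x_j \, \epsilon_j, \\
\frac{\partial y}{\partial \mu} &= \sum_j x_j .
\end{aligned}
```

Because the randomness enters only through $\epsilon_j$, which does not depend on $\mu$, the derivative with respect to $\mu$ is well defined for every draw.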
However, when I try to calculate the gradient of the network output with respect to the mean value variable using tf.GradientTape(), the returned gradient is None.
I created two minimal examples, one of a layer with deterministic weights and one of a layer with random weights. The gradients of the deterministic layer are calculated as expected, but the gradients are None in the case of the random layer. There is no error message giving details on why the gradient is None, and I am kind of stuck.
Minimal example code:
A: Here is the minimal example for the deterministic network:
import tensorflow as tf; print(tf.__version__)
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer,Input
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import RandomNormal
import tensorflow_probability as tfp
import numpy as np

# example data
x_data = np.random.rand(99,3).astype(np.float32)

# # A: DETERMINISTIC MODEL

# 1 Define Layer
class deterministic_test_layer(Layer):
    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(deterministic_test_layer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.kernel = self.add_weight(name='kernel',
                                      shape=(input_shape[1], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(deterministic_test_layer, self).build(input_shape)

    def call(self, x):
        return K.dot(x, self.kernel)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)

# 2 Create model and calculate gradient
x = Input(shape=(3,))
fx = deterministic_test_layer(1)(x)
deterministic_test_model = Model(name='test_deterministic',inputs=[x], outputs=[fx])

print('\n\n\nCalculating gradients for deterministic model: ')
for x_now in np.split(x_data,3):
    # print(x_now.shape)
    with tf.GradientTape() as tape:
        fx_now = deterministic_test_model(x_now)
    grads = tape.gradient(
        fx_now,
        deterministic_test_model.trainable_variables,
    )
    print('\n',grads,'\n')
print(deterministic_test_model.summary())
B: The following example is very similar, but instead of deterministic weights I tried to use randomly sampled weights (randomly sampled at call() time!) for the test layer:
# # B: RANDOM MODEL

# 1 Define Layer
class random_test_layer(Layer):
    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(random_test_layer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.mean_W = self.add_weight('mean_W',
                                      initializer=RandomNormal(mean=0.5,stddev=0.1),
                                      trainable=True)
        self.kernel_dist = tfp.distributions.MultivariateNormalDiag(loc=self.mean_W,scale_diag=(1.,))
        super(random_test_layer, self).build(input_shape)

    def call(self, x):
        sampled_kernel = self.kernel_dist.sample(sample_shape=x.shape[1])
        return K.dot(x, sampled_kernel)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)

# 2 Create model and calculate gradient
x = Input(shape=(3,))
fx = random_test_layer(1)(x)
random_test_model = Model(name='test_random',inputs=[x], outputs=[fx])

print('\n\n\nCalculating gradients for random model: ')
for x_now in np.split(x_data,3):
    # print(x_now.shape)
    with tf.GradientTape() as tape:
        fx_now = random_test_model(x_now)
    grads = tape.gradient(
        fx_now,
        random_test_model.trainable_variables,
    )
    print('\n',grads,'\n')
print(random_test_model.summary())
Expected/Actual Output:
A: The deterministic network works as expected, and the gradients are calculated. The output is:
2.0.0-beta1
Calculating gradients for deterministic model:
[<tf.Tensor: id=26, shape=(3, 1), dtype=float32, numpy=
array([[17.79845 ],
       [15.764006 ],
       [14.4183035]], dtype=float32)>]
[<tf.Tensor: id=34, shape=(3, 1), dtype=float32, numpy=
array([[16.22232 ],
       [17.09122 ],
       [16.195663]], dtype=float32)>]
[<tf.Tensor: id=42, shape=(3, 1), dtype=float32, numpy=
array([[16.382954],
       [16.074356],
       [17.718027]], dtype=float32)>]
Model: "test_deterministic"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 3)]               0
_________________________________________________________________
deterministic_test_layer (de (None, 1)                 3
=================================================================
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________
None
B: However, in the case of the similar random network, the gradients are not calculated as expected (using the reparametrization trick). Instead, they are None. The full output is:
Calculating gradients for random model:
[None]
[None]
[None]
Model: "test_random"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 3)]               0
_________________________________________________________________
random_test_layer (random_te (None, 1)                 1
=================================================================
Total params: 1
Trainable params: 1
Non-trainable params: 0
_________________________________________________________________
None
Can anybody point me at the problem here?
It seems that tfp.distributions.MultivariateNormalDiag is not differentiable with respect to its input parameters (e.g. loc). In this particular case, the following would be equivalent:
class random_test_layer(Layer):
    ...
    def build(self, input_shape):
        ...
        self.kernel_dist = tfp.distributions.MultivariateNormalDiag(loc=0, scale_diag=(1.,))
        super(random_test_layer, self).build(input_shape)

    def call(self, x):
        # sample zero-mean noise, then add the trainable mean outside the distribution
        sampled_kernel = self.kernel_dist.sample(sample_shape=x.shape[1]) + self.mean_W
        return K.dot(x, sampled_kernel)
In this case, however, the loss is differentiable with respect to self.mean_W.
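The claim can be checked without TensorFlow. The following is a minimal pure-Python sketch, not the library's implementation: layer_output, total, the toy inputs, and the step size h are hypothetical names and values introduced here. It holds the zero-mean noise fixed, adds the mean outside the sampler as in the layer above, and compares a finite-difference gradient with the analytic one.

```python
import random


def layer_output(x_rows, mean_w, eps):
    """Forward pass with the mean added outside the sampler: kernel = eps + mean_w."""
    kernel = [e + mean_w for e in eps]
    return [sum(xi * kj for xi, kj in zip(row, kernel)) for row in x_rows]


rng = random.Random(0)
x_rows = [[rng.random() for _ in range(3)] for _ in range(4)]  # stand-in for x_data
eps = [rng.gauss(0.0, 1.0) for _ in range(3)]                  # zero-mean noise, drawn once
mean_w = 0.5                                                   # assumed value of the mean


def total(m):
    """Sum of all layer outputs for mean m, using the same fixed noise."""
    return sum(layer_output(x_rows, m, eps))


# Analytic gradient of the summed output: every x entry multiplies mean_w exactly once.
analytic_grad = sum(sum(row) for row in x_rows)

# Central finite difference, evaluated with the SAME noise on both sides.
h = 1e-6
numeric_grad = (total(mean_w + h) - total(mean_w - h)) / (2 * h)

print(analytic_grad, numeric_grad)
```

The sketch differentiates the sum of the outputs because tape.gradient does the same when its target is not a scalar. This also matches the deterministic output in the question, where each gradient entry is the column sum of a batch of x.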
Be careful: Although this approach might work for your purposes, note that calling the density function self.kernel_dist.prob would yield different results, since we took loc outside.
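A small numeric illustration of this caveat, under stated assumptions: normal_pdf is a hand-written univariate unit-scale density standing in for the distribution's prob method (it is not a TFP call), and the values of mean_w and eps are made up. Evaluating the zero-mean density at the shifted sample gives a different number than the intended density, while subtracting the mean first recovers it.

```python
import math


def normal_pdf(value, loc=0.0, scale=1.0):
    """Density of a univariate normal distribution (hypothetical helper)."""
    z = (value - loc) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))


mean_w = 0.5            # assumed value of the trainable mean
eps = 0.3               # assumed draw from the zero-mean distribution
sampled = eps + mean_w  # what the modified call() would use as a weight

# Zero-mean density evaluated directly at the shifted sample: not the intended value.
density_unshifted = normal_pdf(sampled, loc=0.0)

# Subtracting the mean before evaluating recovers the intended density.
density_corrected = normal_pdf(sampled - mean_w, loc=0.0)
density_intended = normal_pdf(sampled, loc=mean_w)

print(density_unshifted, density_corrected, density_intended)
```

So if the density of a sampled weight is needed, one option is to subtract self.mean_W from the sample before passing it to the zero-mean distribution.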