
Should reconstruction loss be computed as sum or average over image for variational autoencoders?

I am following this variational autoencoder tutorial: https://keras.io/examples/generative/vae/ .

I know the VAE loss function consists of a reconstruction loss, which compares the original image with the reconstruction, plus a KL loss. However, I'm a bit confused about whether the reconstruction loss is computed over the entire image (sum of squared differences) or per pixel (average of squared differences). My understanding is that the reconstruction loss should be per pixel (MSE), but the example code I am following multiplies the MSE by 28 x 28, the MNIST image dimensions. Is that correct? Furthermore, my assumption is that this makes the reconstruction loss term significantly larger than the KL loss, and I'm not sure we want that.

I tried removing the multiplication by (28 x 28), but this resulted in extremely poor reconstructions: essentially all the reconstructions looked the same regardless of the input. Can I use a lambda parameter to capture the tradeoff between the KL divergence and the reconstruction loss, or is that incorrect because the loss has a precise derivation (as opposed to just adding a regularization penalty)?

# Per-pixel binary cross-entropy, averaged, then rescaled by the
# number of pixels (28 * 28 for MNIST):
reconstruction_loss = tf.reduce_mean(
    keras.losses.binary_crossentropy(data, reconstruction)
)
reconstruction_loss *= 28 * 28
# KL divergence between the approximate posterior and a unit Gaussian:
kl_loss = 1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)
kl_loss = tf.reduce_mean(kl_loss)
kl_loss *= -0.5
total_loss = reconstruction_loss + kl_loss
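To see why dropping the 28 x 28 factor changes training so drastically, plug in some illustrative magnitudes (the numbers below are hypothetical, chosen only to show the relative scale of the two terms):

```python
# Hypothetical magnitudes, for illustration only.
mean_bce_per_pixel = 0.1  # average binary cross-entropy over the 784 pixels
kl_per_sample = 3.0       # a plausible size for the KL term

# Without the multiplier, the KL term dominates the total loss:
loss_without = mean_bce_per_pixel + kl_per_sample
# With the multiplier, the reconstruction term dominates:
loss_with = mean_bce_per_pixel * 28 * 28 + kl_per_sample

print(loss_without, loss_with)
```

With these numbers the KL term is 30x the reconstruction term in the first case, and roughly 1/26 of it in the second, which matches the behavior described in the question.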

The example

I'm familiar with that example, and I think the 28 x 28 multiplier is justified: the tf.reduce_mean(...) around binary_crossentropy takes the average loss over all the pixels in the image, which yields a number between 0 and 1, and the multiplier then rescales it by the number of pixels, recovering a per-image total. Here's another take with an external training loop for creating a VAE.
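The claim that averaging over pixels and then multiplying by the pixel count just recovers the per-image sum can be checked in two lines (the per-pixel loss values here are toy numbers, purely illustrative):

```python
# Toy per-pixel binary cross-entropy values for one image.
pixel_losses = [0.2, 0.4, 0.1, 0.3]

# Averaging, then rescaling by the pixel count, is identical to summing.
mean_then_rescale = (sum(pixel_losses) / len(pixel_losses)) * len(pixel_losses)
assert abs(mean_then_rescale - sum(pixel_losses)) < 1e-12
```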

The problem is posterior collapse

The above would not be an issue, since it is just multiplication by a constant, if not for the KL divergence term, as you point out. The KL loss acts as a regularizer that penalizes the encoder for producing latent-variable distributions that stray from the prior the decoder samples from (a unit Gaussian). Naturally the question arises: how much weight should the reconstruction loss get, and how much the penalty? This is an active area of research. Consider β-VAE, which purportedly disentangles representations by increasing the importance of the KL loss; on the other hand, increase β too much and you get a phenomenon known as posterior collapse. "Re-balancing Variational Autoencoder Loss for Molecule Sequence Generation" limits β to 0.1 to avoid the problem. But it may not even be that simple, as explained in "The Usual Suspects? Reassessing Blame for VAE Posterior Collapse". A thorough solution is proposed in "Diagnosing and Enhancing VAE Models", while "Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders" suggests that there is a simpler, deterministic (and better) way.
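As a sketch, the β-VAE objective simply replaces the implicit weight of 1 on the KL term with a tunable β (the function name and the numbers below are illustrative, not taken from any of the papers above):

```python
def beta_vae_loss(reconstruction_loss, kl_loss, beta=1.0):
    """β-VAE objective: beta > 1 strengthens the KL regularizer
    (encouraging disentanglement); beta << 1 relaxes it."""
    return reconstruction_loss + beta * kl_loss

# beta = 0.1, the cap used in the molecule-generation paper:
print(beta_vae_loss(10.0, 2.0, beta=0.1))  # 10.2
```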

Experimentation and Extension

For something simple like MNIST, and that example in particular, try experimenting. Keep the 28 x 28 term, and arbitrarily multiply kl_loss by a constant B, where 0 <= B < 28*28. Track the KL loss term and the reconstruction loss term during training, and compare them to the graphs in the first reference.
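A minimal sketch of that experiment (schematic only: recon_loss and kl_loss here are placeholder numbers standing in for values you would log during a real training run):

```python
PIXELS = 28 * 28

def weighted_total(recon_loss, kl_loss, B):
    """Keep the 28*28 factor on the reconstruction term; scale KL by B."""
    assert 0 <= B < PIXELS
    return recon_loss * PIXELS + B * kl_loss

# Sweep B, logging the total for each setting; in a real experiment you
# would also log the two terms separately and compare training curves.
for B in (0, 1, 10, 100, 500):
    print(B, weighted_total(0.1, 3.0, B))
```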

It isn't really necessary to multiply by the number of pixels. However, whether you do so or not will affect the way your fitting algorithm behaves with respect to the other hyperparameters: your lambda parameter and the learning rate. In essence, if you want to remove the multiplication by 28 x 28 but retain the same fitting behavior, you should divide lambda by 28 x 28 and then multiply your learning rate by 28 x 28. I think you were already approaching this idea in your question, and the piece you were missing is the adjustment to the learning rate.
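That equivalence can be verified on a single gradient-descent step: scaling the loss down by 784 while scaling the learning rate up by 784 leaves the parameter update unchanged (the gradient values below are made up purely for the check):

```python
N = 28 * 28
g_recon, g_kl = 0.5, -0.2   # hypothetical gradients of each term w.r.t. one weight
lam, lr = 1.0, 1e-3

# Original formulation: loss = N * recon + lam * kl, step size lr.
step_a = lr * (N * g_recon + lam * g_kl)
# Rescaled formulation: loss = recon + (lam / N) * kl, step size lr * N.
step_b = (lr * N) * (g_recon + (lam / N) * g_kl)

print(step_a, step_b)  # identical parameter updates
```

Note this exact cancellation holds for plain SGD; adaptive optimizers such as Adam normalize gradient magnitudes, so there the rescaling matters less to begin with.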
