

How to choose cross-entropy loss in TensorFlow?

Classification problems, such as logistic regression or multinomial logistic regression, optimize a cross-entropy loss. Normally, the cross-entropy layer follows the softmax layer, which produces a probability distribution.

In TensorFlow, there are at least a dozen different cross-entropy loss functions:

  • tf.losses.softmax_cross_entropy
  • tf.losses.sparse_softmax_cross_entropy
  • tf.losses.sigmoid_cross_entropy
  • tf.contrib.losses.softmax_cross_entropy
  • tf.contrib.losses.sigmoid_cross_entropy
  • tf.nn.softmax_cross_entropy_with_logits
  • tf.nn.sigmoid_cross_entropy_with_logits
  • ...

Which ones work only for binary classification, and which are suitable for multi-class problems? When should you use sigmoid instead of softmax? How are the sparse functions different from the others, and why are they softmax only?

Related (more math-oriented) discussion: What are the differences between all these cross-entropy losses in Keras and TensorFlow?

Preliminary facts

  • In the functional sense, the sigmoid is a special case of the softmax function when the number of classes equals 2. Both of them do the same operation: transform the logits (see below) into probabilities (see the short derivation after this list).

    In simple binary classification, there's no big difference between the two; however, in the case of multinomial classification, sigmoid allows you to deal with non-exclusive labels (aka multi-labels), while softmax deals with exclusive classes (see below).

  • A logit (also called a score) is a raw, unscaled value associated with a class, before computing the probability. In terms of neural network architecture, this means that a logit is the output of a dense (fully-connected) layer.

    TensorFlow naming is a bit strange: all of the functions below accept logits, not probabilities, and apply the transformation themselves (which is simply more efficient).
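For reference, here is a one-line check (in my own notation) that a two-class softmax over the logits (x, 0) reduces to the sigmoid:

\mathrm{softmax}\big([x, 0]\big)_1 = \frac{e^{x}}{e^{x} + e^{0}} = \frac{1}{1 + e^{-x}} = \sigma(x)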

Sigmoid functions family

As stated earlier, the sigmoid loss function is for binary classification. But the TensorFlow functions are more general and also allow multi-label classification, when the classes are independent. In other words, tf.nn.sigmoid_cross_entropy_with_logits solves N binary classifications at once.

The labels must be one-hot encoded or can contain soft class probabilities.

tf.losses.sigmoid_cross_entropy in addition allows you to set the in-batch weights, i.e. make some examples more important than others. tf.nn.weighted_cross_entropy_with_logits allows you to set class weights (remember, the classification is binary), i.e. make positive errors larger than negative errors. This is useful when the training data is unbalanced.
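A minimal sketch of the sigmoid family in TensorFlow 1.x; the logits, labels and weight values are illustrative assumptions, not taken from the question:

import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5],
                      [-0.3, 1.2, 3.0]])   # [batch_size=2, num_classes=3]
labels = tf.constant([[1.0, 0.0, 1.0],
                      [0.0, 1.0, 1.0]])    # independent (multi-label) targets

# Per-element losses of shape [batch_size, num_classes]; reduce them yourself.
per_element = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(per_element)

# Same loss with in-batch weights handled for you (returns a scalar by default).
weighted = tf.losses.sigmoid_cross_entropy(
    multi_class_labels=labels, logits=logits,
    weights=tf.constant([[1.0], [2.0]]))   # the second example counts twice

# Class weighting for unbalanced data: positive errors cost pos_weight times more.
class_weighted = tf.nn.weighted_cross_entropy_with_logits(
    targets=labels, logits=logits, pos_weight=3.0)

with tf.Session() as sess:
    print(sess.run([loss, weighted, tf.reduce_mean(class_weighted)]))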

Softmax functions family

These loss functions should be used for multinomial, mutually exclusive classification, i.e. picking one out of N classes. They are also applicable when N = 2.

The labels must be one-hot encoded or can contain soft class probabilities: a particular example can belong to class A with 50% probability and to class B with 50% probability. Note that, strictly speaking, this doesn't mean that it belongs to both classes, but one can interpret the probabilities this way.

Just like in the sigmoid family, tf.losses.softmax_cross_entropy allows you to set the in-batch weights, i.e. make some examples more important than others. As far as I know, as of TensorFlow 1.3, there's no built-in way to set class weights.

[UPD] In TensorFlow 1.5, the v2 version was introduced and the original softmax_cross_entropy_with_logits loss was deprecated. The only difference between them is that in the newer version, backpropagation happens into both logits and labels (here's a discussion of why this may be useful).
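A minimal sketch of the softmax family in TensorFlow 1.x; the values are illustrative assumptions:

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 0.2, 3.0]])    # [batch_size=2, num_classes=3]
labels = tf.constant([[1.0, 0.0, 0.0],     # one-hot (or soft) exclusive targets
                      [0.0, 0.0, 1.0]])

# Per-example losses of shape [batch_size]; _v2 is the non-deprecated variant in TF >= 1.5.
per_example = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
loss = tf.reduce_mean(per_example)

# tf.losses variant with in-batch weights (one weight per example).
weighted = tf.losses.softmax_cross_entropy(
    onehot_labels=labels, logits=logits, weights=tf.constant([1.0, 2.0]))

with tf.Session() as sess:
    print(sess.run([loss, weighted]))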

Sparse functions family

Like the ordinary softmax above, these loss functions should be used for multinomial, mutually exclusive classification, i.e. picking one out of N classes. The difference is in the label encoding: the classes are specified as integers (class indices), not one-hot vectors. Obviously, this doesn't allow soft classes, but it can save some memory when there are thousands or millions of classes. However, note that the logits argument must still contain logits for each class, so it consumes at least [batch_size, classes] memory.

Like above, the tf.losses version has a weights argument which allows you to set the in-batch weights.
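A minimal sketch of the sparse variants (TensorFlow 1.x, illustrative values): the logits keep the shape [batch_size, num_classes], only the labels change:

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 0.2, 3.0]])    # still [batch_size, num_classes]
labels = tf.constant([0, 2])               # integer class indices, shape [batch_size]

per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# tf.losses variant with in-batch weights.
weighted = tf.losses.sparse_softmax_cross_entropy(
    labels=labels, logits=logits, weights=tf.constant([1.0, 2.0]))

with tf.Session() as sess:
    print(sess.run([tf.reduce_mean(per_example), weighted]))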

Sampled softmax functions family

These functions provide another alternative for dealing with a huge number of classes. Instead of computing and comparing an exact probability distribution, they compute a loss estimate from a random sample.

The arguments weights and biases specify a separate fully-connected layer that is used to compute the logits for a chosen sample.

Like above, labels are not one-hot encoded, but have the shape [batch_size, num_true].

Sampled functions are only suitable for training. At test time, it's recommended to use a standard softmax loss (either sparse or one-hot) to get an actual distribution.
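A minimal sketch of this train/test split with tf.nn.sampled_softmax_loss (TensorFlow 1.x); all sizes, variable names and the hidden representation are illustrative assumptions:

import tensorflow as tf

num_classes, dim, batch_size = 100000, 128, 32

# The separate fully-connected layer mentioned above: weights and biases for all classes.
out_weights = tf.get_variable("out_w", [num_classes, dim])
out_biases = tf.get_variable("out_b", [num_classes])

hidden = tf.random_normal([batch_size, dim])    # output of the network, [batch_size, dim]
labels = tf.random_uniform([batch_size, 1], maxval=num_classes, dtype=tf.int64)  # [batch_size, num_true]

# Training: the loss is estimated from 64 sampled classes instead of all 100000.
train_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=out_weights, biases=out_biases,
    labels=labels, inputs=hidden,
    num_sampled=64, num_classes=num_classes))

# Test time: compute the full logits and use the exact (sparse) softmax loss.
full_logits = tf.matmul(hidden, out_weights, transpose_b=True) + out_biases
eval_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.squeeze(labels, axis=1), logits=full_logits))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([train_loss, eval_loss]))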

Another alternative loss is tf.nn.nce_loss, which performs noise-contrastive estimation (if you're interested, see this very detailed discussion). I've included this function in the softmax family, because NCE guarantees approximation to softmax in the limit.

However, for version 1.5, softmax_cross_entropy_with_logits_v2 must be used instead, and its arguments have to be passed by keyword (argument key=...), for example:

softmax_cross_entropy_with_logits_v2(_sentinel=None, labels=y,
                                     logits=my_prediction, dim=-1, name=None)

While it is great that the accepted answer contains a lot more info than what was asked, I felt that sharing a few generic rules of thumb would make the answer more compact and intuitive:

  • There is just one real loss function. This is cross-entropy (CE). For the special case of binary classification, this loss is called binary CE (note that the formula does not change), and for non-binary or multi-class situations the same is called categorical CE (CCE). Sparse functions are a special case of categorical CE where the expected values are not one-hot encoded but are integers.
  • We have the softmax formula, which is the activation for the multi-class scenario. For the binary scenario, the same formula is given a special name - sigmoid activation.
  • Because there are sometimes numerical instabilities (for extreme values) when dealing with logarithmic functions, TF recommends combining the activation layer and the loss layer into one single function. This combined function is numerically more stable. TF provides these combined functions, and they are suffixed with _with_logits (see the sketch after this list).
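A minimal sketch of that stability point (TensorFlow 1.x, deliberately extreme values of my own choosing): the naive two-step sigmoid-then-log computation blows up, while the fused _with_logits version stays finite.

import tensorflow as tf

logits = tf.constant([[-1000.0], [0.0], [1000.0]])   # deliberately extreme logits
labels = tf.constant([[1.0], [0.0], [1.0]])

# Naive two-step computation: sigmoid, then cross-entropy by hand.
probs = tf.sigmoid(logits)
naive = -(labels * tf.log(probs) + (1.0 - labels) * tf.log(1.0 - probs))

# Fused, numerically stable computation.
fused = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(naive))   # inf and nan for the extreme logits
    print(sess.run(fused))   # finite everywhere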

With this, let us now approach some situations. Say there is a simple binary classification problem - is a cat present in the image or not? What is the choice of activation and loss function? It will be a sigmoid activation and a (binary) CE. So one could use sigmoid_cross_entropy or, preferably, sigmoid_cross_entropy_with_logits. The latter combines the activation and the loss function and is supposed to be numerically stable.

How about multi-class classification? Say we want to know if a cat, a dog, or a donkey is present in the image. What is the choice of activation and loss function? It will be a softmax activation and a (categorical) CE. So one could use softmax_cross_entropy or, preferably, softmax_cross_entropy_with_logits. We assume that the expected value is one-hot encoded (100, 010 or 001). If (for some weird reason) this is not the case and the expected value is an integer (1, 2 or 3), you could use the 'sparse' counterparts of the above functions.

There could be a third case. We could have a multi-label classification. So there could be a dog and a cat in the same image. How do we handle this? The trick here is to treat this situation as multiple binary classification problems - basically cat or no cat, dog or no dog, and donkey or no donkey. Find the loss for each of the 3 (binary classifications) and then add them up. So essentially this boils down to using the sigmoid_cross_entropy_with_logits loss.
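A short sketch of that third case (TensorFlow 1.x, hypothetical values): one column per label (cat, dog, donkey), one independent binary CE per column, summed per image.

import tensorflow as tf

logits = tf.constant([[3.0, 1.0, -2.0]])   # the image contains a cat and a dog
labels = tf.constant([[1.0, 1.0, 0.0]])    # [cat, dog, donkey]

per_label = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
per_image = tf.reduce_sum(per_label, axis=1)   # sum of the 3 binary CE losses

with tf.Session() as sess:
    print(sess.run(per_image))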

This answers the 3 specific questions you asked. The functions shared above are all that is needed. You can ignore the tf.contrib family, which is deprecated and should not be used.
