What are the differences between all these cross-entropy losses in Keras and TensorFlow?

What are the differences between all these cross-entropy losses?

Keras is talking about

  • Binary cross-entropy
  • Categorical cross-entropy
  • Sparse categorical cross-entropy

While TensorFlow has

  • Softmax cross-entropy with logits
  • Sparse softmax cross-entropy with logits
  • Sigmoid cross-entropy with logits

What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?

There is just one cross (Shannon) entropy, defined as:

H(P||Q) = - SUM_i P(X=i) log Q(X=i)

In machine learning usage, P is the actual (ground truth) distribution, and Q is the predicted distribution. All the functions you listed are just helper functions which accept different ways of representing P and Q.
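A minimal NumPy sketch of this definition (the variable names p_true and q_pred are just illustrative):

    import numpy as np

    def cross_entropy(p_true, q_pred, eps=1e-12):
        """H(P||Q) = - SUM_i P(X=i) log Q(X=i)."""
        q_pred = np.clip(q_pred, eps, 1.0)       # avoid log(0)
        return -np.sum(p_true * np.log(q_pred))

    p = np.array([0.0, 0.0, 1.0])    # hard target: the true class is index 2
    q = np.array([0.1, 0.2, 0.7])    # predicted distribution
    print(cross_entropy(p, q))       # ~0.357, i.e. -log(0.7)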

There are basically 3 main things to consider:

  • there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one per each Q(X=...))

  • one either produces proper probabilities (meaning that Q(X=i)>=0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score into a probability. For example, a single real number can be "transformed into a probability" by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.

  • there is a j such that P(X=j)=1 (there is one "true class", targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but for 40% it is actually a dog").

Depending on these three aspects, a different helper function should be used:

                                  outcomes     what is in Q    targets in P   
-------------------------------------------------------------------------------
binary CE                                2      probability         any
categorical CE                          >2      probability         soft
sparse categorical CE                   >2      probability         hard
sigmoid CE with logits                   2      score               any
softmax CE with logits                  >2      score               soft
sparse softmax CE with logits           >2      score               hard

In the end one could just use "categorical cross entropy", as this is how it is mathematically defined; however, since things like hard targets or binary classification are very popular, modern ML libraries do provide these additional helper functions to make things simpler. In particular, "stacking" sigmoid and cross entropy might be numerically unstable, but if one knows that these two operations are applied together, there is a numerically stable version of them combined (which is implemented in TF).
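As a rough illustration of that numerical-stability point, a small sketch assuming TensorFlow 2.x (the example values are arbitrary):

    import tensorflow as tf

    logits = tf.constant([[8.0], [-8.0], [0.5]])
    labels = tf.constant([[1.0], [0.0], [1.0]])

    # Naive "stacking": apply sigmoid first, then cross entropy on the probabilities.
    probs = tf.sigmoid(logits)
    naive = tf.keras.losses.binary_crossentropy(labels, probs)

    # Fused, numerically stable helper that works directly on the raw scores (logits).
    fused = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

    print(naive.numpy(), tf.squeeze(fused).numpy())  # nearly identical here; the fused form stays stable for extreme logits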

It is important to notice that if you apply the wrong helper function the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper for binary classification with a single output, your network will be considered to always produce "True" at the output.
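A tiny sketch of that failure mode (assuming TensorFlow 2.x): softmax over a single output unit is always 1, regardless of the logit, while sigmoid behaves as intended:

    import tensorflow as tf

    single_logit = tf.constant([[-3.0], [0.0], [5.0]])    # one output unit per example
    print(tf.nn.softmax(single_logit, axis=-1).numpy())   # [[1.], [1.], [1.]] for any input
    print(tf.sigmoid(single_logit).numpy())               # ~[[0.047], [0.5], [0.993]], as expected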

As a final note, this answer considers classification. It is slightly different when you consider the multi-label case (when a single point can have multiple labels), as then the Ps do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.
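A short sketch of that multi-label case (assuming TensorFlow 2.x; the labels and logits are made up): each example can have several correct labels, so one independent sigmoid cross entropy is computed per output unit:

    import tensorflow as tf

    logits = tf.constant([[2.0, -1.0, 0.5]])   # 3 output units, raw scores (no softmax)
    labels = tf.constant([[1.0, 0.0, 1.0]])    # this example carries labels 0 and 2

    loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    print(loss.numpy())                        # one binary cross entropy per label
    print(tf.reduce_mean(loss).numpy())        # usually averaged into a single scalar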

Logits

For this purpose, "logits" can be seen as the non-activated outputs of the model.

  • While Keras losses always take an "activated" output (you must apply "sigmoid" or "softmax" before the loss)
  • TensorFlow takes them with "logits" or "non-activated" (you should not apply "sigmoid" or "softmax" before the loss)

Losses "with logits" will apply the activation internally. “带有 logits”的损失将在内部应用激活。 Some functions allow you to choose logits=True or logits=False , which will tell the function whether to "apply" or "not apply" the activations.某些函数允许您选择logits=Truelogits=False ,这将告诉函数是“应用”还是“不应用”激活。


Sparse

  • Sparse functions use the target data (ground truth) as "integer labels": 0, 1, 2, 3, 4...
  • Non-sparse functions use the target data as "one-hot labels": [1,0,0], [0,1,0], [0,0,1] (see the sketch below)
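A quick sketch (assuming tf.keras) of the same targets written both ways; the two losses return identical values:

    import tensorflow as tf

    probs = tf.constant([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1]])      # already-activated predictions

    # Sparse: integer class indices.
    sparse_labels = tf.constant([0, 1])
    print(tf.keras.losses.sparse_categorical_crossentropy(sparse_labels, probs).numpy())

    # Non-sparse: one-hot vectors encoding the same classes.
    onehot_labels = tf.constant([[1.0, 0.0, 0.0],
                                 [0.0, 1.0, 0.0]])
    print(tf.keras.losses.categorical_crossentropy(onehot_labels, probs).numpy())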

Binary crossentropy = Sigmoid crossentropy

  • Problem type:
    • single class (false/true); or
    • non-exclusive multiclass (many classes may be correct)
  • Model output shape: (batch, ..., >=1)
  • Activation: "sigmoid"

Categorical crossentropy = Softmax crossentropy

  • Problem type: exclusive classes (only one class may be correct)
  • Model output shape: (batch, ..., >=2)
  • Activation: "softmax" (both pairings are sketched below)
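A minimal tf.keras sketch of the two pairings (layer sizes and the input shape are illustrative):

    import tensorflow as tf

    # Binary / non-exclusive multi-label: sigmoid outputs + binary crossentropy.
    binary_model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    binary_model.compile(optimizer="adam", loss="binary_crossentropy")

    # Exclusive classes: softmax outputs + categorical crossentropy
    # (use "sparse_categorical_crossentropy" if the labels are integers).
    multiclass_model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(5, activation="softmax"),
    ])
    multiclass_model.compile(optimizer="adam", loss="categorical_crossentropy")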
