numpy - why numerical gradient log(1-sigmoid(x)) diverges but log(sigmoid(x)) does not?
Why does the numerical gradient (f(x+k)-f(x-k)) / 2k of the logistic log loss function f(x) = -np.log(1.0 - __sigmoid(x)) diverge, while that of -np.log(__sigmoid(x)) does not? What are the potential causes and mechanisms, or am I making a mistake? The code is at the bottom.
Any suggestions, corrections, insights, resource references, or advice/tips/hints on how to implement the numerical gradient will be appreciated.
I am trying to implement the numerical gradient (f(x+k)-f(x-k)) / 2k of the logistic log loss function. In the figure, y is the binary true/false label T and p is the activation sigmoid(x).
[Figure: Logistic log loss functions]
When k is relatively large, such as 1e-5, the issue does not happen, at least within the plotted range of x.
However, when k gets smaller, e.g. 1e-08, -np.log(1.0 - __sigmoid(x)) starts diverging. It does not happen to -np.log(__sigmoid(x)).
I wonder whether the subtraction 1.0 - sigmoid(x) has something to do with how floating-point numbers are stored and calculated in binary on a computer.
The reason for trying to make k smaller: to prevent log(0) from becoming np.inf, I add a small number u, e.g. 1e-5, but log(x+1e-5) causes the numerical gradient to deviate from the analytical one. To minimize the impact, I try to make it as small as possible, and that is when this issue starts.
import numpy as np
import inspect
from itertools import product
import matplotlib.pyplot as plt
%matplotlib inline
def __sigmoid(X):
    # Logistic sigmoid activation
    return 1 / (1 + np.exp(-X))

def __logistic_log_loss(X: np.ndarray, T: np.ndarray):
    # Binary cross-entropy of sigmoid(X) against labels T
    return -(T * np.log(__sigmoid(X)) + (1-T) * np.log(1-__sigmoid(X)))

def __logistic_log_loss_gradient(X, T):
    # Analytical gradient dL/dX = sigmoid(X) - T
    Z = __sigmoid(X)
    return Z - T
N = 1000
left=-20
right=20
X = np.linspace(left,right,N)
T0 = np.zeros(N)
T1 = np.ones(N)
# --------------------------------------------------------------------------------
# T = 1
# --------------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(
X,
__logistic_log_loss(X, T1),
color='blue', linestyle='solid',
label="logistic_log_loss(X, T=1)"
)
ax.plot(
X,
__logistic_log_loss_gradient(X, T1),
color='navy', linestyle='dashed',
label="Analytical gradient(T=1)"
)
# --------------------------------------------------------------------------------
# T = 0
# --------------------------------------------------------------------------------
ax.plot(
X,
__logistic_log_loss(X, T0),
color='magenta', linestyle='solid',
label="logistic_log_loss(X, T=0)"
)
ax.plot(
X,
__logistic_log_loss_gradient(X, T0),
color='purple', linestyle='dashed',
label="Analytical gradient(T=0)"
)
ax.set_xlabel("X")
ax.set_ylabel("dL/dX")
ax.set_title("Logistic log loss and gradient")
ax.legend()
ax.grid(True)
def t_0_loss(X):
    # logistic_log_loss(P=sigmoid(x), T=0)
    return [-np.log(1.0 - __sigmoid(x)) for x in X]

def t_1_loss(X):
    # logistic_log_loss(P=sigmoid(x), T=1)
    return [-np.log(__sigmoid(x)) for x in X]
N = 1000
left=-1
right=15
# Numerical gradient
# (f(x+k)-f(x-k)) / 2k
k = 1e-9
X = np.linspace(left,right,N)
fig, axes = plt.subplots(1, 2, figsize=(10,8))
# --------------------------------------------------------------------------------
# T = 0
# --------------------------------------------------------------------------------
axes[0].plot(
X,
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X - k))) / (2*k)),
color='red', linestyle='solid',
label="Diffed numerical gradient(T=0)"
)
axes[0].plot(
X[0:-1:20],
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X))) / k)[0:-1:20],
color='black', linestyle='dotted', marker='x', markersize=4,
label="Left numerical gradient(T=0)"
)
axes[0].plot(
X[0:-1:20],
((np.array(t_0_loss(X)) - np.array(t_0_loss(X - k))) / k)[0:-1:20],
color='salmon', linestyle='dotted', marker='o', markersize=5,
label="Right numerical gradient(T=0)"
)
axes[0].set_xlabel("X")
axes[0].set_ylabel("dL/dX")
axes[0].set_title("T=0: -log(1-sigmoid(x))")
axes[0].legend()
axes[0].grid(True)
# --------------------------------------------------------------------------------
# T = 1
# --------------------------------------------------------------------------------
axes[1].plot(
X,
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X - k))) / (2*k)),
color='blue', linestyle='solid',
label="Diffed numerical gradient(T=1)"
)
axes[1].plot(
X[0:-1:20],
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X))) / k)[0:-1:20],
color='cyan', linestyle='dashed', marker='x', markersize=5,
label="Left numerical gradient(T=1)"
)
axes[1].plot(
X[0:-1:20],
((np.array(t_1_loss(X)) - np.array(t_1_loss(X - k))) / k)[0:-1:20],
color='yellow', linestyle='dotted', marker='o', markersize=5,
label="Right numerical gradient(T=1)"
)
axes[1].set_xlabel("X")
axes[1].set_ylabel("dL/dX")
axes[1].set_title("T=1: -log(sigmoid(x))")
axes[1].legend()
axes[1].grid(True)
Whenever a real number is converted to (or computed in) a limited-precision numerical format, there may be some amount of error. Suppose that, in a particular interval, the numerical format is capable of representing values with a precision of one part in P. In other words, the numbers representable in the format appear at distances that are about 1/P apart relative to the magnitudes of the numbers.
Then, when a real number is converted to the format, resulting in a representable number, the error (ignoring sign) is at most ½ of 1/P (relative to the magnitude) if we choose the nearest representable number. It may be smaller if the real number happens to fall on or near a representable number.
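For IEEE-754 double precision, which NumPy uses by default, P is about 2^52, so representable numbers near 1.0 are spaced about 2.2e-16 apart. A quick way to see this (my own snippet, not part of the original answer):
import numpy as np
print(np.finfo(np.float64).eps)  # ~2.220446049250313e-16: the gap from 1.0 to the next double
print(np.spacing(1.0))           # the same spacing, queried directly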
Now consider your expression f(x+k)-f(x-k). f(x+k) and f(x-k) will each have some error around ¼ of 1/P, maybe more if they are the results of several calculations, maybe less if you are lucky. But, for a simple model, we can figure the error will be somewhere in the region of 1/P. When we subtract them, the error may still be somewhere in the region of 1/P. The errors in f(x+k) and f(x-k) may reinforce or may cancel in the subtraction, so sometimes you will get very little total error, but it will often be somewhere around 1/P.
In your situation, f(x+k) and f(x-k) are very near each other. So, when they are subtracted, the result is much smaller in magnitude than they are. That error of around 1/P is relative to the magnitudes of f(x+k) and f(x-k). Since f(x+k)-f(x-k) is very small compared to f(x+k) and f(x-k), an error of 1/P relative to f(x+k) and f(x-k) is much larger relative to f(x+k)-f(x-k).
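To put numbers on this (my own illustration, using the question's k = 1e-9 at x = 15): f(x+k) and f(x-k) are both about 15.0000003, and their true difference is about 2k·f'(x) ≈ 2e-9, so even a tiny absolute error in each evaluation dominates the difference.
import numpy as np
k = 1e-9
x = 15.0
f = lambda v: -np.log(1.0 - 1.0 / (1.0 + np.exp(-v)))  # naive T=0 loss
a, b = f(x + k), f(x - k)
print(a, b)             # agree to roughly 10 significant digits
print((a - b) / (2*k))  # should be sigmoid(15) ~ 0.9999997, but is visibly corrupted by rounding noise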
This is the source of most of the noise in your graph.
To avoid it, you need to calculate f(x+k)-f(x-k) with more precision, or you need to avoid that calculation altogether.
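For the "more precision" route, here is a sketch of my own using the third-party mpmath library (not part of the original answer): evaluating the naive formula with 50 decimal digits makes the cancellation harmless in this range of x. The solution below takes the second route instead, avoiding the unstable calculation entirely.
from mpmath import mp

mp.dps = 50  # work with 50 significant decimal digits

def f(x):
    # the same naive T=0 loss, but in high-precision arithmetic
    return -mp.log(1 - 1 / (1 + mp.exp(-x)))

k = mp.mpf("1e-9")
x = mp.mpf(15)
print((f(x + k) - f(x - k)) / (2 * k))  # ~0.99999969..., matching sigmoid(15) with no visible noise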
Solution by Reza.B.
Let z = 1/(1+p), where p = e^(-x). You can then see that log(1-z) = log(p) - log(1+p), which is more stable in terms of rounding errors (we got rid of the subtraction from 1, which is the main issue in the numerical instability).
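Spelling out the algebra (my restatement, in the question's notation): since z = sigmoid(x) = 1/(1+p) with p = e^(-x),
1 - z = p/(1+p)
log(1-z) = log(p) - log(1+p) = -x - log(1 + e^(-x))
so the two losses become
-log(1 - sigmoid(x)) = x + log(1 + e^(-x))   (T = 0)
-log(sigmoid(x))     = log(1 + e^(-x))       (T = 1)
which is exactly what the corrected code below computes.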
The errors have been resolved.
def t_0_loss(X):
    # T=0 loss rewritten as x + log(1 + e^(-x)): no 1 - sigmoid(x) cancellation
    L = X + np.log(1 + np.exp(-X))
    return L.tolist()

def t_1_loss(X):
    # T=1 loss rewritten as log(1 + e^(-x))
    L = np.log(1 + np.exp(-X))
    return L.tolist()
%%timeit
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X - k))) / (2*k))
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X - k))) / (2*k))
---
599 µs ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The previous, erroneous version was:
47 ms ± 617 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
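One further caveat of my own (not part of the accepted fix): np.log(1 + np.exp(-X)) itself overflows once -X exceeds roughly 709, because np.exp(-X) leaves the float64 range. For inputs far outside the range plotted here, NumPy's built-in np.logaddexp evaluates log(exp(a) + exp(b)) stably and expresses both losses directly, since x + log(1 + e^(-x)) = log(e^x + 1). A sketch (returning arrays rather than lists):
import numpy as np

def t_0_loss(X):
    # x + log(1 + e^(-x)) == log(e^x + 1), computed without overflow
    return np.logaddexp(0.0, X)

def t_1_loss(X):
    # log(1 + e^(-x)), computed without overflow
    return np.logaddexp(0.0, -X)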