scipy curve_fit is incorrect for large X values

Question

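To determine the trend over a period of time, I use scipy curve_fit with X values from time.time(), for example 1663847528.7147126 (1.6 billion). Doing a linear fit sometimes produces wrong results, and providing approximate initial p0 values doesn't help. I found that the magnitude of X is the crucial factor for this error, and I wonder why.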

Here is a simple snippet that shows a working and a non-working X offset:

import scipy.optimize

def fit_func(x, a, b):
    return a + b * x

y = list(range(5))

x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
#   (array([-1.e+08,  1.e+00]), array([[ 0., -0.],
#          [-0.,  0.]]))

x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
#   OptimizeWarning: Covariance of the parameters could not be estimated
#   warnings.warn('Covariance of the parameters could not be estimated',
#   (array([-4.53788811e+08,  4.53788812e-01]), array([[inf, inf],
#          [inf, inf]]))

# An almost perfect p0 for b removes the warning, but curve_fit still doesn't work
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
#   (array([-7.60846335e+10,  7.60846334e+01]), array([[-1.97051972e+19,  1.97051970e+10],
#          [ 1.97051970e+10, -1.97051968e+01]]))
   
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09,  1.e+00]), array([[inf, inf],
#       [inf, inf]]))

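As a side question, is there perhaps a more efficient method for a linear fit? Sometimes, though, I want to find a second-order polynomial fit.

Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.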

Answer 1

If you only need to compute a linear fit, I believe curve_fit is not necessary; I would use SciPy's linregress function instead:

>>> from scipy import stats

>>> y = list(range(5))

>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)

>>> x2 = [1e9 + a for a in range(5)]
>>> stats.linregress(x2, y)
LinregressResult(slope=1.0, intercept=-1000000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)

In general, if you need a polynomial fit, I would use NumPy polyfit.
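As a rough sketch of the polynomial route: NumPy's documentation recommends the newer numpy.polynomial API over np.polyfit, and Polynomial.fit maps x onto the window [-1, 1] before solving, which suits large timestamp-like values. The degree-2 data below is made up for illustration; only the degree-1 data comes from the question.

```python
import numpy as np
from numpy.polynomial import Polynomial

k = np.arange(5.0)
x = 1e9 + k  # large, timestamp-like x values

# Degree-1 fit of y = k (the data from the question).
line = Polynomial.fit(x, k, deg=1)

# Degree-2 fit of synthetic data y = k**2 (invented for this example).
parabola = Polynomial.fit(x, k**2, deg=2)

# The fitted objects are evaluated directly at the original x values;
# the mapping to [-1, 1] is applied internally.
print(line(x))
print(parabola(x))
```

Evaluating the fitted object directly avoids handling raw-scale coefficients, which for x around 1e9 are again very large numbers.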

Answer 2

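Root cause

You are facing two problems:

The fitting procedure is scale sensitive. This means that the units chosen for a specific variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. one variable is several orders of magnitude larger than another and dominates the regression);
Floating-point arithmetic error. When switching from 1e8 to 1e9, you reach the magnitude at which this kind of error becomes dominant.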
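One way to make the scale sensitivity visible is to look at the condition number of the design matrix of a straight-line model. This is only an illustration, not what curve_fit computes internally; the helper name design_condition is hypothetical.

```python
import numpy as np

def design_condition(x):
    """Condition number of the design matrix [1, x] of a straight-line fit."""
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.cond(A)

k = np.arange(5.0)
x_raw = 1e9 + k
x_std = (x_raw - x_raw.mean()) / x_raw.std()

cond_raw = design_condition(x_raw)  # huge: the two columns are almost collinear
cond_std = design_condition(x_std)  # about 1: the two columns are orthogonal
print(cond_raw, cond_std)
```

With the raw x values the constant column and the x column point in nearly the same direction, so tiny rounding errors translate into large parameter errors; after standardization the columns are orthogonal.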

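The second problem is very important to realize. Suppose you are limited to a representation with 8 significant digits: then 1 000 000 000 and 1 000 000 001 are the same number, because both are written as 1.0000000e9 and we cannot accurately represent 1.0000000_e9, which requires one more digit (_). This is why your second example fails. (Double precision carries about 16 significant digits rather than 8, so take this as an analogy: the same loss appears once intermediate quantities in the fit need more digits than are available.)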
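The digit-limit argument can be reproduced with NumPy's float32 type, which keeps roughly 7 significant decimal digits. This is just a small demonstration of the analogy, not a claim that curve_fit uses float32.

```python
import numpy as np

a, b = 1_000_000_000, 1_000_000_001

# float32 keeps about 7 significant decimal digits: the two values collide.
same_in_float32 = np.float32(a) == np.float32(b)

# float64 keeps about 16 significant decimal digits: the values stay distinct.
same_in_float64 = np.float64(a) == np.float64(b)

# Gap between adjacent representable numbers near 1e9 in each format.
gap32 = np.spacing(np.float32(1e9))
gap64 = np.spacing(np.float64(1e9))

print(same_in_float32, same_in_float64)
print(gap32, gap64)
```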

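Additionally, you are using a nonlinear least-squares algorithm to solve a linear least-squares problem, but this is unrelated to your issue.

You have two solutions:

Increase the machine precision while performing the computation;
Normalize your problem.

I would choose the second one, as it is more generic.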
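For completeness, here is one possible reading of the first option: evaluate the closed-form least-squares line in exact rational arithmetic with Python's fractions module. The function name exact_linear_fit is hypothetical, and this sketch only covers straight lines, which is part of why normalization is the more generic choice.

```python
from fractions import Fraction

def exact_linear_fit(xs, ys):
    """Closed-form least-squares line, evaluated in exact rational arithmetic."""
    xs = [Fraction(v) for v in xs]
    ys = [Fraction(v) for v in ys]
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    sxy = sum(u * v for u, v in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

x = [1e9 + k for k in range(5)]
y = list(range(5))
slope, intercept = exact_linear_fit(x, y)
print(float(slope), float(intercept))
```

Because no rounding happens inside the sums, the subtraction of two nearly equal large numbers in the denominator is harmless here.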

Normalization

To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:

import numpy as np
import scipy.optimize

y = np.arange(5)
x = 1e9 + y

def fit_func(x, a, b):
    return a + b * x

xm = np.mean(x)         # 1000000002.0
xs = np.std(x)          # 1.4142135623730951

result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)

# (array([2.        , 1.41421356]),
# array([[0., 0.],
#        [0., 0.]]))

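# Back transformation. Note: here a is the slope and b the intercept,
# the opposite of the parameter names in fit_func.
a = result[0][1]/xs                    # slope: 1.0
b = result[0][0] - xm*result[0][1]/xs  # intercept: -1000000000.0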

Or get the same result using the sklearn interface:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])

pipe.fit(x.reshape(-1, 1), y)

pipe.named_steps["scaler"].mean_          # array([1.e+09])
pipe.named_steps["scaler"].scale_         # array([1.41421356])
pipe.named_steps["regressor"].coef_       # array([1.41421356])
pipe.named_steps["regressor"].intercept_  # 2.0
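For timestamp data like in the question, a lighter variant of the same idea may already be enough: subtract a reference time before calling curve_fit and shift the intercept back afterwards. This is a sketch under the assumption that the spread of the x values is of order one; the choice of x[0] as reference is arbitrary.

```python
import numpy as np
import scipy.optimize

def fit_func(x, a, b):
    return a + b * x

y = np.arange(5.0)
x = 1e9 + y  # stand-in for time.time() samples

x0 = x[0]  # reference time (arbitrary choice for this sketch)
popt, pcov = scipy.optimize.curve_fit(fit_func, x - x0, y)

slope = popt[1]                   # unchanged by the shift
intercept = popt[0] - slope * x0  # intercept on the original x axis
print(slope, intercept)
```

In practice it is often more convenient to keep working with the shifted time axis and never convert the intercept back, since the raw-scale intercept amplifies any small error in the slope by a factor of about 1e9.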

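Back transformation

Indeed, when you normalize, the fit result is expressed in terms of the normalized variable. To get the desired fit parameters, you just need to do a bit of math to convert the regressed parameters back to the original variable scale.

Simply write down and solve the transformation: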

 y = x'*a' + b'
x' = (x - m)/s
 y = x*a + b

This gives you the following solution:

a = a'/s
b = b' - m/s*a'
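The formulas can be checked numerically. The sketch below fits the standardized variable with np.linalg.lstsq (swapped in here only to keep the check short, it is not the curve_fit call used earlier) and then applies the back transformation; a_p and b_p stand for a' and b'.

```python
import numpy as np

y = np.arange(5.0)
x = 1e9 + y

m, s = x.mean(), x.std()
z = (x - m) / s  # x' in the formulas above

# Least-squares fit of y = z*a' + b' on the standardized variable.
A = np.column_stack([z, np.ones_like(z)])
(a_p, b_p), *_ = np.linalg.lstsq(A, y, rcond=None)

# Back transformation to the original scale.
a = a_p / s
b = b_p - m / s * a_p
print(a, b)
```

The recovered slope a is close to 1 and the intercept b close to -1e9, up to rounding.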
