scipy curve_fit incorrect for large X values

To determine trends over time, I use scipy curve_fit with X values from time.time(), for example 1663847528.7147126 (1.6 billion). A linear fit sometimes produces erroneous results, and providing approximate initial p0 values doesn't help. I found the magnitude of X to be a crucial element for this error, and I wonder why?

Here is a simple snippet that shows working and non-working X offset:

import scipy.optimize

def fit_func(x, a, b):
    return a + b * x

y = list(range(5))

x = [1e8 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0]))
# Result is correct:
#   (array([-1.e+08,  1.e+00]), array([[ 0., -0.],
#          [-0.,  0.]]))

x = [1e9 + a for a in range(5)]
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.0]))
# Result is not correct:
#   OptimizeWarning: Covariance of the parameters could not be estimated
#   warnings.warn('Covariance of the parameters could not be estimated',
#   (array([-4.53788811e+08,  4.53788812e-01]), array([[inf, inf],
#          [inf, inf]]))

# Almost perfect p0 for b removes the warning, but curve_fit still doesn't work
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 0.99]))
# Result is not correct:
#   (array([-7.60846335e+10,  7.60846334e+01]), array([[-1.97051972e+19,  1.97051970e+10],
#          [ 1.97051970e+10, -1.97051968e+01]]))
   
# ...but perfect p0 works
print(scipy.optimize.curve_fit(fit_func, x, y, p0=[-x[0], 1.0]))
#(array([-1.e+09,  1.e+00]), array([[inf, inf],
#       [inf, inf]]))

As a side question: perhaps there's a more efficient method for a linear fit? Sometimes I also want a second-order polynomial fit, though.

Tested with Python 3.9.6 and SciPy 1.7.1 under Windows 10.

If you just need to compute a linear fit, curve_fit is not necessary; I would simply use the linregress function from SciPy instead:

>>> from scipy import stats

>>> y = list(range(5))

>>> x = [1e8 + a for a in range(5)]
>>> stats.linregress(x, y)
LinregressResult(slope=1.0, intercept=-100000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)

>>> x2 = [1e9 + a for a in range(5)]
>>> stats.linregress(x2, y)
LinregressResult(slope=1.0, intercept=-1000000000.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)

In general, if you need a polynomial fit I would use NumPy's polyfit.
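
For the second-order fit mentioned in the question, a minimal sketch could look like this (shifting X by its first value first, because polyfit is also sensitive to very large raw X values and may warn that the fit is poorly conditioned):

>>> import numpy as np
>>> y = list(range(5))
>>> x = [1e8 + a for a in range(5)]
>>> x0 = x[0]                                           # shift X to keep the fit well conditioned
>>> coeffs = np.polyfit([xi - x0 for xi in x], y, 2)    # coefficients [c2, c1, c0], highest degree first
>>> y_fit = np.polyval(coeffs, [xi - x0 for xi in x])   # evaluate the fitted polynomial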

Root cause

You are facing two problems:

  • Fitting procedures are scale sensitive. This means the units chosen for a specific variable (e.g. µA instead of kA) can artificially prevent an algorithm from converging properly (e.g. when one variable is several orders of magnitude bigger than another and dominates the regression);
  • Floating-point arithmetic error. When switching from 1e8 to 1e9 you simply hit the magnitude at which this kind of error becomes predominant.

The second one is very important to realize. Say you are limited to an 8-significant-digit representation; then 1 000 000 000 and 1 000 000 001 are the same number, because both are rounded to 1.0000000e9, and representing the final 1 would require more digits than are available. This is why your second example fails.
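
You can reproduce the same effect at a smaller magnitude with single-precision floats, which carry roughly 7 significant decimal digits (just an illustration of the rounding behaviour, not what curve_fit uses internally):

import numpy as np

print(np.float32(1e7) + np.float32(1) == np.float32(1e7))  # False: the +1 is still representable
print(np.float32(1e9) + np.float32(1) == np.float32(1e9))  # True: the +1 is swallowed by rounding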

Additionally, you are using a Non-Linear Least Squares algorithm to solve a Linear Least Squares problem, although this is not the cause of your problem.

You have two solutions:

  • Increase the machine precision while performing computations;
  • Normalize your problem.

I'll choose the second one as it is more generic.

Normalization

To mitigate both problems, a common solution is normalization. In your case a simple standardization is enough:

import numpy as np
import scipy.optimize

y = np.arange(5)
x = 1e9 + y

def fit_func(x, a, b):
    return a + b * x

xm = np.mean(x)         # 1000000002.0
xs = np.std(x)          # 1.4142135623730951

result = scipy.optimize.curve_fit(fit_func, (x - xm)/xs, y)

# (array([2.        , 1.41421356]),
# array([[0., 0.],
#        [0., 0.]]))

# Back transformation to the original scale
# (note: here a is the slope and b the intercept, i.e. y = a*x + b,
# whereas fit_func takes them in the opposite order):
a = result[0][1]/xs                    # slope: 1.0
b = result[0][0] - xm*result[0][1]/xs  # intercept: -1000000000.0
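
As a quick sanity check (a small sketch reusing a, b, x and y from above), the de-normalized parameters reproduce the data in the original scale, up to float rounding at the 1e9 magnitude:

import numpy as np

# y = a*x + b in the original scale; a loose tolerance is used because
# arithmetic at the 1e9 magnitude leaves rounding errors around 1e-7.
print(np.allclose(a * x + b, y, atol=1e-5))  # True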

Or the same result using the sklearn interface:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression())
])

pipe.fit(x.reshape(-1, 1), y)

pipe.named_steps["scaler"].mean_          # array([1.e+09])
pipe.named_steps["scaler"].scale_         # array([1.41421356])
pipe.named_steps["regressor"].coef_       # array([1.41421356])
pipe.named_steps["regressor"].intercept_  # 2.0
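
A convenient property of the pipeline is that predictions are returned directly in the original units, because the scaler is applied automatically (a small usage sketch with hypothetical new timestamps):

import numpy as np

x_new = np.array([1e9 + 10.0, 1e9 + 11.0]).reshape(-1, 1)
pipe.predict(x_new)   # approximately [10., 11.]: the trend evaluated in original units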

Back transformation

Indeed, when normalizing, the fit result is expressed in terms of the normalized variable. To get the required fit parameters, you just need a bit of math to convert the regressed parameters back into the original variable scale.

Simply write down and solve the transformation:

 y = x'*a' + b'
x' = (x - m)/s
 y = x*a + b

Substituting x' into the first equation gives y = (a'/s)*x + (b' - a'*m/s); identifying the coefficients with y = a*x + b gives the solution:

a = a'/s
b = b' - m/s*a'
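
Wrapped as a small helper (a sketch with names of my own choosing, reusing result, xm and xs from the curve_fit snippet above):

def denormalize(slope_n, intercept_n, m, s):
    """Convert parameters fitted on x' = (x - m)/s back to the original x scale."""
    slope = slope_n / s                         # a = a'/s
    intercept = intercept_n - m / s * slope_n   # b = b' - m/s*a'
    return slope, intercept

# fit_func(x', a, b) = a + b*x', so result[0][0] is the normalized intercept (b')
# and result[0][1] the normalized slope (a').
slope, intercept = denormalize(result[0][1], result[0][0], xm, xs)  # 1.0, -1000000000.0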
