
Exponential Regression in Python

I have a set of x and y data and I want to use exponential regression to find the curve that best fits that set of points, i.e.:

y = P1 + P2 exp(-P0 x)

I want to calculate the values of P0, P1 and P2.

I use software called "Igor Pro" that calculates the values for me, but I want a Python implementation. I used the curve_fit function, but the values that I get are nowhere near the ones calculated by Igor. Here are the data sets I have:

Set1:

x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]

Values calculated by Igor:

P1=376.91, P2=5393.9, P0=3.7776

Values calculated by curve_fit:

P1=702.45, P2=-13.33, P0=-2.6744

Set2:

x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]

Values calculated by Igor:

P1=321, P2=4848, P0=-1.94

Values calculated by curve_fit:

No optimal values found

I use curve_fit as follows:

import numpy as np
from scipy.optimize import curve_fit

popt, pcov = curve_fit(lambda t, a, b, c: a * np.exp(-b * t) + c, x, y)

where:

P1=c, P2=a and P0=b

Well, when comparing fit results, it is always important to include uncertainties in the fitted parameters. That is, when you say that the values from Igor (P1=376.91, P2=5393.9, P0=3.7776) and from curve_fit (P1=702.45, P2=-13.33, P0=-2.6744) are different, what is it that leads you to conclude those values are actually different?

Of course, in everyday conversation, 376.91 and 702.45 are very different, mostly because simply stating a value to 2 decimal places implies accuracy at approximately that scale (the distance between New York and Tokyo is 10,850 km but is not really 1,084,702,431 cm -- that might be the distance between bus stops in the two cities). But when comparing fit results, that everyday knowledge cannot be assumed, and you have to include uncertainties. I don't know if Igor will give you those. scipy's curve_fit can, but it requires some work to extract them -- a pity.
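For completeness, that extra work with curve_fit is only a couple of lines: the square roots of the diagonal of the returned covariance matrix are the 1-sigma uncertainties. A minimal sketch with the first data set (the starting guess p0 is just illustrative; with this data the reported uncertainties come out enormous, if a covariance can be estimated at all, for the reasons discussed below):

import numpy as np
from scipy.optimize import curve_fit

x = np.array([1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91])
y = np.array([476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5])

def func(t, a, b, c):
    return a * np.exp(-b * t) + c

popt, pcov = curve_fit(func, x, y, p0=(5000, 4, 375), maxfev=10000)

# 1-sigma uncertainties: square roots of the covariance-matrix diagonal
perr = np.sqrt(np.diag(pcov))
for name, val, err in zip(("a (P2)", "b (P0)", "c (P1)"), popt, perr):
    print(f"{name} = {val:.5g} +/- {err:.3g}")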

Allow me to recommend trying lmfit (disclaimer: I am an author). With that, you would set up and execute the fit like this:

import numpy as np
from lmfit import Model

x = np.array([1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91])
y = np.array([476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5])
# x = np.array([1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58])
# y = np.array([648, 618, 636, 485, 384, 639, 630, 583, 529])

# Define the function that we want to fit to the data
def func(x, offset, scale, decay):
    return offset + scale * np.exp(-decay * x)

model = Model(func)
params = model.make_params(offset=375, scale=5000, decay=4)

result = model.fit(y, params, x=x)

print(result.fit_report())

This would print out a result of

[[Model]]
    Model(func)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 49
    # data points      = 9
    # variables        = 3
    chi-square         = 72.2604167
    reduced chi-square = 12.0434028
    Akaike info crit   = 24.7474672
    Bayesian info crit = 25.3391410
    R-squared          = 0.99362489
[[Variables]]
    offset:  413.168769 +/- 17348030.9 (4198775.95%) (init = 375)
    scale:   16689.6793 +/- 1.3337e+10 (79909638.11%) (init = 5000)
    decay:   5.27555726 +/- 1016721.11 (19272297.84%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
    C(scale, decay)  = 1.000
    C(offset, decay) = 1.000
    C(offset, scale) = 1.000

indicating that the uncertainties in the parameter values are simply enormous and the correlations between all parameters are 1. This is because you have only two distinct x values, which makes it impossible to accurately determine three independent parameters.

And, note that with an uncertainty of 17 million, the values for P1 (offset) of 413 and 702 do actually agree. The problem is not that Igor and curve_fit disagree on the best value, it is that neither can determine the value with any accuracy at all.

For your other dataset, the situation is a little better, with a result:

[[Model]]
    Model(func)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 82
    # data points      = 9
    # variables        = 3
    chi-square         = 1118.19957
    reduced chi-square = 186.366596
    Akaike info crit   = 49.4002551
    Bayesian info crit = 49.9919289
    R-squared          = 0.98272310
[[Variables]]
    offset:  320.876843 +/- 42.0154403 (13.09%) (init = 375)
    scale:   4797.14487 +/- 2667.40083 (55.60%) (init = 5000)
    decay:   1.93560164 +/- 0.47764470 (24.68%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
    C(scale, decay)  = 0.995
    C(offset, decay) = 0.940
    C(offset, scale) = 0.904

The correlations are still high, but the parameters are reasonably well determined. Also, note that the best-fit values here are much closer to those you got from Igor, and probably "within the uncertainty".

And this is why one always needs to include uncertainties with the best-fit values reported from a fit.
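If you want to make that comparison programmatically rather than by reading the report, each parameter in the lmfit result carries its best-fit value and standard error. A short sketch continuing from the fit above (stderr can be None when the uncertainties cannot be estimated at all):

for name, par in result.params.items():
    err = par.stderr if par.stderr is not None else float('nan')
    print(f"{name} = {par.value:.5g} +/- {err:.3g}")

# Crude agreement check of the offset (P1) against the Igor value:
igor_offset = 376.91
offset = result.params['offset']
if offset.stderr is not None:
    print("agrees within 2 sigma:", abs(offset.value - igor_offset) < 2 * offset.stderr)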

Set 1:

x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]

y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]

[figure: plot of the Set 1 data points]

One observes that there are only two distinct values of x: 1.06 and 0.91.

On the other hand, there are three parameters to optimise: P0, P1 and P2. That is too many.

In other words, infinitely many exponential curves can be found that fit the two clusters of points. The differences between the curves come mainly from slight differences in the non-linear regression algorithms, especially in how the initial values for the iterative process are chosen.

In this particular case a simple linear regression would be unambiguous.

By comparison:

[figure: Set 1 data with the exponential curves fitted by Igor and by curve_fit]

Thus both Igor and curve_fit give an excellent fit: the points are very close to both curves. One understands that infinitely many other exponential functions would fit just as well.
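A short Python sketch of that ambiguity with the Set 1 data: for any value of P0 you pick (the three values below are arbitrary, apart from the Igor one), the two cluster means can be matched exactly by solving a 2x2 linear system for P1 and P2, so every choice of P0 yields an exponential that passes through both clusters.

import numpy as np

# The two distinct x values of Set 1 and the mean y of each cluster
x1, x2 = 1.06, 0.91
y1 = np.mean([476, 475, 476.5, 475.25, 480, 469.5])
y2 = np.mean([549.25, 548.5, 553.5])

# For any decay P0, solve  y = P1 + P2*exp(-P0*x)  exactly at the two x values
for P0 in (1.0, 3.7776, 10.0):
    A = np.array([[1.0, np.exp(-P0 * x1)],
                  [1.0, np.exp(-P0 * x2)]])
    P1, P2 = np.linalg.solve(A, np.array([y1, y2]))
    print(f"P0={P0:g}: P1={P1:.2f}, P2={P2:.2f}")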


Set 2:

x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]

y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]

The difficulty that you met might be due to the choice of the "guessed" initial values of the parameters, which are required to start the iterative process of nonlinear regression.
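That hypothesis is easy to test directly in Python before reaching for another tool: pass an explicit starting point to curve_fit through its p0 argument and see whether the iteration then converges. A minimal sketch, with starting values that are only rough guesses read off the data:

import numpy as np
from scipy.optimize import curve_fit

x = np.array([1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58])
y = np.array([648, 618, 636, 485, 384, 639, 630, 583, 529])

def func(t, a, b, c):
    return a * np.exp(-b * t) + c

# With the default start (all parameters = 1) the iteration fails for this data;
# a guess of roughly the right order of magnitude lets it converge.
popt, pcov = curve_fit(func, x, y, p0=(5000, 2, 300))
print(popt)   # should land near a ~ 4800, b ~ 1.9, c ~ 320 (cf. the lmfit result above)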

Another way to check this hypothesis is to use a method which doesn't need initial guessed values at all. The MathCad code and numerical results are shown below.

[figures: MathCad code for the integral-equation regression and the numerical results of the fit]

Don't be surprised if the values of the parameters that you get with your software are slightly different from the above values (a, b, c). The fitting criterion implicitly used by your software is probably different from the one used by mine.

[figure: Set 2 data with the fitted exponential curve (blue)]

Blue curve: the regression method is a least-mean-square-error fit with respect to a linear integral equation of which the exponential function is a solution. Ref.: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales

This non-standard method isn't iterative and doesn't require initial "guessed" values of parameters.
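For readers who would rather stay in Python, below is a rough sketch of that integral-equation approach for the model y = a + b*exp(c*x), transcribed from my reading of the recipe in the reference above: build a cumulative trapezoidal integral of y, solve one small linear system whose second unknown is the exponent c, then do an ordinary linear least-squares solve for a and b. The function name exp_regression is just illustrative; treat this as a sketch, not as the exact MathCad worksheet.

import numpy as np

def exp_regression(x, y):
    # Non-iterative fit of y = a + b*exp(c*x) following the integral-equation
    # recipe in the reference (Jacquelin). The data must be sorted by x.
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]

    # Cumulative trapezoidal integral S_k of y, with S_1 = 0
    S = np.zeros_like(y)
    S[1:] = np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

    # First 2x2 system: its second unknown is the exponent c
    dx, dy = x - x[0], y - y[0]
    M1 = np.array([[np.sum(dx * dx), np.sum(dx * S)],
                   [np.sum(dx * S),  np.sum(S * S)]])
    v1 = np.array([np.sum(dy * dx), np.sum(dy * S)])
    c = np.linalg.solve(M1, v1)[1]

    # Second 2x2 system: ordinary linear least squares for a and b with c fixed
    theta = np.exp(c * x)
    M2 = np.array([[len(x), np.sum(theta)],
                   [np.sum(theta), np.sum(theta * theta)]])
    v2 = np.array([np.sum(y), np.sum(y * theta)])
    a, b = np.linalg.solve(M2, v2)
    return a, b, c

x = [1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [648, 618, 636, 485, 384, 639, 630, 583, 529]
print(exp_regression(x, y))   # prints (a, b, c) for y = a + b*exp(c*x)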
