
How to get confidence intervals from curve_fit

My question involves statistics and Python, and I am a beginner in both. I am running a simulation, and for each value of the independent variable (X) I produce 1000 values of the dependent variable (Y). What I have done is calculate the average of Y for each value of X and fit these averages using scipy.optimize.curve_fit. The curve fits nicely, but I also want to draw the confidence intervals. I am not sure whether what I am doing is correct, or whether what I want to do can be done at all, but my question is how I can get the confidence intervals from the covariance matrix produced by curve_fit. The code reads the averages from files first and then simply uses curve_fit.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


def readTDvsTx(L, B, P, fileformat):
    # L should be '_Fixed_' or '_'
    TD = []
    infile = open(fileformat.format(L, B, P), 'r')
    infile.readline()  # To remove header
    for line in infile:
        l = line.split()  # each line contains TxR followed by CD followed by TD
        if eval(l[0]) >= 70 and eval(l[0]) <= 190:
            td = eval(l[2])
            TD.append(td)
    infile.close()
    tdArray = np.array(TD)

    return tdArray


def rec(x, a, b):
    return a * (1 / (x**2)) + b



fileformat = 'Densities_file{}BS{}_PRNTS{}.txt'
txR = np.array(range(70, 200, 20))
parents = np.array(range(1,6))
disc_p1 = readTDvsTx('_Fixed_', 5, 1, fileformat)


popt, pcov = curve_fit(rec, txR, disc_p1)


plt.plot(txR, rec(txR, popt[0], popt[1]), 'r-')
plt.plot(txR, disc_p1, '.')

print(popt)
plt.show()

And here is the resulting fit: [plot of the fitted curve over the averaged data points]

Here's a quick and wrong answer: you can approximate the errors on your a and b parameters as the square roots of the diagonal of the covariance matrix: np.sqrt(np.diagonal(pcov)). The parameter uncertainties can then be used to draw the confidence intervals.
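As a minimal sketch of that quick (and, as explained below, wrong) approach, reusing popt, pcov, txR, disc_p1 and the rec model from the question's code (x_fine is just a helper name for a dense x grid):

perr = np.sqrt(np.diagonal(pcov))  # 1-sigma uncertainties of a and b

# naive band: evaluate the model at the parameters shifted by +/- 1 sigma
x_fine = np.linspace(70, 190, 200)
upper = rec(x_fine, *(popt + perr))
lower = rec(x_fine, *(popt - perr))

plt.plot(x_fine, rec(x_fine, *popt), 'r-')
plt.fill_between(x_fine, lower, upper, color='r', alpha=0.2)
plt.plot(txR, disc_p1, '.')
plt.show()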

That answer is wrong because, before you fit your data to a model, you need an estimate of the errors on your averaged disc_p1 points. By averaging, you have lost the information about the scatter of the population, leading curve_fit to believe that the y-points you feed it are absolute and undisputable. This might cause an underestimation of your parameter errors.

For an estimate of the uncertainties of your averaged Y values, you need to estimate their dispersion and pass it along to curve_fit while telling it that your errors are absolute. Below is an example of how to do this for a random dataset, where each of your points consists of 1000 samples drawn from a normal distribution.

from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np

# model function
func = lambda x, a, b: a * (1 / (x**2)) + b 

# approximating OP points
n_ypoints = 7 
x_data = np.linspace(70, 190, n_ypoints)

# approximating the original scatter in Y-data
n_nested_points = 1000
point_errors = 50
y_data = [func(x, 4e6, -100) + np.random.normal(x, point_errors,
          n_nested_points) for x in x_data]

# averages and dispersion of data
y_means = np.array(y_data).mean(axis = 1)
y_spread = np.array(y_data).std(axis = 1)

best_fit_ab, covar = curve_fit(func, x_data, y_means,
                               sigma = y_spread,
                               absolute_sigma = True)
sigma_ab = np.sqrt(np.diagonal(covar))

from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

# plotting the unaveraged data
flier_kwargs = dict(marker = 'o', markerfacecolor = 'silver',
                    markersize = 3, alpha=0.7)
line_kwargs = dict(color = 'k', linewidth = 1)
bp = plt.boxplot(y_data, positions = x_data,
                 capprops = line_kwargs,
                 boxprops = line_kwargs,
                 whiskerprops = line_kwargs,
                 medianprops = line_kwargs,
                 flierprops = flier_kwargs,
                 widths = 5,
                 manage_ticks = False)
# plotting the averaged data with calculated dispersion
#plt.scatter(x_data, y_means, facecolor = 'silver', alpha = 1)
#plt.errorbar(x_data, y_means, y_spread, fmt = 'none', ecolor = 'black')

# plotting the model
hires_x = np.linspace(50, 190, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color = 'black', alpha = 0.15)
plt.text(140, 800, text_res)
plt.xlim(40, 200)
plt.ylim(0, 1000)
plt.show()

[plot: weighted least-squares fit with absolute sigma, with the resulting confidence band]
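As a side note, the band above comes from evaluating the model at a ± sigma_a and b ± sigma_b, which ignores the correlation between the two parameters. A sketch of a band that propagates the full covariance matrix instead, using correlated_values from the uncertainties package (reusing best_fit_ab, covar and hires_x from the code above):

from uncertainties import correlated_values, unumpy as unp

# build correlated parameter values from the best-fit vector and covariance matrix
a_u, b_u = correlated_values(best_fit_ab, covar)

# evaluate the model point by point; the uncertainties propagate automatically
y_band = np.array([a_u * (1 / x**2) + b_u for x in hires_x])
y_nom = unp.nominal_values(y_band)
y_err = unp.std_devs(y_band)

plt.fill_between(hires_x, y_nom - y_err, y_nom + y_err,
                 color='black', alpha=0.15)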

Edit: If you are not considering the intrinsic errors on the data points, you are probably fine with using the "quick and wrong" case I mentioned before. The square roots of the diagonal entries of the covariance matrix can then be used to calculate your confidence intervals. However, note that the confidence intervals have shrunk now that we've dropped the uncertainties:

from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np

func = lambda x, a, b: a * (1 / (x**2)) + b

n_ypoints = 7
x_data = np.linspace(70, 190, n_ypoints)

y_data = np.array([786.31, 487.27, 341.78, 265.49,
                    224.76, 208.04, 200.22])
best_fit_ab, covar = curve_fit(func, x_data, y_data)
sigma_ab = np.sqrt(np.diagonal(covar))

# an easy way to properly format parameter errors
from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

plt.scatter(x_data, y_data, facecolor = 'silver',
            edgecolor = 'k', s = 10, alpha = 1)

# plotting the model
hires_x = np.linspace(50, 200, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color = 'black', alpha = 0.15)
plt.text(140, 630, text_res)
plt.xlim(60, 200)
plt.ylim(0, 800)
plt.show()

[plot: fit without sigma, with the narrower confidence band]
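One more detail worth keeping in mind: the shaded regions in both plots are ±1 sigma bands, i.e. roughly 68% intervals. For a different confidence level you can scale sigma_ab by the corresponding quantile before building the bounds; a sketch using a normal-approximation quantile from scipy.stats:

from scipy import stats

confidence = 0.95
# ~1.96 for 95%; with only a few points, a Student-t quantile with
# len(x_data) - len(best_fit_ab) degrees of freedom would be more careful
z = stats.norm.ppf(0.5 + confidence / 2)

bound_upper = func(hires_x, *(best_fit_ab + z * sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - z * sigma_ab))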

If you're unsure whether to include the absolute errors, or how to estimate them in your case, you're better off asking for advice at Cross Validated, as Stack Overflow is mainly for discussion of implementations of regression methods and not for discussion of the underlying statistics.
