
Python Distribution Fitting with Sum of Square Error (SSE)

I am trying to find an optimal distribution curve fit to my data, which consists of:

y-axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 
          0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 
          0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

The y-axis values are probabilities of an event occurring in the x-axis time bins:

x-axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
          12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 
          22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 
          32.0, 33.0, 34.0]

I am doing this in Python, following the example provided in "Fitting empirical distribution to theoretical ones with Scipy (Python)?"

Specifically, I am attempting to recreate the part called 'Distribution Fitting with Sum of Square Error (SSE)', where you run through the different distributions to find the right fit to the data.

How can I modify that example in order to make it work on my data inputs?

Updated version based on Bill's response, but now I am trying to plot the fitted curve against the data and seeing something off:

%matplotlib inline
import matplotlib  # needed for the matplotlib.rcParams and matplotlib.style calls below
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import numpy as np
from scipy.stats import gamma, lognorm, loglaplace
from scipy.optimize import curve_fit

x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

def f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

result, pcov = curve_fit(f, x_axis, y_axis)

# get curve shape, location, scale
shape = result[:-2]
loc = result[-2]
scale = result[-1]

# construct the curve
x = np.linspace(0, 36, 100)
y = f(x, *result)

plt.bar(x_axis, y_axis, alpha=0.75)  # 'width' was never defined; the default bar width is used instead
plt.plot(x, y, c='g')

Your situation is not the same as the one treated in the question you cited. You have both the ordinates and the abscissae of the data points, rather than the usual iid sample. I would suggest that you use scipy curve_fit. Here's a sample.

x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

## y_axis values must be normalised
sum_ys = sum(y_axis)
y_axis = [_/sum_ys for _ in y_axis]
print (sum(y_axis))

from scipy.stats import gamma, norm
from scipy.optimize import curve_fit

def gamma_f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

def norm_f(x, loc, scale):
    return norm.pdf(x, loc, scale)

fitting = norm_f

result = curve_fit(fitting, x_axis, y_axis)
print (result)

import matplotlib.pyplot as plt

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0,35,0,.5])
plt.show()

This version shows how to do one plot, for the normal fit to the data. (The gamma provides a poor fit.) Only two parameters are needed for the normal. In general you would need only the first part of the output results: the estimates of the parameters (shape, location and scale).

(array([  2.3352639 ,  -3.08105104,  10.15024823]), array([[   5954.86532869,  -27818.92220973,  -19675.22421994],
       [ -27818.92220973,  133161.76500251,   90741.43608615],
       [ -19675.22421994,   90741.43608615,   66054.79087992]]))
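As a minimal sketch of how the two parts of such an output might be separated: curve_fit returns the parameter estimates first and their covariance matrix second, and the square roots of the covariance diagonal give approximate standard errors. The starting guess p0 (taken from the weighted mean and spread of the bins) and the names popt, pcov and perr are assumptions made for illustration, not part of the code above.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

x_axis = np.arange(1.0, 35.0)  # bin positions 1.0 ... 34.0
y_axis = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
                   0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
                   0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0], dtype=float)
y_axis = y_axis / y_axis.sum()  # normalise so the heights sum to one

def norm_f(x, loc, scale):
    return norm.pdf(x, loc, scale)

# Assumed starting guess: weighted mean and standard deviation of the bins
mean0 = np.sum(x_axis * y_axis)
std0 = np.sqrt(np.sum(y_axis * (x_axis - mean0) ** 2))

# First element: parameter estimates; second element: their covariance matrix
popt, pcov = curve_fit(norm_f, x_axis, y_axis, p0=(mean0, std0))
perr = np.sqrt(np.diag(pcov))  # approximate one-sigma errors on loc and scale

print("loc, scale:", popt)
print("standard errors:", perr)
```

Only popt is needed to draw the fitted curve; pcov is useful mainly as a rough indication of how well the parameters are determined.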

Notice that the pdf of the gamma distribution is also available in scipy, as are the others that you need, I think, which saves you the work of coding them.

The most important thing I omitted from the first code was the need to normalise the y-values, that is, to make them sum to one, since they should approximate histogram heights.
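Putting these pieces together, one possible way to compare several candidate distributions by their sum of squared errors is sketched below. The choice of candidates, the starting guesses and the bounds (which keep shape and scale positive) are all assumptions made for illustration; with unit-width bins the normalised heights can be compared directly with the pdf values.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

x = np.arange(1.0, 35.0)  # bin positions 1.0 ... 34.0
y = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
              0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
              0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0], dtype=float)
y = y / y.sum()  # normalise so the heights sum to one

# Rough moment estimates, used only to build starting guesses (an assumption)
mean0 = np.sum(x * y)
std0 = np.sqrt(np.sum(y * (x - mean0) ** 2))

# name: (distribution, starting guess, lower bounds, upper bounds)
candidates = {
    "norm": (stats.norm, [mean0, std0],
             [-np.inf, 1e-6], [np.inf, np.inf]),
    "gamma": (stats.gamma, [(mean0 / std0) ** 2, 0.0, std0 ** 2 / mean0],
              [1e-6, -np.inf, 1e-6], [np.inf, np.inf, np.inf]),
    "lognorm": (stats.lognorm, [0.5, 0.0, mean0],
                [1e-6, -np.inf, 1e-6], [np.inf, np.inf, np.inf]),
}

results = {}
for name, (dist, p0, lower, upper) in candidates.items():
    def pdf(x_values, *params, dist=dist):
        return dist.pdf(x_values, *params)
    try:
        popt, _ = curve_fit(pdf, x, y, p0=p0, bounds=(lower, upper))
    except (RuntimeError, ValueError):
        continue  # skip candidates whose fit does not converge
    sse = float(np.sum((y - pdf(x, *popt)) ** 2))
    results[name] = (sse, popt)

# Candidate with the smallest sum of squared errors
best = min(results, key=lambda n: results[n][0])
for name, (sse, popt) in sorted(results.items(), key=lambda item: item[1][0]):
    print(f"{name:8s} SSE={sse:.5f} params={np.round(popt, 3)}")
print("best:", best)
```

The ranking depends on the starting guesses and on which candidates are included, so it is worth plotting the fitted curves against the data before trusting the lowest SSE.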

I tried your example using the OpenTURNS platform. Here is what I got.

I started with the same data as you, after importing openturns and openturns.viewer.View for plotting:

    import openturns as ot
    from openturns.viewer import View

    x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
          12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 
          22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 
          32.0, 33.0, 34.0]

    y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 
          0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 
          0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

First step: we can define the corresponding distribution:

    distribution = ot.UserDefined(ot.Sample([[s] for s in x_axis]), y_axis)
    graph = distribution.drawPDF()
    graph.setColors(["black"])
    graph.setLegends(["your input"])

At this stage, calling View(graph) displays the PDF of your input data.

Second step: we can draw a sample from the obtained distribution:

    sample = distribution.getSample(10000)

This sample can then be used to fit any kind of distribution. I tried the WeibullMin and Gamma distributions:

    # WeibullMin Factory
    distribution2 = ot.WeibullMinFactory().build(sample)
    print(distribution2)
    # >>> WeibullMin(beta = 8.83969, alpha = 1.48142, gamma = 4.76832)
    graph2 = distribution2.drawPDF()
    graph2.setLegends(["Best WeibullMin"])

    # Gamma Factory
    distribution3 = ot.GammaFactory().build(sample)
    print(distribution3)
    # >>> Gamma(k = 2.08142, lambda = 0.25157, gamma = 4.9995)
    graph3 = distribution3.drawPDF()
    graph3.setLegends(["Best Gamma"])
    graph3.setColors(["blue"])

    # plotting all the results
    graph.add(graph2)
    graph.add(graph3)
    View(graph)
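The same resampling idea can be sketched without OpenTURNS, using only NumPy and SciPy. This is an illustration under stated assumptions, not a reproduction of the results above: the bins are assumed to have unit width and be centred on the x values, the sample is jittered uniformly within each bin, and the location is fixed at zero (floc=0) to keep the maximum-likelihood fit stable, so the fitted parameters will differ from the OpenTURNS output.

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 35.0)  # bin positions 1.0 ... 34.0
y = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
              0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
              0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0], dtype=float)
weights = y / y.sum()  # probabilities used to draw the bins

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Draw bin positions in proportion to their weights, then spread each draw
# uniformly over its (assumed) unit-width bin
sample = rng.choice(x, size=n, p=weights) + rng.uniform(-0.5, 0.5, size=n)

# Maximum-likelihood gamma fit with the location fixed at 0 (an assumption)
a, loc, scale = stats.gamma.fit(sample, floc=0)
print(f"shape={a:.3f}, loc={loc:.3f}, scale={scale:.3f}")
```

Once a synthetic sample exists, any distribution in scipy.stats that provides a fit method can be tried in the same way.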


I think this is the best and simplest way to calculate the sum of squared errors:

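import numpy as np

# Define the function
def SSE(y_true, y_pred):
    # Convert to arrays so that plain Python lists are accepted too
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    print(sse)
    return sse

# Now call the function with the observed and fitted values to get the result
SSE(y_true, y_pred)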
