Python Distribution Fitting with Sum of Square Error (SSE)

Question

I am trying to find an optimal distribution curve fit to my data consisting of

y-axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 
          0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 
          0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

y-axis are probabilities of an event occurring in x-axis time bins:

x-axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
          12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 
          22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 
          32.0, 33.0, 34.0]

I am doing this in python following example provided on Fitting empirical distribution to theoretical ones with Scipy (Python)?

Specifically I am attempting to recreate the part called 'Distribution Fitting with Sum of Square Error (SSE)', where you run through the different distributions to find the right fit to the data.

How can I modify that example in order to make this work on my data inputs? answered

Update version based on Bill's response, but now trying to plot the fitted curve against the data and seeing something off:

%matplotlib inline
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import numpy as np
from scipy.stats import gamma, lognorm, loglaplace
from scipy.optimize import curve_fit

x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

def f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

result, pcov = curve_fit(f, x_axis, y_axis)

# get curve shape, location, scale
shape = result[:-2]
loc = result[-2]
scale = result[-1]

# construct the curve
x = np.linspace(0, 36, 100)
y = f(x, *result)

plt.bar(x_axis, y_axis, width, alpha=0.75)
plt.plot(x, y, c='g')

Answer 1

Your situation is not the same as that in the one treated in the question you cited. You have both the ordinates and the abscissae of the data points, rather than the usual iid sample. I would suggest that you use scipy curve_fit . Here's a sample.

x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

## y_axis values must be normalised
sum_ys = sum(y_axis)
y_axis = [_/sum_ys for _ in y_axis]
print (sum(y_axis))

from scipy.stats import gamma, norm
from scipy.optimize import curve_fit

def gamma_f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

def norm_f(x, loc, scale):
    return norm.pdf(x, loc, scale)

fitting = norm_f

result = curve_fit(fitting, x_axis, y_axis)
print (result)

import matplotlib.pyplot as plt

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0,35,0,.5])
plt.show()

This version shows how to do one plot, for the normal fit to the data. (The gamma provides a poor fit.) Only two parameters are needed for the normal. In general you would need only the first part of the output results, the estimates of the parameters, shape, location and scale.

(array([  2.3352639 ,  -3.08105104,  10.15024823]), array([[   5954.86532869,  -27818.92220973,  -19675.22421994],
       [ -27818.92220973,  133161.76500251,   90741.43608615],
       [ -19675.22421994,   90741.43608615,   66054.79087992]]))

Notice that the pdf of the gamma distribution is also available in scipy, as are the others that you need, I think, saving you the work of coding them.

The most important thing I omitted from the first code was the need to normalise the y-values, that is, to make them sum to one, since they should approximate histogram heights.

Answer 2

I tried your example using OpenTURNS platform Here what I got.

I started with the same data as you after importing openturns and openturs.viewer.View for plotting

    import openturns as ot
    from openturns.viewer import View

    x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
          12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 
          22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 
          32.0, 33.0, 34.0]

    y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 
          0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 
          0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

First step: we can define the corresponding distribution

    distribution = ot.UserDefined(ot.Sample([[s] for s in x_axis]), y_axis)
    graph = distribution.drawPDF()
    graph.setColors(["black"])
    graph.setLegends(["your input"])

at this stage, if you View(graph) you would get:

Second step: we can derive a sample from the obtained distibution

    sample = distribution.getSample(10000)

this sample will be used to fit any kind of distributions. I tried with WeibullMin and Gamma distributions

    # WeibullMin Factory
    distribution2 = ot.WeibullMinFactory().build(sample)
    print(distribution2)
    graph2 = distribution2.drawPDF() ; graph2.setLegends(["Best WeibullMin"])
    >>> WeibullMin(beta = 8.83969, alpha = 1.48142, gamma = 4.76832)

    # Gamma Factory
    distribution3 = ot.GammaFactory().build(sample)
    print(distribution3)
    >>> Gamma(k = 2.08142, lambda = 0.25157, gamma = 4.9995)
    graph3 = distribution3.drawPDF() ; graph3.setLegends(["Best Gamma"]) ; 
    graph3.setColors(["blue"])

    # plotting all the results
    graph.add(graph2) ; graph.add(graph3)
    View(graph)

Answer 3

I think its the best and simple way to calculate the sum of square error:

#write the function

def SSE(y_true, y_pred):

     sse= np.sum((y_true-y_pred)**2)

     print(sse)

#now call the function and get results
SSE(y_true, y_pred)

Python Distribution Fitting with Sum of Square Error (SSE)

Question

3 answers

solution1
0 ACCPTED 2017-04-01 19:28:29

solution2
0 2020-10-25 19:36:02

solution3
0 2021-06-15 08:10:56

Python Distribution Fitting with Sum of Square Error (SSE)

Question

3 answers

solution1 0 ACCPTED 2017-04-01 19:28:29

solution2 0 2020-10-25 19:36:02

solution3 0 2021-06-15 08:10:56

solution1
0 ACCPTED 2017-04-01 19:28:29

solution2
0 2020-10-25 19:36:02

solution3
0 2021-06-15 08:10:56