Fitting a Lognormal Distribution in Python using CURVE_FIT

I have a hypothetical function y of x and am trying to fit a lognormal distribution curve that best matches the data. I am using the curve_fit function and was able to fit a normal distribution, but the resulting curve does not look optimal.

Below are the given y and x data points, where y = f(x).

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]

The y values are probabilities of an event occurring in the x-axis time bins:

x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

I was able to get a better fit on my data using Excel and a lognormal approach. When I attempt to use a lognormal in Python, the fit does not work, so I must be doing something wrong.

Below is the code I have for fitting a normal distribution, which seems to be the only one that I can fit in Python (hard to believe):

# fitting distribution on top of Savitzky-Golay
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats
import numpy as np
from scipy.stats import norm, gamma, lognorm, halflogistic, foldcauchy
from scipy.optimize import curve_fit

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# results from savgol
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,     13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]

## y_axis values must be normalised
sum_ys = sum(y_axis)

# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]

# def gamma_f(x, a, loc, scale):
#     return gamma.pdf(x, a, loc, scale)

def norm_f(x, loc, scale):
#     print 'loc: ', loc, 'scale: ', scale, "\n"
    return norm.pdf(x, loc, scale)

fitting = norm_f

# param_bounds = ([-np.inf,0,-np.inf],[np.inf,2,np.inf])
result = curve_fit(fitting, x_axis, y_axis)
result_mod = result

# mod scale
# results_adj  = [result_mod[0][0]*.75, result_mod[0][1]*.85]

plt.plot(x_axis, y_axis, 'ro')
plt.bar(x_axis, y_axis, 1, alpha=0.75)
plt.plot(x_axis, [fitting(_, *result[0]) for _ in x_axis], 'b-')
plt.axis([0,35,0,.1])

# convert back into probability
y_norm_fit = [fitting(_, *result[0]) for _ in x_axis]
y_fit = [_*sum_ys for _ in y_norm_fit]
print(list(y_fit))

plt.show()

I am trying to get answers to two questions:

  1. Is this the best fit I will get from a normal distribution curve? How can I improve the fit?

Normal distribution result: [image]

  2. How can I fit a lognormal distribution to this data, or is there a better distribution that I can use?

I was playing around with a lognormal distribution curve, adjusting mu and sigma, and it looks like a better fit is possible. I don't understand what I am doing wrong when trying to get similar results in Python.

Note that if a lognormal curve is correct and you take logs of both variables, you should have a quadratic relationship. Even if that's not a suitable scale for a final model (because of variance effects: if your variance is near constant on the original scale, fitting on the log scale will overweight the small values), it should at least give a good starting point for a nonlinear fit.
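To see why the relationship is quadratic, one can write the curve as a lognormal density times a scale factor and take logs. The symbols A, \mu and \sigma are notation assumed here for illustration; they do not appear in the question:

```latex
\begin{aligned}
y &= \frac{A}{x\,\sigma\sqrt{2\pi}}
     \exp\!\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right),\\[4pt]
\ln y &= -\frac{1}{2\sigma^2}\,(\ln x)^2
        + \left(\frac{\mu}{\sigma^2}-1\right)\ln x
        + \ln A - \tfrac12\ln\!\left(2\pi\sigma^2\right) - \frac{\mu^2}{2\sigma^2}.
\end{aligned}
```

So if a quadratic fit of ln y on ln x has leading coefficients c_2 and c_1, then \sigma^2 = -1/(2 c_2) and \mu = (c_1 + 1)\,\sigma^2.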

Indeed, aside from the first two points, this looks fairly good:

[Figure: log-log scale plot showing a near-quadratic relationship]

A quadratic fit to the solid points would describe that data quite well and should give suitable starting values if you then want to do a nonlinear fit.

(If error in x is at all possible, the lack of fit at the lowest x may be as much an issue of error in x as of error in y.)
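The starting-value idea above can be sketched as follows. This is only an illustration under stated assumptions: the function name scaled_lognormal and the amplitude parameter amp are introduced here, and "the solid points" is taken to mean everything except the first two points.

```python
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(1.0, 35.0)
y = np.array([0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05])

def scaled_lognormal(x, amp, mu, sigma):
    """Lognormal density multiplied by a free amplitude (hypothetical helper)."""
    return amp / (x * sigma * np.sqrt(2.0 * np.pi)) * np.exp(
        -(np.log(x) - mu) ** 2 / (2.0 * sigma ** 2))

# Step 1: quadratic fit of log(y) on log(x), skipping the first two points
# (assumption: these are the points excluded in the plot).
c2, c1, c0 = np.polyfit(np.log(x[2:]), np.log(y[2:]), 2)

# Step 2: convert the quadratic coefficients into lognormal parameters.
sigma0 = np.sqrt(-1.0 / (2.0 * c2))
mu0 = (c1 + 1.0) * sigma0 ** 2
amp0 = np.exp(c0 + 0.5 * np.log(2.0 * np.pi * sigma0 ** 2)
              + mu0 ** 2 / (2.0 * sigma0 ** 2))
p0 = [amp0, mu0, sigma0]

# Step 3: nonlinear least squares on the original scale, started from p0.
popt, pcov = curve_fit(
    scaled_lognormal, x, y, p0=p0,
    bounds=([0.0, -np.inf, 1e-6], [np.inf, np.inf, np.inf]))

def sse(params):
    """Sum of squared residuals on the original scale."""
    return float(np.sum((scaled_lognormal(x, *params) - y) ** 2))

print("start:", p0, "SSE:", sse(p0))
print("fit:  ", popt, "SSE:", sse(popt))
```

The bounds only keep the amplitude and sigma positive during the search; whether to include the first two points in either step is a modelling choice.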

Incidentally, that plot seems to hint that a gamma curve may fit a little better overall than a lognormal one (in particular if you don't want to reduce the impact of those first two points relative to points 4-6). A good initial fit for that can be had by regressing log(y) on x and log(x):

[Figure: fit of a gamma curve on the log-log scale]

The scaled gamma density is g = c x^(a-1) exp(-bx). Taking logs, you get log(g) = log(c) + (a-1) log(x) - bx = b0 + b1 log(x) + b2 x, so supplying log(x) and x to a linear regression routine will fit that. The same caveats about variance effects apply (so it might be best as a starting point for a nonlinear least squares fit if your relative error in y isn't nearly constant).
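A minimal sketch of that regression, using ordinary least squares from NumPy on all 34 points. The variable names are chosen here for illustration, and the result is meant as a starting point rather than a final fit:

```python
import numpy as np

x = np.arange(1.0, 35.0)
y = np.array([0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05])

# Design matrix for log(g) = b0 + b1*log(x) + b2*x
X = np.column_stack([np.ones_like(x), np.log(x), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
b0, b1, b2 = coef

# Map the regression coefficients back to g = c * x**(a-1) * exp(-b*x)
a = b1 + 1.0      # gamma shape
b = -b2           # gamma rate
c = np.exp(b0)    # overall scale factor

g = c * x ** (a - 1.0) * np.exp(-b * x)
print("a =", a, "b =", b, "c =", c)
```

The values a, b and c can then be passed as p0 to a nonlinear least squares routine such as scipy.optimize.curve_fit.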

Actually, a Gamma distribution might be a good fit, as @Glen_b proposed. I'm using the second definition, with shape \alpha and rate \beta.

NB: the trick I use for a quick fit is to compute the mean and variance; for a typical two-parameter distribution that is enough to recover the parameters and get a quick idea of whether the fit is good or not.

[image]

Code

import math

import matplotlib.pyplot as plt

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

## y_axis values must be normalised
sum_ys = sum(y_axis)

# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]

m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * x_axis[k]

v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - m)
    v += y_axis[k] * t * t

print(m, v)

b = m/v
a = m * b

print(a, b)

z = []
for k in range(0, len(x_axis)):
    q = b**a * x_axis[k]**(a-1.0) * math.exp( - b*x_axis[k] ) / math.gamma(a)
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
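Since the question asks about a lognormal, the same mean-and-variance trick can be applied to it as well. The sketch below is not part of the original answer; it uses the standard lognormal moment relations, mean = exp(mu + sigma^2/2) and variance = (exp(sigma^2) - 1) exp(2 mu + sigma^2), and the variable names are chosen here:

```python
import numpy as np

x = np.arange(1.0, 35.0)
y = np.array([0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05])

w = y / y.sum()                    # normalise to 1, as in the answer
m = np.sum(w * x)                  # weighted mean
v = np.sum(w * (x - m) ** 2)       # weighted variance

# Invert the lognormal moment relations for mu and sigma
sigma2 = np.log(1.0 + v / m ** 2)
mu = np.log(m) - 0.5 * sigma2
sigma = np.sqrt(sigma2)

# Lognormal density evaluated at the bin positions
z = np.exp(-(np.log(x) - mu) ** 2 / (2.0 * sigma2)) / (x * sigma * np.sqrt(2.0 * np.pi))
print("mu =", mu, "sigma =", sigma)
```

Plotting z against the normalised y values, as in the gamma code above, gives a quick visual check of whether the lognormal shape is plausible.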

A discrete distribution might look better; your x are all integers, after all. You have a distribution with variance about 3 times higher than the mean, and asymmetric, so most likely something like a Negative Binomial might work quite well. Here is a quick fit:

[image]

r is a bit above 6, so you might want to move to a distribution with real r, the Polya distribution.

Code

from scipy.special import comb

import matplotlib.pyplot as plt

y_axis = [0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05]
x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]

## y_axis values must be normalised
sum_ys = sum(y_axis)

# normalize to 1
y_axis = [_/sum_ys for _ in y_axis]

s = 1.0 # shift by 1 to have them all at 0
m = 0.0
for k in range(0, len(x_axis)):
    m += y_axis[k] * (x_axis[k] - s)

v = 0.0
for k in range(0, len(x_axis)):
    t = (x_axis[k] - s - m)
    v += y_axis[k] * t * t

print(m, v)

p = 1.0 - m/v
r = int(m*(1.0 - p) / p)

print(p, r)

z = []
for k in range(0, len(x_axis)):
    q = comb(k + r - 1, k) * (1.0 - p)**r * p**k
    z.append(q)

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, z, 'b*')
plt.axis([0, 35, 0, .1])
plt.show()
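The code above truncates r to an integer. A possible way to keep r real-valued, as suggested for the Polya distribution, is scipy.stats.nbinom, which accepts a non-integer first shape parameter. This is a sketch, not the answer's own code; note that SciPy's success probability (called P here, a name chosen for this example) equals 1 - p in the answer's parameterisation:

```python
import numpy as np
from scipy.stats import nbinom

x = np.arange(1.0, 35.0)
y = np.array([0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05])

w = y / y.sum()                    # normalise to 1
k = x - 1.0                        # shift by 1 so the support starts at 0
m = np.sum(w * k)                  # weighted mean of the shifted values
v = np.sum(w * (k - m) ** 2)       # weighted variance

# scipy: pmf(k) = C(k+r-1, r-1) * P**r * (1-P)**k, mean = r(1-P)/P, var = r(1-P)/P**2
P = m / v                          # equals 1 - p in the answer's notation
r = m * P / (1.0 - P)              # real-valued r, no int() truncation

z = nbinom.pmf(k, r, P)
print("P =", P, "r =", r)
```

Matching moments this way requires the variance to exceed the mean, which the answer states is the case for this data.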

In Python, I explained a trick here of how to fit a LogNormal very simply using the OpenTURNS library:

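import numpy as np
import openturns as ot

# N (number of synthetic observations) is not defined in the original answer;
# 10000 is an assumed value
N = 10000
n_times = [int(y_axis[i] * N) for i in range(len(y_axis))]
S = np.repeat(x_axis, n_times)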

sample = ot.Sample([[p] for p in S])
fitdist = ot.LogNormalFactory().buildAsLogNormal(sample)

That's it!

print(fitdist) will show you >>> LogNormal(muLog = 2.92142, sigmaLog = 0.305, gamma = -6.24996)

and the fitting seems good:

import matplotlib.pyplot as plt

plt.hist(S, density =True, color = 'grey', bins = 34, alpha = 0.5)
plt.scatter(x_axis, y_axis, color= 'red')
plt.plot(x_axis, fitdist.computePDF(ot.Sample([[p] for p in x_axis])), color = 'black')
plt.show()

[image]
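For readers without OpenTURNS, the same idea of expanding the histogram into a sample can be tried with scipy.stats.lognorm. This is a sketch under assumptions, not an equivalent of the code above: N = 10000 is an assumed sample size, y is normalised before expansion, and the location is fixed at 0 (a two-parameter lognormal), whereas the OpenTURNS result also estimates a shift gamma.

```python
import numpy as np
from scipy.stats import lognorm

x = np.arange(1.0, 35.0)
y = np.array([0.00032425299473065838, 0.00063714106162861229, 0.00027009331177605913, 0.00096672396877715144, 0.002388766809835889, 0.0042233337680543182, 0.0053072824980722137, 0.0061291327849408699, 0.0064555344006149871, 0.0065601228278316746, 0.0052574034010282218, 0.0057924488798939255, 0.0048154093097913355, 0.0048619350036057446, 0.0048154093097913355, 0.0045114840997070331, 0.0034906838696562147, 0.0040069911024866456, 0.0027766995669134334, 0.0016595801819374015, 0.0012182145074882836, 0.00098231827111984341, 0.00098231827111984363, 0.0012863691645616997, 0.0012395921040321833, 0.00093554121059032721, 0.0012629806342969417, 0.0010057068013846018, 0.0006081017868837127, 0.00032743942370661445, 4.6777060529516312e-05, 7.0165590794274467e-05, 7.0165590794274467e-05, 4.6777060529516745e-05])

N = 10000                              # assumed sample size (hypothetical)
w = y / y.sum()                        # normalise to 1
counts = (w * N).astype(int)           # how many times each bin value is repeated
S = np.repeat(x, counts)               # synthetic sample reproducing the histogram

# Maximum-likelihood fit with the location fixed at 0
shape, loc, scale = lognorm.fit(S, floc=0)
mu_log, sigma_log = np.log(scale), shape
print("muLog =", mu_log, "sigmaLog =", sigma_log)

# Fitted density at the bin positions, comparable to the normalised y values
pdf = lognorm.pdf(x, shape, loc=loc, scale=scale)
```

Because the location is fixed here, the fitted muLog and sigmaLog should not be expected to match the three-parameter values printed by OpenTURNS.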
