
Fitting a distribution given the histogram using scipy

I would like to fit a distribution to my data using scipy (in my case, weibull_min). Is it possible to do this given only the histogram, and not the data points? In my case, because the histogram has integer bins of size 1, I know that I can extrapolate my data in the following way:

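from functools import reduce  # reduce is not a builtin in Python 3
import numpy as np
orig_hist = np.array([10, 5, 3, 2, 1])

# repeat each bin index i as many times as its count
ext_data = reduce(lambda x,y: x+y, [[i]*x for i, x in enumerate(orig_hist)])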

In this case, ext_data would hold this:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

And building the histogram using:

np.histogram(ext_data, bins=5)

would return counts equal to orig_hist.

Yet, given that I already have the histogram built, I would like to avoid extrapolating the data and use orig_hist to fit the distribution, but I don't know if it is possible to use it directly in the fitting procedure. Additionally, is there a numpy function that can be used to perform something similar to the extrapolation I showed?
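For the expansion step specifically, one NumPy candidate is np.repeat, which repeats each element of an array a given number of times. A minimal sketch, assuming integer counts and unit-width bins starting at 0 as in the example above (the name ext_data_np is only illustrative):

```python
import numpy as np

orig_hist = np.array([10, 5, 3, 2, 1])

# Bin index i is repeated orig_hist[i] times; the counts must be integers.
ext_data_np = np.repeat(np.arange(len(orig_hist)), orig_hist)

print(ext_data_np.tolist())
```

This should produce the same values as the reduce-based version, as a NumPy array rather than a list.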

I might be misunderstanding something, but I believe that fitting to the histogram is exactly what you should do: you're trying to approximate the probability density, and the histogram is as close as you can get to it. You just have to normalize it so that it integrates to 1, or allow your fitted model to contain an arbitrary prefactor.

import numpy as np
import scipy.stats as stats
import scipy.optimize as opt
import matplotlib.pyplot as plt

orig_hist = np.array([10, 5, 3, 2, 1])
# with unit-width bins, dividing by the total count makes the histogram integrate to 1
norm_hist = orig_hist/float(orig_hist.sum())
x = np.arange(len(norm_hist))

# fit only the shape parameter c; loc and scale keep their defaults (0 and 1)
popt,pcov = opt.curve_fit(lambda x,c: stats.weibull_min.pdf(x,c), x, norm_hist)

plt.figure()
plt.plot(x,norm_hist,'o-',label='norm_hist')
plt.plot(x,stats.weibull_min.pdf(x,*popt),'s-',label='Weibull_min fit')
plt.legend()
plt.show()

Of course, for your given input the Weibull fit will be far from satisfactory:

[Figure: fit to the data]

Update

As I mentioned above, Weibull_min is a poor fit to your sample input. The bigger problem is that it is also a poor fit to your actual data:

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

[Figure: new histogram data]

There are two main problems with this histogram. The first, as I said, is that it is unlikely to correspond to a Weibull_min distribution: it is maximal near zero and has a long tail, so it needs a non-trivial combination of Weibull parameters. Furthermore, your histogram clearly contains only a part of the distribution. This implies that my normalizing suggestion above is guaranteed to fail. You can't avoid using an arbitrary scale parameter in your fit.

I manually defined a scaled Weibull fitting function according to the formula on Wikipedia:

my_weibull = lambda x,l,c,A: A*float(c)/l*(x/float(l))**(c-1)*np.exp(-(x/float(l))**c)

In this function x is the independent variable, l is lambda (the scale parameter), c is k (the shape parameter) and A is a scaling prefactor. The slight upside of introducing A is that you don't have to normalize your histogram.
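As a sanity check on the hand-written formula, it can be compared against scipy's own density. The following is only a sketch: the function is rewritten as a def for readability, and the values of l, c and the x grid are arbitrary illustrative choices, not taken from the data.

```python
import numpy as np
import scipy.stats as stats

def my_weibull(x, l, c, A):
    """Weibull density with scale l and shape c, multiplied by a prefactor A."""
    return A * c / l * (x / l) ** (c - 1) * np.exp(-(x / l) ** c)

x = np.linspace(0.5, 10.0, 20)   # stays away from x=0
l, c = 3.0, 1.5                  # arbitrary illustrative values

# With A=1 the formula should agree with weibull_min.pdf(x, c, scale=l)
matches = np.allclose(my_weibull(x, l, c, 1.0),
                      stats.weibull_min.pdf(x, c, scale=l))
print(matches)
```

If the two agree, A is the only difference between the fitting function and a properly normalized weibull_min density.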

Now, when I dropped this function into scipy.optimize.curve_fit, I found what you did: it doesn't actually perform a fit, but sticks with the initial fitting parameters, whatever you set (using the p0 parameter; by default the initial guess is 1 for every parameter). And curve_fit seems to think that the fit converged.

After more than an hour's wall-related head-banging, I realized that the problem is that the singular behaviour at x=0 throws off the nonlinear least-squares algorithm. By excluding your very first data point, you get an actual fit to your data. I suspect that if we set c=1 and don't allow it to vary, then this problem might go away, but it is probably more informative to let it be fitted (so I didn't check).

Here's the corresponding code:

import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt

orig_hist = np.array([ 23., 14., 13., 12., 12., 12., 11., 11., 11., 11., 10., 10., 10., 9., 9., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.], dtype=np.float32)

my_weibull = lambda x,l,c,A: A*float(c)/l*(x/float(l))**(c-1)*np.exp(-(x/float(l))**c)

x = np.arange(len(orig_hist))

# throw away x=0: the model is singular there and stalls the least-squares fit
popt,pcov = opt.curve_fit(my_weibull,x[1:],orig_hist[1:])

plt.figure()
plt.plot(x,orig_hist,'o-',label='orig_hist')
plt.plot(x,my_weibull(x,*popt),'s-',label='Scaled Weibull fit')
plt.legend()
plt.show()

Result:

[Figure: new fit]

In [631]: popt
Out[631]: array([  1.10511850e+02,   8.82327822e-01,   1.05206207e+03])

The final fitted parameters are in the order (l,c,A), with a shape parameter of around 0.88. This corresponds to a probability density that diverges at x=0, which explains why a few warnings pop up saying

RuntimeWarning: invalid value encountered in power

and why there isn't a data point from the fit at x=0. But judging from the visual agreement between data and fit, you can assess whether the result is acceptable or not.
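The non-finite values can be reproduced in isolation. The sketch below uses parameters rounded from the popt shown above. The second case, a negative scale, is an assumption about one way a NaN from the power operation could arise while the optimizer explores parameters; the answer itself does not confirm that this is what happened.

```python
import numpy as np

def my_weibull(x, l, c, A):
    return A * c / l * (x / l) ** (c - 1) * np.exp(-(x / l) ** c)

l, c, A = 110.5, 0.88, 1052.0   # rounded from the fitted popt

with np.errstate(all="ignore"):  # silence the RuntimeWarnings for this demo
    at_zero = my_weibull(np.array([0.0]), l, c, A)     # 0 raised to a negative power
    neg_scale = my_weibull(np.array([1.0]), -l, c, A)  # negative base, fractional power

print(at_zero)     # not finite: the density diverges at x=0 when c < 1
print(neg_scale)   # not a number
```

Either kind of value in the residuals is enough to confuse a least-squares routine, which is consistent with dropping the x=0 point before fitting.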

If you want to go further, you can try generating points using np.random.weibull with these parameters, then comparing the resulting histograms with your own.
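A rough sketch of that comparison follows. It uses the newer Generator API (np.random.default_rng) rather than np.random.weibull directly; both draw from a unit-scale Weibull, so the samples are multiplied by l. The seed, the sample size, and the assumption that the histogram covers unit-width bins from 0 to 50 are illustrative choices.

```python
import numpy as np

l, c = 110.5, 0.88                  # scale and shape, rounded from popt
rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily

# weibull(c) samples have scale 1; multiplying by l rescales them
samples = l * rng.weibull(c, size=10_000)

# Count the samples in unit-width bins with edges 0, 1, ..., 50
counts, edges = np.histogram(samples, bins=np.arange(51))
fraction = counts / samples.size    # share of all samples falling in each bin
print(fraction[:5])
```

Since the fitted model is A times a normalized density and the bins have width 1, multiplying fraction by the fitted A should put it on roughly the same scale as orig_hist for a visual comparison.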


 