Fit distribution to data with uneven bins python
I have a set of data in histogram format with uneven bin sizes, which represents the weight of horses at the point in their lifetimes when they are switched from grazing to a racing diet. Here is a data sample:
Weight - Headcount
0-600lb: 340,000
600-699lb: 365,000
700-799lb: 494,000
800-899lb: 430,000
900-999lb: 110,000
1000-3000lb: 40,000
I know that the majority of the 0-600lb category will be towards the heavier end, and the opposite would be true for the 1000-3000lb category, so I'm looking for a distribution that peaks around the middle and falls off on both sides. Additionally, this may be a combination of two distributions, as it's possible male and female horses have their diets switched at different times. Then again, maybe not, so a solution that ignores this factor would still be fantastic!
How can I try a series of distributions to see which best fits my data in Python?
I would assume that this data follows a normal distribution, so that is where I would start.
When the bin widths are even, you can use the bin center as the x value and the bin height as the y value. In your case, since the bins are uneven, you should compare your data to the integral of the objective function over each bin, i.e. the difference of its CDF at the two bin edges. For example, the code below:
import scipy.optimize
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt

# Bin edges, bin centers (used for plotting only) and counts per bin
bins = np.array([0, 600, 700, 800, 900, 1000, 3000])
binc = np.array([300, 650, 750, 850, 950, 2000])
weights = np.array([340000, 365000, 494000, 430000, 110000, 40000])

def fGaussianCDF(bins, *params):
    # Expected count per bin: N times the normal probability mass between the bin edges
    N = params[0]
    mu = params[1]
    sigma = params[2]
    return N*(scipy.stats.norm.cdf(bins[1:], mu, sigma) - scipy.stats.norm.cdf(bins[:-1], mu, sigma))

fig, ax = plt.subplots(1, 1)
ax.plot(binc, weights, "ok")
ax.set_xlabel("Weight (lbs.)", fontsize=16)
ax.set_ylabel("Counts", fontsize=16)

# xdata holds the 7 bin edges; the model returns one value per bin (6 values)
popt, _ = scipy.optimize.curve_fit(fGaussianCDF, bins, weights, p0=[1.8e6, 730, 150])
ax.plot(binc, fGaussianCDF(bins, *popt), "rx")
print(popt)
plt.show()
This gives a best-fit result of mu=736 lb and sigma=146 lb. The plotted results look like:
This is not a perfect fit, but hopefully it is what you are looking for.
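To try a series of distributions, as the question asks, the same bin-integral idea can be repeated for other CDFs and the residuals compared. The following is only a rough sketch under some assumptions of mine: the candidate set (normal, lognormal, gamma), the starting guesses, the positivity bounds, and the use of the unweighted sum of squared errors (SSE) as the comparison criterion are all illustrative choices, not part of the answer above. The helper name make_binned_model is hypothetical. With only six bins and three free parameters per model, the ranking should be read as a hint rather than a verdict.

```python
import numpy as np
import scipy.optimize
import scipy.stats

# Bin edges and counts from the question
bins = np.array([0, 600, 700, 800, 900, 1000, 3000], dtype=float)
counts = np.array([340000, 365000, 494000, 430000, 110000, 40000], dtype=float)

# Assumed candidate set: name -> (two-parameter CDF, starting guess).
# The lognormal and gamma locations are fixed at 0 to keep three free
# parameters per model (N plus two shape/scale parameters).
candidates = {
    "normal": (lambda x, mu, sigma: scipy.stats.norm.cdf(x, mu, sigma),
               [730.0, 150.0]),
    "lognormal": (lambda x, s, scale: scipy.stats.lognorm.cdf(x, s, loc=0, scale=scale),
                  [0.2, 730.0]),
    "gamma": (lambda x, a, scale: scipy.stats.gamma.cdf(x, a, loc=0, scale=scale),
              [25.0, 30.0]),
}

def make_binned_model(cdf):
    """Wrap a CDF so that it returns the expected count in each bin."""
    def model(edges, N, p1, p2):
        # Probability mass per bin is the CDF difference between adjacent edges
        return N * np.diff(cdf(edges, p1, p2))
    return model

results = {}
for name, (cdf, start) in candidates.items():
    model = make_binned_model(cdf)
    # All parameters here are positive, so a small positive lower bound
    # keeps the optimizer away from invalid scales
    popt, _ = scipy.optimize.curve_fit(
        model, bins, counts, p0=[1.8e6, *start], bounds=(1e-9, np.inf)
    )
    sse = float(np.sum((model(bins, *popt) - counts) ** 2))
    results[name] = (popt, sse)

# Print the candidates from smallest to largest SSE
for name, (popt, sse) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name:10s} SSE={sse:.3e} params={np.round(popt, 3)}")
```

A smaller SSE means the model reproduces the bin counts more closely. Because every candidate here has the same number of free parameters, the SSE values are directly comparable; a mixture of two distributions would need more parameters than six bins can reasonably constrain.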