
Fit distribution to data with uneven bins in Python

I have a set of data in histogram format with uneven bin sizes. It represents the weight of horses at the point in their lives when they are switched from a grazing diet to a racing diet. Here is a data sample:

Weight - Headcount

0-600lb: 340,000

600-699lb: 365,000

700-799lb: 494,000

800-899lb: 430,000

900-999lb: 110,000

1000-3000lb: 40,000

I know that most of the 0-600lb category will be towards the heavier end, and that the opposite holds for the 1000-3000lb category, so I'm looking for a distribution that peaks around the middle and falls off on either side. Additionally, this may be a combination of two distributions, as it's possible that male and female horses have their diets switched at different times. Then again, maybe not, so a solution that ignores this factor would still be fantastic!

How can I try a series of distributions to see which best fits my data in Python?

I would assume that this data follows a normal distribution, so that is where I would start.

When the bin width is uniform, you can use the bin center as the x value and the bin height as the y value. In your case, since the bins are uneven, you should compare your data to the integral of the model function over each bin. For example, the code below:

import scipy.optimize
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt

# Bin edges, bin centers (used only for plotting), and headcount per bin
bins = [0, 600, 700, 800, 900, 1000, 3000]
binc = [300, 650, 750, 850, 950, 2000]
weights = [340000, 365000, 494000, 430000, 110000, 40000]


def fGaussianCDF(bins, *params):
    """Expected count in each bin for a Gaussian scaled to N total counts."""
    N, mu, sigma = params

    # Integral of the Gaussian over each bin = CDF(upper edge) - CDF(lower edge)
    cdf = scipy.stats.norm.cdf(bins, mu, sigma)
    return N * np.diff(cdf)


fig, ax = plt.subplots(1, 1)
ax.plot(binc, weights, "ok")
ax.set_xlabel("Weight (lbs.)", fontsize=16)
ax.set_ylabel("Counts", fontsize=16)


# xdata holds the 7 bin edges; the model returns one value per bin (6 values)
popt, _ = scipy.optimize.curve_fit(fGaussianCDF, bins, weights, p0=[1.8e6, 730, 150])
ax.plot(binc, fGaussianCDF(bins, *popt), "rx")
print(popt)

plt.show()

This gives a best-fit mean of mu=736 lb and a standard deviation of sigma=146 lb. The plotted results look like:

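[Figure: observed counts per bin (black circles) and fitted Gaussian counts per bin (red crosses) versus weight in lbs.]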
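Because the bins differ in width, raw counts plotted at bin centers are hard to compare by eye with a smooth curve. The following is a minimal sketch of one way to look at the same fit as a density: each count is divided by its bin width (horses per lb) and set against N times the Gaussian PDF. It assumes the rounded values quoted above (mu=736, sigma=146) and, as a further assumption, takes N to be the total headcount, since the fitted N is not quoted. The names `density` and `model_density` are illustrative only.

```python
import numpy as np
import scipy.stats

# Bin edges and headcounts copied from the question.
edges = np.array([0, 600, 700, 800, 900, 1000, 3000], dtype=float)
counts = np.array([340000, 365000, 494000, 430000, 110000, 40000], dtype=float)

# Rounded fit values quoted in the answer; N is ASSUMED to be the total headcount.
N_fit, mu_fit, sigma_fit = counts.sum(), 736.0, 146.0

# Dividing by bin width turns counts into a density (horses per lb),
# which makes bins of different widths directly comparable.
widths = np.diff(edges)
density = counts / widths

# Model density on a fine grid: N times the Gaussian PDF.
x = np.linspace(edges[0], edges[-1], 601)
model_density = N_fit * scipy.stats.norm.pdf(x, mu_fit, sigma_fit)

for lo, hi, d in zip(edges[:-1], edges[1:], density):
    print(f"{lo:6.0f}-{hi:<6.0f} lb: {d:8.1f} horses per lb")
print(f"peak of model density: {model_density.max():.1f} horses per lb")
```

Plotted as a step histogram with `model_density` drawn on top, this view shows the wide outer bins at their true (low) density instead of as single points far from the peak.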

The Gaussian fit is not perfect, but hopefully it is what you are looking for.
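The same bin-integral idea extends to the original question of trying several distributions. The sketch below is one possible way to do it, not a definitive method: it fits a handful of two-parameter location-scale families from `scipy.stats` to the binned counts and ranks them by the sum of squared residuals (SSE). The candidate list, the helper names `expected_counts` and `fit_binned`, the starting values, and the use of SSE as the ranking criterion are all assumptions made for illustration. With only six bins, small differences in SSE should not be over-interpreted.

```python
import numpy as np
import scipy.optimize
import scipy.stats

# Bin edges and headcounts copied from the question.
edges = np.array([0, 600, 700, 800, 900, 1000, 3000], dtype=float)
counts = np.array([340000, 365000, 494000, 430000, 110000, 40000], dtype=float)

# Candidate location-scale families (an illustrative choice, not an exhaustive list).
candidates = {
    "norm": scipy.stats.norm,
    "logistic": scipy.stats.logistic,
    "gumbel_l": scipy.stats.gumbel_l,
    "gumbel_r": scipy.stats.gumbel_r,
}


def expected_counts(dist, edges, n_total, loc, scale):
    """Expected count per bin: n_total times the CDF difference across each bin."""
    # abs() keeps the scale positive if the optimizer tries a negative value
    cdf = dist.cdf(edges, loc=loc, scale=abs(scale))
    return n_total * np.diff(cdf)


def fit_binned(dist, edges, counts, p0):
    """Least-squares fit of (n_total, loc, scale) to the binned counts."""
    def model(e, n_total, loc, scale):
        return expected_counts(dist, e, n_total, loc, scale)

    popt, _ = scipy.optimize.curve_fit(model, edges, counts, p0=p0)
    sse = float(np.sum((model(edges, *popt) - counts) ** 2))
    return {"n_total": popt[0], "loc": popt[1], "scale": abs(popt[2]), "sse": sse}


results = {}
for name, dist in candidates.items():
    try:
        results[name] = fit_binned(dist, edges, counts, p0=[counts.sum(), 730, 150])
    except RuntimeError:
        # curve_fit raises RuntimeError when the optimizer fails to converge
        continue

# Print the candidates from lowest to highest SSE.
for name, r in sorted(results.items(), key=lambda kv: kv[1]["sse"]):
    print(f"{name:10s} loc={r['loc']:7.1f} scale={r['scale']:6.1f} SSE={r['sse']:.3e}")
```

All candidates here have the same number of free parameters, so comparing SSE directly is reasonable as a first pass. A two-component mixture, as suggested in the question, could be modelled by summing two `expected_counts` terms, although six bins leave very few degrees of freedom for the extra parameters.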
