Fitting a distribution to data with uneven bins in Python

Question

I have a set of data in histogram format with uneven bin sizes, which represents the weight of horses at the point in their lifetimes when they are switched from grazing to a racing diet. Here is a data sample:

Weight - Headcount

0-600lb: 340,000

600-699lb: 365,000

700-799lb: 494,000

800-899lb: 430,000

900-999lb: 110,000

1000-3000lb: 40,000

I know that the majority of the 0-600lb category will be towards the heavier end, and the opposite would be true for the 1000-3000lb category, so I'm looking for a distribution that peaks around the middle and decreases towards both ends. Additionally, this may be a combination of two distributions, as it's possible that male and female horses have their diets switched at different times. Then again, maybe not, so a solution that ignores this factor would still be fantastic!

How can I try a series of distributions to see which best fits my data in Python?

Answer 1

I would assume that this data follows a normal distribution, so that is where I would start.

When the bin widths are uniform, you can use the bin center as the x value and the bin height as the y value. In your case, since the bins are uneven, you should instead compare each bin's count with the integral of the model density over that bin, i.e. N times the difference of the model CDF at the bin's two edges. For example, the code below:

import scipy.optimize
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt

# Bin edges (lb), bin centers (used only for plotting), and headcount per bin
bins = [0, 600, 700, 800, 900, 1000, 3000]
binc = [300, 650, 750, 850, 950, 2000]
weights = [340000, 365000, 494000, 430000, 110000, 40000]


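def fGaussianCDF(bins, *params):
    # Expected count per bin: N times the normal probability mass
    # between consecutive bin edges (difference of the CDF at the edges).
    N = params[0]
    mu = params[1]
    sigma = params[2]

    return N*(scipy.stats.norm.cdf(bins[1:], mu, sigma) - scipy.stats.norm.cdf(bins[:-1], mu, sigma) )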


fig, ax = plt.subplots(1, 1)
ax.plot(binc, weights, "ok")
ax.set_xlabel("Weight (lbs.)", fontsize=16)
ax.set_ylabel("Counts", fontsize=16)


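# Fit N, mu and sigma; the model takes the bin edges and returns one expected count per bin
popt, _ = scipy.optimize.curve_fit(fGaussianCDF, bins, weights, p0=[1.8e6, 730, 150])
# Overlay the fitted bin counts (red crosses) on the data (black circles)
ax.plot(binc, fGaussianCDF(bins, *popt), "rx")
print(popt)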

plt.show()

This gives a best fit with a mean of mu=736 lb and a standard deviation of sigma=146 lb. The plotted results look like:

[Figure: data (black circles) and fitted bin counts (red crosses), plotted at the bin centers]

This is not a perfect fit, but hopefully it is close to what you are looking for.
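To try a series of distributions, as the question asks, the same bin-integral idea can be applied to each candidate in turn. The following is only a rough sketch under stated assumptions, and it departs from the code above: instead of a least-squares fit with curve_fit, it minimizes a multinomial negative log-likelihood on the bin probabilities with Nelder-Mead. The candidate families (norm, lognorm, gamma), the starting values, and the helper names bin_probabilities and negative_log_likelihood are choices made here for illustration.

```python
import numpy as np
import scipy.optimize
import scipy.stats

# Bin edges (lb) and headcounts from the question.
edges = np.array([0, 600, 700, 800, 900, 1000, 3000], dtype=float)
counts = np.array([340000, 365000, 494000, 430000, 110000, 40000], dtype=float)
fractions = counts / counts.sum()

# Illustrative candidates (an assumption, not a recommendation).
# Each entry maps two unconstrained parameters to a frozen distribution;
# exp() keeps shape and scale parameters positive during the search.
candidates = {
    "norm": (
        lambda p: scipy.stats.norm(loc=p[0], scale=np.exp(p[1])),
        [730.0, np.log(150.0)],
    ),
    "lognorm": (
        lambda p: scipy.stats.lognorm(s=np.exp(p[0]), scale=np.exp(p[1])),
        [np.log(0.2), np.log(730.0)],
    ),
    "gamma": (
        lambda p: scipy.stats.gamma(a=np.exp(p[0]), scale=np.exp(p[1])),
        [np.log(25.0), np.log(30.0)],
    ),
}


def bin_probabilities(dist, bin_edges):
    """Probability mass the distribution assigns to each bin."""
    return np.diff(dist.cdf(bin_edges))


def negative_log_likelihood(params, make_dist):
    """Multinomial negative log-likelihood per observation."""
    probs = bin_probabilities(make_dist(params), edges)
    probs = np.clip(probs, 1e-300, None)  # avoid log(0)
    return -np.sum(fractions * np.log(probs))


results = {}
for name, (make_dist, start) in candidates.items():
    fit = scipy.optimize.minimize(
        negative_log_likelihood, start, args=(make_dist,), method="Nelder-Mead"
    )
    results[name] = (fit.fun, make_dist(fit.x))

# Lower negative log-likelihood means a better fit to the binned data.
for name, (nll, dist) in sorted(results.items(), key=lambda item: item[1][0]):
    expected = counts.sum() * bin_probabilities(dist, edges)
    print(f"{name:8s} nll/obs = {nll:.5f}  expected = {np.round(expected)}")
```

All three candidates have two free parameters, so the one with the lowest negative log-likelihood is the best fit among them. With only six bins there are just five independent proportions, so a two-component normal mixture (five parameters) would have as many parameters as there are independent data points, and this data could not meaningfully test it.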
