Fitting an un-normalised distribution with scipy.stats

I'm trying to fit a histogram, but the fit only works with normalised data, i.e. with the option normed=True in the histogram (renamed density=True in newer Matplotlib releases). Is there a way of doing this with scipy.stats (or another method)? Here is a MWE using a uniform distribution:

import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import uniform

data = []
for i in range(1000):
    data.append(random.uniform(-1,1))

loc, scale = uniform.fit(data)

x = np.linspace(-1,1, 1000)
y = uniform.pdf(x, loc, scale)

plt.hist(data, bins=100, density=False)  # raw counts; older Matplotlib spelled this normed=False
plt.plot(x, y, 'r-')
plt.show()


I also tried defining my own function (below), but I'm getting a bad fit.

import matplotlib.pyplot as plt
import numpy as np
import random
from scipy import optimize

data = []
for i in range(1000):
    data.append(random.uniform(-1,1))

def unif(x,avg,sig):
    return avg*x + sig

y, base = np.histogram(data,bins=100)
x = [0.5 * (base[i] + base[i+1]) for i in range(len(base)-1)]  # bin centres (xrange in Python 2)

popt, pcov = optimize.curve_fit(unif, x, y)
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = unif(x_fit, *popt)

plt.hist(data, bins=100, density=False)  # raw counts; older Matplotlib spelled this normed=False
plt.plot(x_fit, y_fit, 'r-')
plt.show()


Note that it is generally a bad idea to fit a distribution to the histogram. Compared to the raw data, the histogram contains less information, so the fit will most likely be worse. Thus, the first MWE in the question actually contains the best approach. Simply normalize the histogram and it will match the distribution of the data: plt.hist(data, bins=100, normed=True) (density=True in newer Matplotlib releases).

However, it seems you actually want to work with the unnormalized histogram. In that case, take the normalization that the histogram would normally use and apply its inverse to the fitted distribution. The documentation describes the normalization as

n / (len(x) * dbin)

which is a verbose way of saying: divide by the number of observations times the bin width.
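
As a quick numerical check of that formula, here is a minimal sketch that compares NumPy's own density normalization with the raw counts divided by len(data) * bin_width. It uses np.histogram instead of plt.hist so that no plot window is needed (an assumption made for this check; both apply the same normalization), and the seed and variable names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = rng.uniform(-1, 1, size=1000)

n_bins = 100
counts, edges = np.histogram(data, bins=n_bins)             # raw counts per bin
density, _ = np.histogram(data, bins=n_bins, density=True)  # normalized heights

bin_width = np.diff(edges)  # one width per bin (all equal here)
manual_density = counts / (len(data) * bin_width)

# The manual normalization should reproduce NumPy's density values,
# and the normalized histogram should integrate to one.
print(np.allclose(density, manual_density))
print(np.isclose(np.sum(density * bin_width), 1.0))
```

In other words, the factor that separates counts from density is len(data) * bin_width.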

Multiplying the fitted distribution by that same factor (number of observations times bin width) gives the expected counts per bin:

loc, scale = uniform.fit(data)

x = np.linspace(-1,1, 1000)
y = uniform.pdf(x, loc, scale)

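n_bins = 100
bin_width = np.ptp(data) / n_bins  # hist spans min(data)..max(data) by default

plt.hist(data, bins=n_bins, density=False)  # raw counts; older Matplotlib spelled this normed=False
plt.plot(x, y * len(data) * bin_width, 'r-')  # pdf scaled to expected counts per bin
plt.show()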

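The same scaling is not specific to the uniform distribution. The sketch below wraps it in a small helper and applies it to normally distributed data. The helper name expected_counts, the choice of scipy.stats.norm, and the sample parameters are all assumptions made for illustration and are not part of the original question.

```python
import numpy as np
from scipy import stats


def expected_counts(dist, params, x, n_obs, bin_width):
    """Scale a fitted pdf so it is comparable to an unnormalized histogram.

    Hypothetical helper: dist is a scipy.stats continuous distribution and
    params is the tuple returned by dist.fit(data).
    """
    return dist.pdf(x, *params) * n_obs * bin_width


rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
data = rng.normal(loc=2.0, scale=0.5, size=5000)

params = stats.norm.fit(data)  # fit the raw data, not the histogram

counts, edges = np.histogram(data, bins=50)
centers = 0.5 * (edges[:-1] + edges[1:])
bin_width = edges[1] - edges[0]  # equal-width bins

curve = expected_counts(stats.norm, params, centers, len(data), bin_width)

# Summed over all bins, the expected counts should be close to len(data).
print(curve.sum(), len(data))
```

To overlay it on a plot, draw the counts with plt.hist(data, bins=50) and then call plt.plot(centers, curve, 'r-').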


The second MWE is interesting because you describe the line as a bad fit, but actually it is a very good fit :). You simply overfit the histogram: although you expect a horizontal line (one degree of freedom), you fit an arbitrary line (two degrees of freedom).

So if you want a horizontal line, fit a horizontal line, and don't be surprised to get something else if you fit something else...

def unif(x, sig):
    return 0 * x + sig  # slope is zero -> horizontal line
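
For completeness, here is a hedged, self-contained sketch of the question's second MWE with this one-parameter function plugged into optimize.curve_fit. The seed and the use of NumPy's random generator are assumptions made here for reproducibility; the original used Python's random module.

```python
import numpy as np
from scipy import optimize


def unif(x, sig):
    return 0 * x + sig  # slope is zero -> horizontal line


rng = np.random.default_rng(2)  # fixed seed, chosen only for reproducibility
data = rng.uniform(-1, 1, size=1000)

y, base = np.histogram(data, bins=100)  # raw counts and bin edges
x = 0.5 * (base[:-1] + base[1:])        # bin centres

# With a single free parameter the least-squares solution is the mean count.
popt, pcov = optimize.curve_fit(unif, x, y)
print(popt[0], np.mean(y))
```

The fitted height and the mean count should agree to within the optimizer's tolerance, because the constant that minimizes the squared error is the mean of the fitted values.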

However, there is a much simpler way of obtaining the height of the unnormalized uniform distribution. Just average the histogram over all bins:

y, base = np.histogram(data,bins=100)
y_hat = np.mean(y)
print(y_hat)
# 10.0

Or, even simpler, use the theoretical value of len(data) / n_bins == 10.
