
Chi-squared goodness of fit test in Python: way too low p-values, but the fitting function is correct

Despite having searched related questions for two days, I have not yet found an answer to this problem...

In the following code, I generate n normally distributed random variables, which are then represented in a histogram:

import numpy as np
import matplotlib.pyplot as plt

n = 10000                        # number of generated random variables
x = np.random.normal(0, 1, n)    # generate n random variables

# plot this in a non-normalized histogram
# (density=False replaces the removed normed keyword):
plt.hist(x, bins='auto', density=False)

# get the arrays containing the bin counts and the bin edges:
histo, bin_edges = np.histogram(x, bins='auto', density=False)
number_of_bins = len(bin_edges) - 1

After that, a curve-fitting function and its parameters are found. It is a normal distribution with parameters a1 and b1, scaled by scaling_factor to account for the fact that the sample is unnormalized. It indeed fits the histogram quite well:

import scipy as sp
import scipy.stats           # makes sp.stats available

a1, b1 = sp.stats.norm.fit(x)

scaling_factor = n*(x.max() - x.min())/number_of_bins

# grid on which to evaluate the fitted density:
x_achse = np.linspace(x.min(), x.max(), 1000)
plt.plot(x_achse, scaling_factor*sp.stats.norm.pdf(x_achse, a1, b1), 'r')

Here's the plot of the histogram with the fitting function in red.

After that, I want to test how well this function fits the histogram using the chi-squared test. This test compares the observed values with the expected values at those points. To calculate the expected values, I first calculate the location of the middle of each bin; this information is contained in the array x_middle. I then evaluate the fitting function at the middle point of each bin, which gives the expected_values array:

observed_values = histo

bin_width = bin_edges[1] - bin_edges[0]

# array containing the middle point of each bin:
x_middle = np.linspace(  bin_edges[0] + 0.5*bin_width,    
           bin_edges[0] + (0.5 + number_of_bins)*bin_width,
           num = number_of_bins) 

expected_values = scaling_factor*sp.stats.norm.pdf(x_middle,a1,b1)

Plugging this into Scipy's chisquare function, I get p-values of roughly 1e-5 to 1e-15, which would mean the fitting function does not describe the histogram:

print(sp.stats.chisquare(observed_values,expected_values,ddof=2)) 

But this is not true; the function fits the histogram very well!

Does anybody know where I made a mistake?

Thanks a lot! Charles

PS: I set the number of delta degrees of freedom to 2, because the two parameters a1 and b1 are estimated from the sample. I tried other values of ddof, but the results were just as poor!
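(For illustration: ddof=2 means chisquare compares the statistic against a chi-squared distribution with k - 1 - 2 degrees of freedom, where k is the number of bins. A minimal sketch with purely made-up bin counts:)

```python
import numpy as np
from scipy import stats

# Hypothetical observed and expected counts for k = 5 bins;
# the totals agree, as the chi-squared test requires.
observed = np.array([18, 55, 70, 45, 12])
expected = np.array([20, 50, 72, 44, 14])

# chisquare with ddof=2 uses k - 1 - 2 degrees of freedom,
# accounting for the two fitted parameters.
stat, p = stats.chisquare(observed, expected, ddof=2)

# The same p-value computed by hand from the survival function:
k = len(observed)
p_manual = stats.chi2.sf(stat, df=k - 1 - 2)
```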

Your calculation of the end point of the array x_middle is off by one; it should be:

x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,    
                       bin_edges[0] + (0.5 + number_of_bins - 1)*bin_width,
                       num=number_of_bins)

Note the extra - 1 in the second argument of linspace().

A more concise version is

x_middle = 0.5*(bin_edges[1:] + bin_edges[:-1])
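As a quick sanity check, the corrected linspace() call and the vectorized one-liner produce the same midpoints (a small sketch; the random data here only serves to generate equal-width bin edges, which is what np.histogram with bins='auto' returns):

```python
import numpy as np

# Generate some sample data and equal-width bin edges:
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
histo, bin_edges = np.histogram(x, bins='auto')
number_of_bins = len(bin_edges) - 1
bin_width = bin_edges[1] - bin_edges[0]

# Corrected linspace version (note the "- 1"):
x_linspace = np.linspace(bin_edges[0] + 0.5*bin_width,
                         bin_edges[0] + (0.5 + number_of_bins - 1)*bin_width,
                         num=number_of_bins)

# Vectorized midpoint formula:
x_vectorized = 0.5*(bin_edges[1:] + bin_edges[:-1])

print(np.allclose(x_linspace, x_vectorized))  # True
```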

A different (and possibly more accurate) approach to computing expected_values is to use differences of the CDF, instead of approximating those differences with the PDF at the middle of each interval:

In [75]: from scipy import stats

In [76]: cdf = stats.norm.cdf(bin_edges, a1, b1)

In [77]: expected_values = n * np.diff(cdf)

With that calculation, I get the following result from the chi-squared test:

In [85]: stats.chisquare(observed_values, expected_values, ddof=2)
Out[85]: Power_divergenceResult(statistic=61.168393496775181, pvalue=0.36292223875686402)
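Putting the pieces together, a self-contained sketch of the corrected procedure might look like this (the seed, the rng generator, and the final rescaling line are my additions, not part of the original answer; the rescaling makes the observed and expected totals agree exactly, since the fitted tails put a little probability mass outside the histogram range):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10000
x = rng.normal(0, 1, n)

# observed counts and bin edges:
observed_values, bin_edges = np.histogram(x, bins='auto')

# fit the normal distribution to the sample:
a1, b1 = stats.norm.fit(x)

# expected count per bin from CDF differences:
cdf = stats.norm.cdf(bin_edges, a1, b1)
expected_values = n * np.diff(cdf)

# rescale so both totals agree exactly:
expected_values *= observed_values.sum() / expected_values.sum()

# ddof=2 because two parameters were estimated from the sample:
result = stats.chisquare(observed_values, expected_values, ddof=2)
print(result)
```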
