Chi-Squared test in Python
I've used the following code in R to determine how well observed values (20, 20, 0 and 0, for example) fit expected values/ratios (25% for each of the four cases, for example):
> chisq.test(c(20,20,0,0), p=c(0.25, 0.25, 0.25, 0.25))
Chi-squared test for given probabilities
data: c(20, 20, 0, 0)
X-squared = 40, df = 3, p-value = 1.066e-08
How can I replicate this in Python? I've tried using the chisquare function from scipy, but the results I obtained were very different; I'm not sure if this is even the correct function to use. I've searched through the scipy documentation, but it's quite daunting as it runs to 1000+ pages; the numpy documentation is almost 50% longer than that.
scipy.stats.chisquare expects observed and expected absolute frequencies, not ratios. You can obtain what you want with
>>> import numpy as np
>>> from scipy.stats import chisquare
>>> observed = np.array([20., 20., 0., 0.])
>>> expected = np.array([.25, .25, .25, .25]) * np.sum(observed)
>>> chisquare(observed, expected)
(40.0, 1.065509033425585e-08)
In the case that the expected values are uniformly distributed over the classes, however, you can leave out the computation of the expected values:
>>> chisquare(observed)
(40.0, 1.065509033425585e-08)
The first returned value is the χ² statistic, the second the p-value of the test.
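To see where these two numbers come from, here is a minimal sketch that recomputes them with only the standard library. It assumes the usual Pearson statistic, the sum of (observed - expected)² / expected over the classes. The helper name chi2_sf_df3 is made up for this illustration, and its closed-form survival function is only valid for 3 degrees of freedom (four classes); it is not how scipy computes the p-value in general.

```python
import math

observed = [20, 20, 0, 0]
ratios = [0.25, 0.25, 0.25, 0.25]

# Turn the expected ratios into expected absolute frequencies,
# which is the step the R call does implicitly via p=...
total = sum(observed)
expected = [r * total for r in ratios]  # 10.0 for each class

# Pearson chi-squared statistic: sum of (O - E)^2 / E over the classes.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly
    3 degrees of freedom (hypothetical helper, closed form for df=3 only)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(statistic)

print(statistic)  # 40.0
print(p_value)    # roughly 1.0655e-08, in line with the R output
```

For any other number of classes, scipy.stats.chisquare is the simpler route, since it handles the degrees of freedom for you.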
Just wanted to point out that while the answer appears to be correct syntactically, you should not be using a Chi-squared distribution with your example because you have observed frequencies that are too small for an accurate Chi-square test.
"This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5." See: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html#scipy.stats.chisquare
An alternative would be to call your R code from Python. You can do this with Rscript, which runs R scripts from the command line. From Python you can then run an R script by executing a system call using either subprocess or os.system. Any data exchange is done through text or binary files. I like this approach because it is very simple, and it is easy to debug the R script separately from the Python code. The downside is that all data goes through the hard drive, which could prove to be very slow.
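As a rough sketch of this approach, the snippet below writes the R code from the question to a temporary file and runs it with Rscript through subprocess. It assumes R is installed and Rscript is on the PATH; the function name run_r_code and the file name chisq.R are made up for this example. Here the result comes back as captured text on standard output rather than through a data file, which is enough for a small test like this one.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# The R call from the question, wrapped in print() so Rscript shows the result.
R_CODE = "print(chisq.test(c(20, 20, 0, 0), p = c(0.25, 0.25, 0.25, 0.25)))\n"


def run_r_code(r_code):
    """Run a string of R code with Rscript and return its standard output.

    Returns None when Rscript cannot be found on the PATH
    (hypothetical helper, not part of any library).
    """
    rscript = shutil.which("Rscript")
    if rscript is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "chisq.R"
        script.write_text(r_code)
        # check=True raises CalledProcessError if the R script fails.
        result = subprocess.run(
            [rscript, str(script)],
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout


output = run_r_code(R_CODE)
print(output if output is not None else "Rscript not found on PATH")
```

For larger data you would write the inputs to a CSV or binary file, pass its path to the R script as a command-line argument, and read the results back from another file.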