Chi-squared test in Python

Question

I've used the following code in R to determine how well observed values (20, 20, 0 and 0, for example) fit expected values/ratios (25% for each of the four cases, for example):

> chisq.test(c(20,20,0,0), p=c(0.25, 0.25, 0.25, 0.25))

    Chi-squared test for given probabilities

data:  c(20, 20, 0, 0)

X-squared = 40, df = 3, p-value = 1.066e-08

How can I replicate this in Python? I've tried using the chisquare function from scipy, but the results I obtained were very different; I'm not sure this is even the correct function to use. I've searched through the scipy documentation, but it's quite daunting as it runs to 1000+ pages; the numpy documentation is almost 50% longer than that.

Answer 1

scipy.stats.chisquare expects observed and expected absolute frequencies, not ratios. You can obtain what you want with

>>> import numpy as np
>>> from scipy.stats import chisquare
>>> observed = np.array([20., 20., 0., 0.])
>>> expected = np.array([.25, .25, .25, .25]) * np.sum(observed)
>>> chisquare(observed, expected)
(40.0, 1.065509033425585e-08)

When the expected values are uniformly distributed over the classes, as they are here, you can leave out the computation of the expected values:

>>> chisquare(observed)
(40.0, 1.065509033425585e-08)

The first returned value is the χ² statistic, the second the p-value of the test.
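
To see where those two numbers come from, here is a minimal sketch that computes Pearson's statistic by hand and compares it with scipy's result. The variable names are my own, and it assumes the degrees of freedom are the number of categories minus one, which matches the df = 3 that R reports in the question.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([20.0, 20.0, 0.0, 0.0])
probabilities = np.array([0.25, 0.25, 0.25, 0.25])

# Convert the expected ratios into expected absolute counts.
expected = probabilities * observed.sum()

# Pearson's statistic: sum over categories of (O - E)^2 / E.
statistic = float(np.sum((observed - expected) ** 2 / expected))

# Degrees of freedom: number of categories minus one (assumption stated above).
dof = observed.size - 1

# Upper-tail probability of the chi-squared distribution at the statistic.
p_value = float(chi2.sf(statistic, dof))

# Cross-check against scipy's own implementation.
scipy_statistic, scipy_p_value = chisquare(observed, f_exp=expected)

print(statistic, p_value)
print(scipy_statistic, scipy_p_value)
```

Both lines should agree with the X-squared = 40 and p-value = 1.066e-08 from R.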

Answer 2

Just wanted to point out that while the answer appears to be syntactically correct, you should not be using a chi-squared test with your example, because your observed frequencies are too small for the test to be accurate.

"This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5." See: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html#scipy.stats.chisquare

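As a rough illustration of that rule of thumb, the check below flags categories whose observed or expected count falls under a threshold. The helper name small_counts and the default of 5 are my own choices for this sketch, not part of scipy.

```python
def small_counts(observed, expected, minimum=5):
    """Return the indices of categories whose observed or expected count is below `minimum`."""
    return [
        index
        for index, (obs, exp) in enumerate(zip(observed, expected))
        if obs < minimum or exp < minimum
    ]


observed = [20, 20, 0, 0]
expected = [10, 10, 10, 10]  # 25% of the 40 observations in each category

# Categories 2 and 3 have an observed count of 0, so they are flagged.
flagged = small_counts(observed, expected)
print(flagged)
```

An empty list means every category passes the rule of thumb; anything else is a hint to treat the p-value with caution.
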
Answer 3

An alternative would be to call your R code from Python. You can do this:

by making an R script run as a command line tool. See this link for more information on running R scripts from the command line using Rscript. From Python you can then run the R script with a system call, using either subprocess or os.system. Any data exchange is done through text or binary files. I like this approach because it is very simple, and it is easy to debug the R script separately from the Python code. The downside is that all data goes through the hard drive, which could prove to be very slow.
by using rpy or rpy2 to run R code directly from within Python. This way the integration is tighter, but it also introduces its own little quirks. For example, in my experience R code called through rpy is a little harder to debug.
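
A minimal sketch of the first approach, assuming the Rscript executable is installed and on the PATH. To keep it short it passes the R expression with Rscript's -e option and reads the result from stdout, rather than exchanging data through files as described above. The function names are hypothetical.

```python
import shutil
import subprocess


def build_rscript_command(observed, probabilities):
    """Build an `Rscript -e` command that prints the chi-squared p-value."""
    obs = ", ".join(str(value) for value in observed)
    probs = ", ".join(str(value) for value in probabilities)
    expression = f"cat(chisq.test(c({obs}), p=c({probs}))$p.value)"
    return ["Rscript", "-e", expression]


def chisq_p_value_via_r(observed, probabilities):
    """Run the command and parse the p-value that R writes to stdout."""
    command = build_rscript_command(observed, probabilities)
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return float(completed.stdout.strip())


command = build_rscript_command([20, 20, 0, 0], [0.25, 0.25, 0.25, 0.25])
print(command)

# Only call R when the Rscript executable is actually available.
if shutil.which("Rscript") is not None:
    print(chisq_p_value_via_r([20, 20, 0, 0], [0.25, 0.25, 0.25, 0.25]))
```

For larger inputs you would write the data to a file and have the R script read it, which is where the hard drive overhead mentioned above comes in.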

3 answers

Answer 1: score 35, accepted, 2012-02-17 14:51:56
Answer 2: score 7, 2012-12-17 14:37:22
Answer 3: score 2, 2012-02-17 15:51:54
