
Chi-Squared test in Python

I've used the following code in R to determine how well observed values (20, 20, 0 and 0, for example) fit expected values/ratios (25% for each of the four cases, for example):

> chisq.test(c(20,20,0,0), p=c(0.25, 0.25, 0.25, 0.25))

    Chi-squared test for given probabilities

data:  c(20, 20, 0, 0)

X-squared = 40, df = 3, p-value = 1.066e-08

How can I replicate this in Python? I've tried using the chisquare function from scipy, but the results I obtained were very different; I'm not sure if this is even the correct function to use. I've searched through the scipy documentation, but it's quite daunting as it runs to 1000+ pages; the numpy documentation is almost 50% longer than that.

scipy.stats.chisquare expects observed and expected absolute frequencies, not ratios. You can obtain what you want by scaling the ratios by the total number of observations:

>>> import numpy as np
>>> from scipy.stats import chisquare
>>> observed = np.array([20., 20., 0., 0.])
>>> expected = np.array([.25, .25, .25, .25]) * np.sum(observed)
>>> chisquare(observed, expected)
(40.0, 1.065509033425585e-08)

When the expected values are uniformly distributed over the classes, you can leave out the computation of the expected values, because chisquare assumes equal expected frequencies by default:

>>> chisquare(observed)
(40.0, 1.065509033425585e-08)

The first returned value is the χ² statistic, the second the p-value of the test.
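
As a self-contained script, the same calculation might look like the sketch below. The helper name chisq_test_given_probs is hypothetical (it is not part of scipy); it only wraps the scaling step described above so that the call resembles R's chisq.test(x, p=...).

```python
import numpy as np
from scipy.stats import chisquare


def chisq_test_given_probs(observed, probs):
    """Goodness-of-fit test for observed counts against given probabilities.

    Hypothetical helper: converts the probabilities to expected absolute
    frequencies, which is what scipy.stats.chisquare needs.
    """
    observed = np.asarray(observed, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # Scale the ratios by the total count to get expected frequencies.
    expected = probs * observed.sum()
    statistic, p_value = chisquare(observed, f_exp=expected)
    return float(statistic), float(p_value)


statistic, p_value = chisq_test_given_probs([20, 20, 0, 0], [0.25, 0.25, 0.25, 0.25])
print(statistic, p_value)
```

For the counts in the question this should agree with the R output (X-squared = 40, p-value of about 1.066e-08).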

Just wanted to point out that while the answer appears to be syntactically correct, you should not be using a chi-squared distribution with your example, because your observed frequencies are too small for an accurate chi-squared test.

"This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5." See: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html#scipy.stats.chisquare

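The rule of thumb quoted above can be checked before running the test. The sketch below is illustrative only: the function name frequencies_large_enough and its minimum parameter are assumptions, not part of scipy.

```python
def frequencies_large_enough(observed, expected, minimum=5):
    """Return True if every observed and expected frequency is >= minimum.

    Hypothetical helper implementing the "at least 5" rule of thumb.
    """
    return all(o >= minimum and e >= minimum for o, e in zip(observed, expected))


observed = [20, 20, 0, 0]
total = sum(observed)
# Expected absolute frequencies for four equally likely classes.
expected = [0.25 * total] * 4

ok = frequencies_large_enough(observed, expected)
print(ok)  # False: two of the observed counts are 0
```

Here the expected frequencies are 10 each, so it is the two observed zeros that fail the check.
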
An alternative would be to call your R code from Python. You can do this:

  • by making an R script run as a command line tool. See this link for more information on running R scripts from the command line using Rscript. From Python you can then run an R script by executing a system call using either subprocess or os.system. Any data exchange is done through text or binary files. I like this approach because it is very simple, and it is easy to debug the R script separately from the Python code. The downside is that all data goes through the hard drive, which could prove to be very slow.
  • by using rpy or rpy2 to run R code directly from within Python. In this way the integration is tighter, but this coupling also introduces its own little quirks. For example, in my experience R code called through rpy is a little harder to debug.
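
The first option could be sketched roughly as follows, assuming R is installed and Rscript is on the PATH. The function names are hypothetical. For brevity this version passes the small input inline through Rscript -e and reads the result from standard output, rather than exchanging files as described above; larger data would more naturally go through files.

```python
import shutil
import subprocess


def build_rscript_command(observed, probs):
    """Build an Rscript command that runs chisq.test and prints two numbers."""
    obs = ", ".join(str(x) for x in observed)
    p = ", ".join(str(x) for x in probs)
    expr = (
        f"res <- chisq.test(c({obs}), p=c({p})); "
        "cat(res$statistic, res$p.value)"
    )
    return ["Rscript", "-e", expr]


def run_chisq_in_r(observed, probs):
    """Run the test in R and parse the statistic and p-value from stdout."""
    cmd = build_rscript_command(observed, probs)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    statistic, p_value = (float(tok) for tok in completed.stdout.split())
    return statistic, p_value


if __name__ == "__main__":
    # Only attempt the call when Rscript is actually available.
    if shutil.which("Rscript") is None:
        print("Rscript not found on PATH; skipping the R call")
    else:
        print(run_chisq_in_r([20, 20, 0, 0], [0.25, 0.25, 0.25, 0.25]))
```

Keeping the command construction separate from the call makes it possible to inspect the R expression without R being installed.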
