
SAS Fisher test p values for large sample sizes

I'm trying to calculate some odds ratios and significance for something that can be put into a 2x2 table. The problem is that the Fisher test in SAS is taking a long time.

I already have the cell counts. I could calculate a chi-square if not for the fact that some of the sample sizes are extremely small. And yet some are extremely large, with cell sizes in the hundreds of thousands.

When I try to compute these in R, I have no problem. However, when I try to compute them in SAS, it either takes way too long or simply errors out with the message "Fishers exact test cannot be computed with sufficient precision for this sample size."

When I create a toy example (pull one instance from the dataset, and calculate it) it does calculate, but takes a long time.

Data Bob;
  Input targ $ status $ wt;
  Cards;
A c 4083
A d 111
B c 376494
B d 114231
;
Run;

Proc freq data = Bob;
  Weight wt;
  Tables targ*status;
  Exact Fisher;
Run;

What is going wrong here?

That's funny. SAS calculates the Fisher's exact test p-value the exact way, by enumerating the hypergeometric probability of every single table in which the odds ratio is at least as extreme as the observed one in favor of the alternative hypothesis. There's probably a way for me to calculate how many tables that is, but knowing that it's big enough to slow SAS down is enough.
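The number of candidate tables can in fact be counted directly: with both margins fixed, a 2x2 table is determined by its top-left cell, so the count is just the number of feasible values for that cell. A minimal sketch in Python (used here only for illustration, since the thread itself uses SAS and R; the function name `count_tables` is made up for this example):

```python
def count_tables(a, b, c, d):
    """Number of 2x2 tables sharing the margins of [[a, b], [c, d]]."""
    row1 = a + b            # first row total
    col1 = a + c            # first column total
    n = a + b + c + d       # grand total
    lo = max(0, row1 + col1 - n)   # smallest feasible top-left cell
    hi = min(row1, col1)           # largest feasible top-left cell
    return hi - lo + 1


# Toy table from the question: rows A/B, columns c/d.
print(count_tables(4083, 111, 376494, 114231))  # prints 4195
```

For the toy table the top-left cell can range from 0 to 4194, which gives 4195 tables with the observed margins.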

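R does not do this. For a 2x2 table, fisher.test evaluates the hypergeometric distribution directly over the feasible values of a single cell, which stays fast even when the counts are large. (Monte Carlo simulation in fisher.test is only available through simulate.p.value, and only for tables larger than 2x2.)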

tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
pc <- proc.time()
fisher.test(tab)
proc.time()-pc

gives us

> tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
> pc <- proc.time()
> fisher.test(tab)

        Fisher's Exact Test for Count Data

data:  tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  9.240311 13.606906
sample estimates:
odds ratio 
  11.16046 

> proc.time()-pc
   user  system elapsed 
   0.08    0.00    0.08 
> 

A fraction of a second.
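To see why a 2x2 exact test need not be slow, here is a rough Python sketch (not SAS's or R's actual implementation) that sums hypergeometric probabilities in log space using only the standard library. The function names and the tie tolerance `tol` are assumptions made for this illustration. The two-sided rule used is the common "sum every table no more probable than the observed one".

```python
from math import exp, lgamma, log


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_two_sided(a, b, c, d, tol=1e-7):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    log_denom = log_choose(n, col1)

    def log_prob(x):
        # Hypergeometric log-probability that the top-left cell equals x.
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_denom

    log_obs = log_prob(a)
    # Keep tables no more probable than the observed one (tol absorbs
    # floating-point noise so exact ties are not dropped).
    kept = [lp for lp in map(log_prob, range(lo, hi + 1)) if lp <= log_obs + tol]
    peak = max(kept)
    log_p = peak + log(sum(exp(lp - peak) for lp in kept))  # log-sum-exp
    return min(1.0, exp(log_p))


print(fisher_two_sided(4083, 111, 376494, 114231) < 2.2e-16)  # prints True
```

Working in log space avoids overflow in the factorials. The result for the toy table falls below the 2.2e-16 threshold that R reports.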

That said, the smart statistician would realize that, in tables such as yours, the normal approximation to the log odds ratio is fairly good, and so the Pearson chi-square test should give very similar results.
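As a rough check of that claim, the following Python sketch (standard library only; the function names are made up for this example) computes the uncorrected Pearson chi-square statistic for a 2x2 table and a Wald interval for the odds ratio based on the normal approximation to the log odds ratio. It assumes all four cells are nonzero.

```python
from math import erfc, exp, log, sqrt


def pearson_chi2(a, b, c, d):
    """Uncorrected Pearson chi-square (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value


def wald_odds_ratio(a, b, c, d, z=1.959964):
    """Sample odds ratio with a Wald interval (z defaults to the 95% level)."""
    log_or = log(a * d / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # std. error of the log odds ratio
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)


stat, p = pearson_chi2(4083, 111, 376494, 114231)
odds_ratio, lower, upper = wald_odds_ratio(4083, 111, 376494, 114231)
print(stat, p)
print(odds_ratio, lower, upper)
```

For the toy table the sample odds ratio is about 11.16 and the Wald interval is close to the exact interval in the R output. The Wald interval is symmetric on the log scale, so the upper bound comes out slightly lower than R's.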

People claim two very different advantages for Fisher's exact test. Some say it's good in small sample sizes. Others say it's good when cell counts are very small in specific margins of the table. The way I've come to understand it, Fisher's exact test is a nice alternative to the chi-square test when bootstrapped datasets are somewhat likely to generate tables with infinite odds ratios. Visually, you can imagine that this is where the normal approximation to the log odds ratio breaks down.
