Can someone tell me why R is not using the whole data.frame for this chisq.test?

Question

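While trying to build my own data.frame and run a quantitative analysis on it (such as chisq.test), I ran into a problem I cannot solve.

Background: I have summarised data received from two hospitals. Both measured the same categorical variable n times. In this case, it is how often healthcare-associated bacteria were found during a specific observation period.

In a table, the summarised data look as follows, where % is the percentage of all measurements made in that period.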

                                    n Hospital 1 (%)      n Hospital 2 (%)
Healthcare associated bacteria          829 (59.4)            578 (57.6)
Community associated bacteria           473 (33.9)            372 (37.1)
Contaminants                             94 (6.7)              53 (5.3)
Total                                  1396 (100.0)          1003 (100.0)

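Looking at the percentages, the proportions are clearly very similar, and you may wonder why I want to compare the two hospitals statistically at all. But I have other data where the proportions differ, so the aim of this question is:

How do I compare hospital 1 with hospital 2 for the categories measured?

Since the data are provided in summarised form and in array format, I decided to create a data.frame for each individual variable/category.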

hosp1 <- rep(c("Yes", "No"), times=c(829,567))
hosp2 <- rep(c("Yes", "No"), times=c(578,425))
all <- cbind(hosp1, c(hosp2,rep(NA, length(hosp1)-length(hosp2))))
all <- data.frame(all)
names(all)[2]<-"hosp2"
summary(all)

So far so good: the summary looks fine, so it seems I can now run chisq.test(). But this is where things get weird.

with(all, chisq.test(hosp1, hosp2, correct=F))

    Pearson's Chi-squared test

data:  hosp1 and hosp2
X-squared = 286.3087, df = 1, p-value < 2.2e-16

The result seems to indicate a significant difference. If you cross-tabulate the data, you can see that R summarises it in a very strange way:

with(all, table(hosp1, hosp2))

       No Yes
  No  174   0
  Yes 251 578

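So of course there is a statistically significant finding if the data are summarised this way, because one cell is tabulated as having no measurements at all. Why does this happen, and what can I do to correct it? Finally, instead of creating a separate data.frame for each category, is there an obvious way to loop over them? I can't think of one.

Thanks for your help!

Update, based on thelatemail's request for the raw data.frame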

dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community", 
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L, 
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L, 
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L, 
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L, 
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L, 
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type", 
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType", 
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType", 
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType", 
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA, 
-3L))

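Explanation: this data.frame is actually more complex than the table summarised above, because it also contains the specific site from which the bacteria were cultured (i.e. wound, blood culture, catheter etc.). So the table I am building looks as follows: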

                                                 All locations
                                n Hospital 1 (%)      n Hospital 2 (%)  p-val
Healthcare associated bacteria     829 (59.4)            578 (57.6)     0.39
Community associated bacteria      473 (33.9)            372 (37.1)     ...
Contaminants                       94 (6.7)              53 (5.3)       ...
Total                              1396 (100.0)          1003 (100.0)   -

The heading "All locations" would subsequently be replaced by wound, blood, urine, catheter etc.

Answer 1

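The answer to the question of how to get the p-values is fairly simple. You can get the other two p-values you are looking for with the same syntax @thelatemail used, as follows: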

#community (p = 0.1049)
chisq.test(cbind(c(473,923),c(372,631)),correct=FALSE)

#contaminants (p = 0.1443)
chisq.test(cbind(c(94,1302),c(53,950)),correct=FALSE)
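
As for why the cross-tab in the question looks odd: with(all, table(hosp1, hosp2)) pairs the i-th value of hosp1 with the i-th value of hosp2, as if each row were one subject measured at both hospitals, and the 393 rows padded with NA are dropped. The two hospitals are independent samples, so the hospital should be a column of its own. A minimal sketch of that layout, using the healthcare counts from the question (the names long and tab are only illustrative):

```r
# Counts taken from the question: healthcare-associated "Yes"/"No" per hospital.
hosp1 <- rep(c("Yes", "No"), times = c(829, 567))
hosp2 <- rep(c("Yes", "No"), times = c(578, 425))

# Long format: one row per measurement, with the hospital as its own column,
# so no NA padding is needed and nothing is paired by row position.
long <- data.frame(
  hospital = rep(c("hosp1", "hosp2"), times = c(length(hosp1), length(hosp2))),
  result   = c(hosp1, hosp2)
)

# 2 x 2 table of result by hospital; every measurement is counted.
tab <- with(long, table(result, hospital))
print(tab)

# Same test as above, without the continuity correction.
healthcare <- chisq.test(tab, correct = FALSE)
print(healthcare)
```

This should give a p-value of about 0.39, matching the value in the question's table for the healthcare row.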

You can get these answers more programmatically as follows:

out <- cbind(rowSums(SO_Example_v1[,2:6]),rowSums(SO_Example_v1[,7:11]))
chisq.test(rbind(out[1,],colSums(out[2:3,])),correct=FALSE)
chisq.test(rbind(out[2,],colSums(out[c(1,3),])),correct=FALSE)
chisq.test(rbind(out[3,],colSums(out[1:2,])),correct=FALSE)
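
Since the question also asks for a way to loop rather than repeat the call per category, the three tests above can be written as one sapply() call. This is only a sketch: the matrix counts holds the same totals as out (typed in here so the snippet runs on its own), and the names counts, totals and p_values are illustrative.

```r
# Per-type totals for each hospital (same numbers as `out` above).
counts <- rbind(
  Healthcare  = c(hosp1 = 829, hosp2 = 578),
  Community   = c(hosp1 = 473, hosp2 = 372),
  Contaminant = c(hosp1 = 94,  hosp2 = 53)
)
totals <- colSums(counts)

# For each type, test "this type" against "all other types" across hospitals.
p_values <- sapply(rownames(counts), function(type) {
  this_type <- counts[type, ]
  others    <- totals - this_type
  chisq.test(rbind(this_type, others), correct = FALSE)$p.value
})

print(round(p_values, 4))
```

The result is a named vector with one p-value per type, which can be bound onto the summary table as its p-val column.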

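Of course, we are now beyond the scope of SO, but given the nature of the data, perhaps a more pertinent question is whether the hospitals differ overall, which you can answer (from a frequentist perspective) with a chi-squared test based on all three types: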

chisq.test(out,correct=FALSE)
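
The same overall test could be repeated per culture site to fill in the "Wound", "Blood", etc. versions of the table. The sketch below assumes the column naming pattern from the dput output (hosp1_<Site>AssocType and hosp2_<Site>AssocType); the names sites and per_site are illustrative. Some sites have expected counts below 5, for which chisq.test() warns that the approximation may be poor, so the smallest expected count is reported next to each p-value and the warning is suppressed. For those sites an exact test such as fisher.test() may be worth considering.

```r
# Same data as the dput output above, rebuilt so the snippet is self-contained.
SO_Example_v1 <- data.frame(
  Type = c("Healthcare", "Community", "Contaminant"),
  hosp1_WoundAssocType = c(464L, 285L, 24L),
  hosp1_BloodAssocType = c(73L, 40L, 26L),
  hosp1_UrineAssocType = c(75L, 37L, 18L),
  hosp1_RespAssocType  = c(137L, 77L, 2L),
  hosp1_CathAssocType  = c(80L, 34L, 24L),
  hosp2_WoundAssocType = c(171L, 115L, 17L),
  hosp2_BloodAssocType = c(127L, 62L, 12L),
  hosp2_UrineAssocType = c(50L, 29L, 6L),
  hosp2_RespAssocType  = c(135L, 142L, 6L),
  hosp2_CathAssocType  = c(95L, 24L, 12L)
)

sites <- c("Wound", "Blood", "Urine", "Resp", "Cath")

# One 3 x 2 test (type by hospital) per culture site.
per_site <- t(sapply(sites, function(s) {
  cols <- paste0(c("hosp1_", "hosp2_"), s, "AssocType")
  m <- as.matrix(SO_Example_v1[, cols])
  # Warnings about small expected counts are suppressed here on purpose;
  # min_expected is returned so those sites can still be spotted.
  res <- suppressWarnings(chisq.test(m, correct = FALSE))
  c(p_value = res$p.value, min_expected = min(res$expected))
}))

print(per_site)
```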
