R: Testing each level of a factor without creating new variables
Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the code below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to run two-sample t-tests to assess the following:
I have used the dummies package to create a separate dummy variable for each level of the factor, and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
This is similar to, but different from, "R - How to perform the same operation on multiple variables", and nearly the same as the question "Apply t-test on many columns in a dataframe split by factor", but the solution to that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 10, replace = T)
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
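Since only one result is printed above, here is a minimal sketch of how all of them could be collected into one table. The name summaryTbl and its column names are illustrative choices, not part of the original answer, and the toy data is regenerated with 100 values for both variables so the block runs on its own. In each contingency table the first row holds the observations outside the level, so "prop 1" is the share of b == 0 among the other levels and "prop 2" is the share within the level.

```r
set.seed(1)
## regenerate toy data so this sketch is self-contained
a <- sample(letters[1:4], 100, replace = TRUE)
b <- sample(0:1, 100, replace = TRUE)
df <- data.frame(a, b)

## one contingency table and one test per level, as in the answer
cTbls <- lapply(unique(a), function(x) table(df$a == x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
names(results) <- unique(a)

## summaryTbl is a hypothetical name: one row per level of a
summaryTbl <- do.call(rbind, lapply(names(results), function(nm) {
  r <- results[[nm]]
  data.frame(level      = nm,
             prop_other = unname(r$estimate[1]),  # share of b == 0 outside the level
             prop_level = unname(r$estimate[2]),  # share of b == 0 inside the level
             X_squared  = unname(r$statistic),
             p_value    = r$p.value)
}))
summaryTbl
```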
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis in at least one of your tests can be dramatically higher than 5% (!):
set.seed(11)
sum(
replicate(1e4, {
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
df <- data.frame(a,b)
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
any(sapply(results, function(x) x$p.value < .05))
})
) / 1e4
[1] 0.1642
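One way to apply such a correction, sketched here under assumptions rather than taken from the original answer, is to pull the p-values out of the list of tests and pass them to p.adjust() from the stats package. Holm's method is used purely as an illustration; the names pRaw and pAdj are hypothetical.

```r
set.seed(1)
## regenerate toy data so this sketch is self-contained
a <- sample(letters[1:4], 100, replace = TRUE)
b <- sample(0:1, 100, replace = TRUE)

## one test per level of a, as above
cTbls <- lapply(unique(a), function(x) table(a == x, b))
results <- lapply(cTbls, prop.test, correct = FALSE)
names(results) <- unique(a)

## raw p-values, then Holm-adjusted p-values (illustrative choice of method)
pRaw <- sapply(results, function(x) x$p.value)
pAdj <- p.adjust(pRaw, method = "holm")
cbind(raw = pRaw, holm = pAdj)
```

The adjusted values are never smaller than the raw ones, so comparing pAdj against .05 is more conservative than comparing pRaw.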
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output from one of the t.test() calls you run above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
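## count the rows in each level of a, one column per level
dfNew <- df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
## rebuild a 0/1 indicator for each level from its count and test it against b
lapply(1:ncol(dfNew), function(x)
  t.test(c(rep(1, dfNew[[x]]), rep(0, length(b) - dfNew[[x]])), b))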
This saves you from typing t.test(foo, bar) repeatedly, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the values within each column, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new:
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new)-1), function(x)
t.test(new[,x], new[,ncol(new)]))
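The same per-level tests can also be sketched in base R without the dummies package, by building each 0/1 indicator on the fly with a comparison. This is a minimal sketch rather than the method of either answer; the names lvls and tests are illustrative, and it makes no claim about whether a t-test is the right statistical tool here.

```r
a <- c("a", "a", "a", "b", "b")
b <- c(0, 0, 1, 0, 1)
df <- data.frame(a, b)

## levels of a, in order of first appearance
lvls <- unique(as.character(df$a))

## for each level, compare its 0/1 indicator with b, keeping row order intact
tests <- lapply(lvls, function(lvl) t.test(as.numeric(df$a == lvl), df$b))
names(tests) <- lvls

## p-values for all levels at a glance
sapply(tests, function(x) x$p.value)
```

Because the indicator is computed from df$a directly, no new columns are added to the data frame, and the list is named by level so that tests$a and tests$b retrieve individual results.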