
How to calculate a paired t-test for one column in a data frame against all other columns in a single statement using R

I have a data frame with about 20 columns of data. The first column is logical: each value is either TRUE or FALSE.

I want to do a paired t.test between the first column and each of the rest, for a total of 19 tests, with the goal of ranking how well those other 19 columns can predict a TRUE value.

I'm hoping there is a way to loop through the columns while keeping the first column fixed the whole time.

The code below iterates through the columns left to right, but it does not keep the first column (A) fixed while incrementing the second column. Instead it compares A&B, B&C, C&D, etc.

Code:

tests = lapply(seq(1,(length(df)-1)),function(x){t.test(df[,x],df[,x+1])}) 

Instead, what I want is: A&B, A&C, A&D, etc.

I'm wondering if you really want to do an unpaired t-test. The reason I say this is that you described the first column as being TRUE or FALSE and then said your goal was to see how well the other columns could predict a TRUE value; in other words, whether the means of the 19 other columns are significantly different between the TRUE and FALSE groups. If you really wanted to do a paired t-test, then your data, as described, is not quite in the correct format, unless you want to compare x2 and x3, or x3 and x4, etc. In that case you'd use the following:

t.test(df$x2, df$x3, paired=TRUE)
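As a rough illustration of what paired=TRUE does, the sketch below uses made-up data (the data frame d and its values are assumptions for illustration, not the asker's data) and checks that a paired test matches a one-sample t-test on the row-wise differences:

```r
# Toy data: two numeric columns measured on the same 30 rows (illustrative only)
set.seed(1)
d <- data.frame(x2 = rnorm(30, mean = 50, sd = 10))
d$x3 <- d$x2 + rnorm(30, mean = 1, sd = 3)

# Paired t-test on the two columns
paired_res <- t.test(d$x2, d$x3, paired = TRUE)

# One-sample t-test on the row-wise differences
diff_res <- t.test(d$x2 - d$x3)

# Both calls should report the same t statistic and p-value
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
all.equal(paired_res$p.value, diff_res$p.value)
```

This is why a paired test needs two numeric columns measured on the same rows; a TRUE/FALSE column does not fit that layout.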

An unpaired t-test on the second column, with the first column as the grouping variable, can be done using the formula method. For example, to compare the means of the second column (x1) between the TRUE and FALSE groups, you can do:

t.test(x1 ~ group, data=df)

This is an unpaired, two-sample t-test. It can also be written slightly differently, for reasons which will become evident later.

t.test(df$x1 ~ df$group)
t.test(df[,2] ~ df[,1])
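As a quick sanity check, the following sketch (on a small made-up data frame d, which is an assumption for illustration) suggests that the three ways of writing the call run the same test and differ only in how the data are labelled in the output:

```r
# Toy data: a logical grouping column followed by one numeric column
set.seed(42)
d <- data.frame(group = rep(c(TRUE, FALSE), each = 20),
                x1 = rnorm(40, mean = 50, sd = 10))

r1 <- t.test(x1 ~ group, data = d)   # formula with a data argument
r2 <- t.test(d$x1 ~ d$group)         # formula with explicit columns
r3 <- t.test(d[, 2] ~ d[, 1])        # formula with column positions

# The statistics and p-values agree across the three forms
all.equal(r1$statistic, r2$statistic)
all.equal(r2$statistic, r3$statistic)
all.equal(r1$p.value, r3$p.value)

# Only the descriptive label differs
c(r1$data.name, r2$data.name, r3$data.name)
```

The positional form is the one that generalises, because the column index can be replaced by a loop variable.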

The latter version allows you to perform repeated tests using the lapply function, as mentioned.

tests <- lapply(2:20, function(x) t.test(df[,x] ~ df[,1]))

This returns an unnamed list, which can be named using the column names of the data frame.

names(tests) <- names(df)[2:20]
tests[1]

$x1

    Welch Two Sample t-test

data:  df[, x] by df[, 1]
t = -0.83536, df = 94.695, p-value = 0.4056
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.339658  2.176944
sample estimates:
mean in group FALSE  mean in group TRUE 
           48.46547            50.04683
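Since the stated goal is to rank the columns, one possible follow-up is to pull the p-value out of each test and sort. The sketch below is only an illustration on a smaller made-up data frame (d, with 5 numeric columns, is an assumption); note that a small p-value indicates a difference in group means, which is at best a rough proxy for predictive ability:

```r
# Toy data: logical group column plus m numeric columns (illustrative only)
set.seed(123)
n <- 100; m <- 5
d <- data.frame(group = sample(c(TRUE, FALSE), n, replace = TRUE),
                matrix(rnorm(n * m, mean = 50, sd = 10), ncol = m))
names(d)[-1] <- paste0("x", 1:m)

# One two-sample t-test per numeric column, grouped by column 1
tests <- lapply(2:ncol(d), function(i) t.test(d[, i] ~ d[, 1]))
names(tests) <- names(d)[-1]

# Extract the p-values into a named vector and sort ascending
p_values <- sapply(tests, function(tt) tt$p.value)
ranking <- sort(p_values)
ranking
```

The names of ranking give the columns ordered from smallest to largest p-value.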

You can also tidy this using the broom package.

lapply(tests,  broom::tidy)

$x1
# A tibble: 1 x 10
  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method      alternative
     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>       <chr>      
1    -1.58      48.5      50.0    -0.835   0.406      94.7    -5.34      2.18 Welch Two ~ two.sided  
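If a single table is more convenient than a list of one-row tibbles, the tidied results can be stacked. This is a sketch assuming the broom and dplyr packages are installed; the toy data frame d and the column name "variable" are choices made here for illustration:

```r
library(dplyr)
library(broom)

# Toy data: logical group column plus m numeric columns (illustrative only)
set.seed(123)
n <- 100; m <- 5
d <- data.frame(group = sample(c(TRUE, FALSE), n, replace = TRUE),
                matrix(rnorm(n * m, mean = 50, sd = 10), ncol = m))
names(d)[-1] <- paste0("x", 1:m)

tests <- lapply(2:ncol(d), function(i) t.test(d[, i] ~ d[, 1]))
names(tests) <- names(d)[-1]

# Stack the one-row tibbles; the list names go into the "variable" column
results <- bind_rows(lapply(tests, tidy), .id = "variable")

# Show the columns with the smallest p-values first
results %>%
  arrange(p.value) %>%
  select(variable, estimate, statistic, p.value)
```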

The dplyr version would be to use the do function instead of lapply, but first the data frame needs to be reshaped into a long format.

library(dplyr)
library(tidyr)
library(broom)  # provides tidy()

df %>% pivot_longer(cols=starts_with("x")) %>%
  group_by(name) %>%
  do(tidy(t.test(.$value ~ .$group)))

# A tibble: 19 x 11
# Groups:   name [19]
   name  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
   <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
 1 x1     -1.58        48.5      50.0   -0.835   0.406       94.7   -5.34      2.18 
 2 x10    -0.377       49.3      49.6   -0.194   0.847       95.1   -4.24      3.49 
 3 x11     4.49        53.1      48.6    2.08    0.0400      97.8    0.209     8.77 
 4 x12    -1.05        51.1      52.2   -0.450   0.654       88.9   -5.70      3.59 
 5 x13    -0.743       49.4      50.1   -0.360   0.720       96.8   -4.84      3.35 
 6 x14     0.908       51.5      50.6    0.487   0.627       93.3   -2.79      4.61 
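The dplyr documentation lists do as superseded, so a variant using group_modify may be worth considering. The following is a sketch rather than a drop-in replacement, again on a smaller made-up data frame (d is an assumption for illustration):

```r
library(dplyr)
library(tidyr)
library(broom)

# Toy data: logical group column plus m numeric columns (illustrative only)
set.seed(123)
n <- 100; m <- 5
d <- data.frame(group = sample(c(TRUE, FALSE), n, replace = TRUE),
                matrix(rnorm(n * m, mean = 50, sd = 10), ncol = m))
names(d)[-1] <- paste0("x", 1:m)

res <- d %>%
  pivot_longer(cols = starts_with("x")) %>%   # long format: group, name, value
  group_by(name) %>%
  # .x is the subset of rows for one value of name
  group_modify(~ tidy(t.test(value ~ group, data = .x))) %>%
  ungroup()

res
```

Each row of res corresponds to one original column, with the same fields that tidy returns for a single test.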

Data:

set.seed(123)
n <- 100; m <- 19  # number of subjects (rows) and number of "x" columns
X <- data.frame(matrix(rnorm(n*m, mean=50, sd=10), byrow=TRUE, ncol=m))
colnames(X) <- paste0("x", 1:m)
df <- data.frame(group=sample(c(TRUE, FALSE), size=n, replace=TRUE), X)
str(df)

'data.frame':   100 obs. of  20 variables:
 $ group: logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ x1   : num  44.4 45.3 46.9 55.8 47.2 ...
 $ x2   : num  47.7 39.3 46.2 51.2 37.8 ...
 $ x3   : num  65.6 47.8 43.1 52.2 51.8 ...
 $ x4   : num  50.7 39.7 47.9 53.8 48.6 ...
 $ x5   : num  51.3 42.7 37.3 45 50.1 ...
 $ x6   : num  67.2 43.7 71.7 46.7 53.9 ...

As the comments note, this is a two-sample t-test, not a paired t-test, unless you add paired=TRUE, but it holds the first column fixed and runs through the rest:

tests <- lapply(seq(2, length(df)), function(x){t.test(df[,1], df[,x])})

If you are using the first column to define two groups, then it would be as follows:

tests <- lapply(seq(2, length(df)), function(x){t.test(df[,x]~df[,1])})

This would be a two-sample t-test with each column split into the two groups defined by column 1.
