R - find matching columns in two data frames for t-test statistics (R beginner)

I would like to perform a two-sample t-test on my data in R. Given two high-dimensional data frames, I need to loop over the matching columns (matched by the strings returned by colnames()) and run the test on each column pair, one column from df1 and one from df2, using all rows. The problem is that the columns of the two data frames are not in the same order, i.e. col1 from df1 doesn't match col1 from df2, and df2 has additional columns that don't exist in df1. I've never used R for such tasks, and I wonder if there is a fast and handy way to find the matching column pairs in the data frames for the t-test.

I thought about for-loops, but I think they would be very inefficient for large data frames.

Thank you in advance for any help.

EDITED: Two small example data frames, df1 and df2

df1:

"Row\Column"    "A2"    "A1"    "A4"    "A3"
"id_1"           10      20      0       40
"id_2"           5       15      25      35
"id_3"           8       0       12      16
"id_4"           17      25      0       40

df2:

"Row\Column"    "A3"    "A8"    "A5"    "A6"    "A1"    "A7"    "A4"    "A2"
"id_1"           0       2       0       4       0       1       2       3
"id_2"           1       5       8       3       4       5       6       7
"id_3"           2       10      6       9       8       9       10      11
"id_4"           7       2       10      2       55      0       0       0
"id_5"           0       1       0       0       9       1       3       4
"id_6"           8       0       1       2       7       2       3       0  

Matching columns are simply the columns in df1 whose names also appear among the column names of df2. For example, two matching columns in df1 and df2 are "A1" and "A1", "A2" and "A2", and so on.
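For reproducibility, the two example tables could be entered in R roughly as follows. This is only a sketch: it assumes the id_* labels are row names rather than a data column, and `common` is an illustrative variable name that does not appear in the original post.

```r
# Values copied from the example tables; the id_* labels become row names (an assumption)
df1 <- data.frame(
  A2 = c(10, 5, 8, 17),
  A1 = c(20, 15, 0, 25),
  A4 = c(0, 25, 12, 0),
  A3 = c(40, 35, 16, 40),
  row.names = paste0("id_", 1:4)
)
df2 <- data.frame(
  A3 = c(0, 1, 2, 7, 0, 8),
  A8 = c(2, 5, 10, 2, 1, 0),
  A5 = c(0, 8, 6, 10, 0, 1),
  A6 = c(4, 3, 9, 2, 0, 2),
  A1 = c(0, 4, 8, 55, 9, 7),
  A7 = c(1, 5, 9, 0, 1, 2),
  A4 = c(2, 6, 10, 0, 3, 3),
  A2 = c(3, 7, 11, 0, 4, 0),
  row.names = paste0("id_", 1:6)
)

# Column names present in both data frames
common <- intersect(names(df1), names(df2))
common
```

Note that the two frames have different numbers of rows (4 and 6), which is fine for an unpaired two-sample t-test.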

mapply is the function you are looking for.
If the columns of your data.frames matched up, you could simply use

mapply(t.test, df1, df2)

However, since they do not, you need to identify which column of df1 goes with which column of df2. Fortunately, indexing in R is flexible: if you index with a vector (a collection) of column names, you get back those columns in the order given. This makes life easy.

# find the matching names
## this will give you those names in df1 that are also in df2
## and *only* such names (ie, strict intersect)
matchingNames <- names(df1)[names(df1) %in% names(df2)]

Notice that matchingNames has a definite order. Now look at what happens when you use the matchingNames vector as an index to the columns of df1 and of df2 (note the column order as well):

df1[, matchingNames]
df2[, matchingNames]
matchingNames    
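To make the reordering concrete, here is a minimal sketch on two hypothetical toy frames (not the asker's data). Indexing both frames with the same name vector yields columns in the same order, whatever their original positions were.

```r
# Hypothetical toy frames; the columns are deliberately in different orders
df1 <- data.frame(A2 = 1:3, A1 = 4:6)
df2 <- data.frame(A1 = 7:9, A5 = 10:12, A2 = 13:15)

matchingNames <- names(df1)[names(df1) %in% names(df2)]

names(df1[, matchingNames])  # "A2" "A1"
names(df2[, matchingNames])  # "A2" "A1"

# With a single matching name, `[` would drop to a plain vector;
# drop = FALSE keeps the result a data.frame
class(df2[, matchingNames[1], drop = FALSE])  # "data.frame"
```

If there is any chance that only one column matches, adding drop = FALSE to the indexing keeps the later mapply call iterating over columns rather than over the elements of a vector.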

Therefore, we now have two data.frames with properly matched columns, which we can mapply over.

## mapply will apply a function to each data.frame, one pair of columns at a time

## The first argument to `mapply` is your function, in this example, `t.test`
## The second and third arguments are the data.frames (or lists) to simultaneously iterate over
mapply(t.test, df1[, matchingNames], df2[, matchingNames])
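By default mapply tries to simplify its result, which for t.test gives a matrix of list components that can be awkward to read. One possible variation, sketched here on hypothetical random data, is to pass SIMPLIFY = FALSE and then pull out just the p-values. The names `tests` and `pvals` are illustrative only.

```r
# Hypothetical toy data; set.seed only makes the sketch repeatable
set.seed(1)
df1 <- data.frame(A2 = rnorm(10), A1 = rnorm(10))
df2 <- data.frame(A1 = rnorm(12), A5 = rnorm(12), A2 = rnorm(12))

matchingNames <- names(df1)[names(df1) %in% names(df2)]

# SIMPLIFY = FALSE returns a named list of "htest" objects, one per column pair
tests <- mapply(t.test,
                df1[, matchingNames, drop = FALSE],
                df2[, matchingNames, drop = FALSE],
                SIMPLIFY = FALSE)

# Extract one number from each test; vapply checks that each result is a single numeric
pvals <- vapply(tests, function(tt) tt$p.value, numeric(1))
pvals
```

The full test object for any column is still available afterwards, for example as tests[["A1"]].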

It is very hard to give you a good answer without a reproducible example. You also need to define what you mean by matching columns.

Here is an example of two data sets that have some column names in common (built as matrices here; the same column indexing works for data.frames).

df1 <- matrix(sample(1:100, 5*5, replace=TRUE), ncol=5, nrow=5)
df2 <- matrix(sample(1:100, 5*8, replace=TRUE), ncol=8, nrow=5)
colnames(df1) <- letters[6:10]
colnames(df2) <- rev(letters[1:8])

Then I define a wrapper around t.test, for example to limit the output to the degrees of freedom and the p-value.

f <- function(x, y) {
  test <- t.test(x, y)
  data.frame(df   = test$parameter,
             pval = test$p.value)
}

Then, using sapply, I iterate over the common columns, which I get using intersect:

sapply(intersect(colnames(df1),colnames(df2)), 
                 function(x) f(df1[,x], df2[,x]))

     f         g         h        
df   7.85416   6.800044  7.508915 
pval 0.5792354 0.2225824 0.4392895
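If a table with one row per column pair is easier to work with than the matrix above, a variation along these lines might help. It is a sketch, not part of the original answer: the seed, the `common` and `res` names, and the extra `column` field are all assumptions made for illustration.

```r
# Same kind of random example data as above; the seed is arbitrary
set.seed(42)
df1 <- matrix(sample(1:100, 5 * 5, replace = TRUE), ncol = 5)
df2 <- matrix(sample(1:100, 5 * 8, replace = TRUE), ncol = 8)
colnames(df1) <- letters[6:10]
colnames(df2) <- rev(letters[1:8])

common <- intersect(colnames(df1), colnames(df2))

# Build one single-row data.frame per shared column, then stack the rows
res <- do.call(rbind, lapply(common, function(x) {
  test <- t.test(df1[, x], df2[, x])
  data.frame(column = x,
             df     = unname(test$parameter),
             pval   = test$p.value,
             stringsAsFactors = FALSE)
}))
res
```

The result is an ordinary data.frame, so it can be sorted or filtered by pval like any other table.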


 