
Combining columns from multiple data.frames with a loop

I have 600 tab-delimited .txt files that look like this:

                       barcode gene.symbol    value
1 TCGA-61-2610-02A-01R-1141-07      15E1.2 -0.78175
2 TCGA-61-2610-02A-01R-1141-07      2'-PDE  -1.0155
3 TCGA-61-2610-02A-01R-1141-07         7A5    0.029
4 TCGA-61-2610-02A-01R-1141-07        A1BG  0.96575
5 TCGA-61-2610-02A-01R-1141-07       A2BP1   -0.301
6 TCGA-61-2610-02A-01R-1141-07         A2M -2.21575

I want to combine all 600 files into one data frame in which gene.symbol supplies the row names and each file's values form a column named after the first 12 characters of its barcode. After searching through SO, I think I have a loop that does this, with one caveat. Here's what I have (I'm still learning R, so the code might look very crude):

n = 600
df <- read.delim(file = "agilent1.txt")
df.tmp <- data.frame()
# name the value column after this file's (single) barcode
colnames(df) = c("barcode", "gene.symbol", unique(as.character(df$barcode)))
df = df[2:3]

Once df holds the first file's values, the loop adds the value column from each of the other files (the files are named agilent1.txt, agilent2.txt, etc.):

for (i in 2:n) {
  df.tmp <- read.delim(file = paste("agilent", i, ".txt", sep = ""))
  # first 12 characters of this file's barcode
  a <- unique(as.character(df.tmp$barcode))
  a <- substr(a, 1, 12)
  df <- cbind(df, a = df.tmp$value)
}

Everything works, BUT in the cbind command, a = df.tmp$value names the column a (which makes sense), whereas I want the value of a to be the column name.

  gene.symbol                 TCGA-61-2614                   a                  a                  a        a
1      15E1.2                      0.80475            -0.47375           -0.26825           -0.13425 -0.78175
2      2'-PDE                   -0.1348125          -0.1565625            0.19475         -0.3819375  -1.0155
3         7A5                       2.2735              2.4405              0.902              1.248    0.029
4        A1BG            0.817166666666667 -0.0471666666666667            -0.1005 -0.283333333333333  0.96575
5       A2BP1           -0.811333333333333   -1.02566666666667 -0.494833333333333             -0.948   -0.301
6         A2M                       -0.719            -1.00575           -1.07275              0.517 -2.21575

It sounds so easy in my mind, but I can't seem to find the answer. Any help would be greatly appreciated.

Cheers,

Ahmet

You don't need an explicit loop if you use the reshape package. Here is a short snippet that should do exactly what you are seeking (if I understand correctly):

library(plyr); library(reshape)
files = paste('agilent', 1:600, '.txt', sep = "") # create list of files
dfs   = ldply(files, read.delim)                  # read files into one data frame
cast(dfs, gene.symbol ~ barcode)                  # reshape to required format
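
For readers without plyr, the read-and-stack step can probably be done in base R as well. The following is only a sketch under stated assumptions: it writes two tiny demo files with made-up values to a temporary directory (the real files would be agilent1.txt to agilent600.txt), then reads and stacks them with lapply() and do.call(rbind, ...).

```r
# Create two small tab-delimited demo files (values are invented).
tmp <- tempfile("agilent_demo")
dir.create(tmp)
for (i in 1:2) {
  demo <- data.frame(
    barcode = sprintf("TCGA-61-261%d-02A-01R-1141-07", i),
    gene.symbol = c("2'-PDE", "A1BG"),
    value = c(i, -i)
  )
  write.table(demo, file.path(tmp, paste0("agilent", i, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Read every file into a list, then stack the list into one long data frame.
files <- file.path(tmp, paste0("agilent", 1:2, ".txt"))
dfs <- do.call(rbind, lapply(files, read.delim, stringsAsFactors = FALSE))
```

The result is in the same long format (barcode, gene.symbol, value) that the reshaping step expects. Reading into a list first also avoids growing a data frame one file at a time.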

I suggest you read the 600 data files and put them together:

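myfiles <- list.files()
mydat <- c()
for(i in seq_along(myfiles)) {
    # tab-delimited input; quote = "" keeps apostrophes such as 2'-PDE literal
    temp <- read.table(myfiles[i], header = TRUE, sep = "\t", quote = "")
    mydat <- rbind(mydat, temp)
}

library(reshape2)
# reshape2 provides dcast(); cast() belongs to the older reshape package
newdat <- dcast(mydat, gene.symbol ~ barcode, value.var = "value")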

If you want the column names to have only 12 characters, you could follow joran's answer.
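
Another option might be to shorten the barcodes before reshaping, so the wide columns come out with 12-character names directly. The sketch below uses made-up values and base R's reshape() instead of the reshape or reshape2 packages, purely so that it runs without extra installs; the object names are illustrative.

```r
# Toy long-format data standing in for the stacked files (values are invented).
mydat <- data.frame(
  barcode = rep(c("TCGA-61-2610-02A-01R-1141-07",
                  "TCGA-61-2614-02A-01R-1141-07"), each = 3),
  gene.symbol = rep(c("A1BG", "A2BP1", "A2M"), times = 2),
  value = c(0.97, -0.30, -2.22, 0.82, -0.81, -0.72),
  stringsAsFactors = FALSE
)

# Keep only the first 12 characters of each barcode.
mydat$barcode <- substr(mydat$barcode, 1, 12)

# Long to wide: one row per gene, one value column per barcode.
wide <- reshape(mydat, idvar = "gene.symbol", timevar = "barcode",
                direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix.
names(wide) <- sub("^value\\.", "", names(wide))
```

One caveat: if two files share the same first 12 barcode characters, their columns would collide after truncation, so it is worth checking that the shortened barcodes are still unique.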

You could always just set the column name in a separate step at the end of the loop:

df <- cbind(df, a = df.tmp$value)
colnames(df)[i+1] <- a
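
To see why the rename is needed, and one possible alternative to it, here is a small self-contained sketch with made-up values. Inside cbind(), `a = ...` is an argument name, so the new column is literally called "a"; assigning with `[[<-` evaluates `a` and uses its value as the column name instead.

```r
df <- data.frame(gene.symbol = c("A1BG", "A2M"), stringsAsFactors = FALSE)
new.values <- c(0.96575, -2.21575)
a <- substr("TCGA-61-2610-02A-01R-1141-07", 1, 12)

# `a` here is just an argument name, so the column is called "a".
wrong <- cbind(df, a = new.values)

# `[[<-` looks up the value of `a`, so the column is named after the barcode.
df[[a]] <- new.values
```

Inside the loop, `df[[a]] <- df.tmp$value` would then replace both the cbind() call and the rename, assuming every file yields a distinct 12-character barcode.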
