
for loop with dplyr

I have a bunch of files that I read in manually, like this:

    # gel above replicates

    A_gel <- read.delim("XL1_3_S35_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")

    B_gel <- read.delim("XL2_3_S37_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
    
    C_gel <- read.delim("XL2_3_S37_L004_R1_001_w_XL1_3_S35_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
    
    D_gel <- read.delim("XL1_3_S35_L004_R1_001_w_XL1_3_S35_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
    
    # gel below replicates
    
    A_below_gel <- read.delim("XL1_3b_S36_L004_R1_001_w_XL2_3b_S38_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
    
    B_below_gel <- read.delim("XL2_3b_S38_L004_R1_001_w_XL2_3b_S38_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
    
    C_below_gel <- read.delim("XL2_3b_S38_L004_R1_001_w_XL1_3b_S36_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
    
    D_below_gel <- read.delim("XL1_3b_S36_L004_R1_001_w_XL1_3b_S36_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")

I would like to rename all the columns of these files and sort each one by the Start column, with something like this:

colnames(A_gel) <- c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")
    
A_gel <- A_gel %>%
      arrange(A_gel$Start)

Instead of repeating this for every file, I would like to use a for loop over all the files in R.

Never create multiple variables following the same pattern. The properly supported solution for this general problem is to use lists (i.e. instead of having variables A_gel, B_gel, …, you have one variable gel, which is a list that contains your individual data.frames; you can also assign names to these individual items, though in your case that doesn't seem necessary).
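As a minimal sketch of the list idea, and of the for loop the question asks about, the following uses two small made-up tables in place of the real files. The item names `A` and `B` and the values are assumptions for illustration only.

```r
# Hypothetical toy tables standing in for the data read from the real files
gel <- list(
  A = data.frame(Start = c(30, 10, 20)),
  B = data.frame(Start = c(5, 50, 15))
)

# One loop handles every table; no A_gel, B_gel, ... variables are needed
for (name in names(gel)) {
  tbl <- gel[[name]]
  # Reorder the rows by Start; drop = FALSE keeps the result a data.frame
  gel[[name]] <- tbl[order(tbl$Start), , drop = FALSE]
}

gel$A$Start
```

Adding a ninth or tenth file then means adding an item to the list, not writing another block of copy-pasted code.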

Then you can use e.g. lapply to run over your file paths and read the data of the different files into that list:

gel = lapply(gel_filenames, read.delim)
below_gel = lapply(below_gel_filenames, read.delim)
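The vector `gel_filenames` is not defined above. One way to build it, sketched here under the assumption that the files sit in a single directory and share a recognisable suffix, is `list.files`. The directory, the file names `sample1`/`sample2` and the pattern are hypothetical; a temporary directory with two tiny stand-in files is used so that the sketch runs anywhere.

```r
# Hypothetical directory; replace with the folder that holds the real files
data_dir <- tempfile("bed_demo_")
dir.create(data_dir)

# Write two tiny tab-separated stand-in files with a header line
for (f in c("sample1.compressed.bed", "sample2.compressed.bed")) {
  write.table(data.frame(a = 1:2, b = 3:4), file.path(data_dir, f),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Collect the paths by pattern instead of typing each one by hand
gel_filenames <- list.files(data_dir, pattern = "\\.compressed\\.bed$",
                            full.names = TRUE)

# One data.frame per file, all held in a single list
gel <- lapply(gel_filenames, read.delim)
length(gel)
```

For the files in the question, separate patterns would probably be needed to tell the "gel" files from the "below gel" ones; a plain character vector of the paths works just as well.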

… and likewise you can put your arrangement code into a function and apply that, changing the above to:

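library(dplyr)  # provides %>% and arrange()

# Read one file, rename its six columns and sort the rows by Start
read_bed = function (filename) {
    read.delim(filename) %>%
        setNames(c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")) %>%
        arrange(Start)
}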

# …

gel = lapply(gel_filenames, read_bed)
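To try the `read_bed` approach end to end without the original data, here is a sketch that writes a small six-column stand-in file and reads it back. The file contents are made up for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

read_bed <- function(filename) {
  read.delim(filename) %>%
    setNames(c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")) %>%
    arrange(Start)
}

# Hypothetical stand-in file with made-up values, deliberately unsorted
demo_file <- tempfile(fileext = ".bed")
demo <- data.frame(
  chrom  = c("chr1", "chr1", "chr2"),
  start  = c(300L, 100L, 200L),
  end    = c(350L, 150L, 250L),
  logp   = c(2.1, 3.4, 1.7),
  logfc  = c(0.5, 1.2, -0.3),
  strand = c("+", "-", "+")
)
write.table(demo, demo_file, sep = "\t", quote = FALSE, row.names = FALSE)

# The same function is applied to every path in the vector
gel <- lapply(c(demo_file, demo_file), read_bed)
names(gel[[1]])
gel[[1]]$Start
```

Note that `read.delim` treats the first line as a header by default. If the real files have no header line, which is common for BED files, `read.delim(filename, header = FALSE)` would be needed instead; that depends on the files, which are not shown here.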

Better yet, use purrr::map_dfr to read all data into a single combined table:

gel = gel_filenames %>%
    setNames(., .) %>%
    map_dfr(read_bed, .id = 'Filename')

(The setNames(., .) step is necessary since map_dfr fills the added ID column with the names of the input vector.)
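In purrr 1.0.0 and later, `map_dfr` is documented as superseded in favour of `map` followed by `list_rbind`. A sketch of the equivalent, assuming such a purrr version is installed; `fake_read` is a hypothetical stand-in for `read_bed` so that the example needs no files:

```r
library(purrr)

# Hypothetical stand-in for read_bed(): returns a one-row table per "file"
fake_read <- function(filename) data.frame(Start = nchar(filename))

gel_filenames <- c("a.bed", "bb.bed")

# set_names() with no second argument names each element after its own value,
# playing the same role as setNames(., .)
named_paths <- set_names(gel_filenames)

# Read every file into a list, then stack the tables; the list names
# end up in the new Filename column
tables <- map(named_paths, fake_read)
gel <- list_rbind(tables, names_to = "Filename")
gel
```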

This will create one master table for the “GEL” data, which has an added ID column for the original filename (you'll probably want to extract just some ID from that, using tidyr::extract).
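As a sketch of that last step, the following pulls the two sample prefixes on either side of `_w_` out of the filename. The column names `Left` and `Right`, the regular expression and the `Start` values are assumptions; which part of the filename is the meaningful ID depends on the naming scheme.

```r
library(tidyr)

# Hypothetical combined table with the Filename ID column described above
gel <- data.frame(
  Filename = c(
    "XL1_3_S35_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed",
    "XL2_3_S37_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed"
  ),
  Start = c(100L, 200L)
)

# Each capture group in the regex becomes one new column;
# the Filename column is dropped by default
gel <- tidyr::extract(
  gel, Filename,
  into = c("Left", "Right"),
  regex = "^(XL\\d+_3b?)_.*_w_(XL\\d+_3b?)_"
)
gel
```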
