简体   繁体   中英

Subset rows in multiple tsv files based on column values; combine results in one df; add one column for source file name

I have >300 tsv files with the same dimensions. Column headers are the same for all files. I would like to filter each of these files to include only rows which satisfy criteria specific to one column. For example, I would want to include any rows where 'ColumnA' = 'gene1' or 'gene2' or 'gene3'. I would also like to combine the subsetted data from all files into 1 df and add an additional column so that I know which file each row came from. I am fairly new to R so I don't have experience with writing script to be applied to multiple files.
I have used the filter function before but how can this be applied to more than 1 file? Should I merge the data in all files and then filter? If so, then how do I add an additional column to each file for the file name before merging all of the files? I read about for-loops but when I tried this I got an error message: "Error in file(file, "rt"): invalid 'description' argument". Any help would be greatly appreciated!

for(i in M_Data){
  +   dat <- read_tsv (paste("/Users/EM/Desktop/Files",i,sep="\t"))
  +   dat$Source <- i
  +   M_Data_Merge <- rbind(M_Data_Merge,dat)
  }

data.table is your friend for this task:

 library(data.table)

 fileNames = dir("path/to/files", pattern = "*.tsv", full.names = TRUE)
 listOfTables = lapply(fileNames, fread)

 # convert all tables to a big df
 df = rbindlist(listOfTables)

 # filter only the rows you want:
 newDf = df[ColumnA %chin% c("gene1", "gene2", "gene3"), ]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM