Subset rows in multiple tsv files based on column values; combine results in one df; add one column for source file name

Question

I have >300 tsv files with the same dimensions. Column headers are the same for all files. I would like to filter each of these files to include only rows which satisfy criteria specific to one column. For example, I would want to include any rows where 'ColumnA' = 'gene1' or 'gene2' or 'gene3'. I would also like to combine the subsetted data from all files into 1 df and add an additional column so that I know which file each row came from. I am fairly new to R so I don't have experience with writing script to be applied to multiple files.
I have used the filter function before but how can this be applied to more than 1 file? Should I merge the data in all files and then filter? If so, then how do I add an additional column to each file for the file name before merging all of the files? I read about for-loops but when I tried this I got an error message: "Error in file(file, "rt"): invalid 'description' argument". Any help would be greatly appreciated!

for(i in M_Data){
  +   dat <- read_tsv (paste("/Users/EM/Desktop/Files",i,sep="\t"))
  +   dat$Source <- i
  +   M_Data_Merge <- rbind(M_Data_Merge,dat)
  }

Answer 1

data.table is your friend for this task:

 library(data.table)

 fileNames = dir("path/to/files", pattern = "*.tsv", full.names = TRUE)
 listOfTables = lapply(fileNames, fread)

 # convert all tables to a big df
 df = rbindlist(listOfTables)

 # filter only the rows you want:
 newDf = df[ColumnA %chin% c("gene1", "gene2", "gene3"), ]

Subset rows in multiple tsv files based on column values; combine results in one df; add one column for source file name

Question

1 answers

solution1
0 2022-12-31 01:48:52

Subset rows in multiple tsv files based on column values; combine results in one df; add one column for source file name

Question

1 answers

solution1 0 2022-12-31 01:48:52

solution1
0 2022-12-31 01:48:52