
R for loop to extract info from a file and add it into tibble?

I am not great with tidyverse so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.

I want the row names to start with the file IDs, which I did manage to create:

filelist <- list.files(pattern="\\.txt$") # Gives me the filenames in the current directory.
# The filenames are something like AA1230.report.txt for example

file_ID <- trimws(filelist, whitespace="\\..*") # Gives me the ID which is before the "report.txt"

metadata <- as_tibble(file_ID[1:181]) # create a tibble with the IDs in a column called value, one row per file (tibbles have no row names)
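
As an aside, the same ID extraction can be written with sub(), which may be handy on R versions older than 3.6.0, where trimws() has no whitespace argument. This is only a minimal base-R sketch; the second filename is an assumption built from one of the IDs listed further down.

```r
# Stand-in filenames; in practice these would come from list.files()
filelist <- c("AA1230.report.txt", "AB0566.report.txt")

# Drop everything from the first dot onwards to keep only the sample ID
file_ID <- sub("\\..*$", "", filelist)

# One row per file, with the IDs in an ordinary column rather than row names
metadata <- data.frame(file_ID = file_ID, stringsAsFactors = FALSE)
print(metadata)
```

Keeping the ID in a regular column makes it straightforward to join per-file results onto this table later.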

These report files contain information on species and abundance (Kraken report files, for those familiar with Kraken), and all I need is to extract the number of reads for each domain. For a single file I can easily look up the domains and the number of reads that fall into each one using something like:

sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))

sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity

sample_data %>% filter(Rank=="D") # D for domain

This gives me a clear output such as:

  Percentage Num_reads_root Num_reads_taxon Rank  NCBI_ID Name     
       <dbl>          <int>           <int> <fct>   <int> <fct>    
1      75.9           60533              28 D           2 Bacteria 
2       0.48            386               0 D        2759 Eukaryota
3       0.01              4               0 D        2157 Archaea  
4       0.02             19               0 D       10239 Viruses  

Now, I want to just grab the info in the second column and final column and save this info into my tibble so that I can get something like:

> metadata
value     Bacteria_Counts    Eukaryota_Counts    Viruses_Counts     Archaea_Counts
<chr>     <int>              <int>               <int>               <int>
 1 AA1230  60533             386                 19                   4 
 2 AB0566
 3 AA1231
 4 AB0567
 5 BC1148
 6 AW0001
 7 AW0002
 8 BB1121
 9 BC0001
10 BC0002
....with 171 more rows

I'm just having trouble coming up with a for loop to create these sample_data outputs, then from that, extract the info and place into a tibble. I guess my first loop should create these sample_data outputs so something like:

for (file in filelist) {
  # get_domains: read `file` and keep only the rows where Rank == "D"
}

Then another loop would extract that info from the loop above and insert it into my metadata tibble. Any suggestions? Thank you so much. PS: if regular data frames in R are better for this, let me know. I only recently learned that tidyverse is a better way to organize data frames in R, and I still have more to learn about it.

You could also do:

library(tidyverse)
filelist <- list.files(pattern = "\\.txt$")
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")

set_names(filelist, filelist) %>%
  # the reports are tab-separated; read_table() splits on any whitespace,
  # so taxon names containing spaces would be broken up
  map_dfr(read_tsv, col_names = nms, .id = "file_ID") %>%
  filter(Rank == "D") %>%
  select(file_ID, Name, Num_reads_root) %>%
  pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>%
  mutate(file_ID = str_remove(file_ID, "\\..*")) # "AA1230.report.txt" -> "AA1230"
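
To get column names like Bacteria_Counts, as in the output the question asks for, pivot_wider() accepts a names_glue template. The following is a hedged, self-contained sketch rather than a drop-in solution: it writes two toy reports to a temporary directory (the first reuses the numbers shown in the question, the second file and its numbers are invented), and read_domains() is a hypothetical helper name. Filling missing domains with 0 is a design choice, on the assumption that a domain absent from a report received no reads.

```r
library(readr)
library(dplyr)
library(tidyr)
library(purrr)

nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")

# Toy reports in a temporary directory. The first reuses the numbers shown in
# the question; the second file and its numbers are invented for illustration.
demo_dir <- tempfile("kraken_demo_")
dir.create(demo_dir)
writeLines(c(
  "75.90\t60533\t28\tD\t2\t  Bacteria",
  "0.48\t386\t0\tD\t2759\t  Eukaryota",
  "0.01\t4\t0\tD\t2157\t  Archaea",
  "0.02\t19\t0\tD\t10239\t  Viruses"
), file.path(demo_dir, "AA1230.report.txt"))
writeLines(c(
  "90.00\t900\t10\tD\t2\t  Bacteria",
  "50.00\t500\t500\tS\t562\t      Escherichia coli"
), file.path(demo_dir, "AB0566.report.txt"))

files <- list.files(demo_dir, pattern = "\\.report\\.txt$", full.names = TRUE)

# Hypothetical helper: read one report and keep the domain-level read counts
read_domains <- function(path) {
  read_tsv(path, col_names = nms, col_types = "diicic", trim_ws = TRUE) %>%
    filter(Rank == "D") %>%
    select(Name, Num_reads_root)
}

counts <- files %>%
  set_names(sub("\\..*$", "", basename(files))) %>%  # name each path by its sample ID
  map(read_domains) %>%
  bind_rows(.id = "file_ID") %>%                     # list names become the file_ID column
  pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root,
              names_glue = "{Name}_Counts", values_fill = 0L)

print(counts)
```

Fixing col_types up front avoids type guessing differing between files, which can otherwise make bind_rows() fail on mismatched column types.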

I've found that a for loop is sometimes nice because it saves all the progress along the way in case you hit an error. You can then find the problem file and debug it, or wrap the read in try() and raise a warning() instead of stopping.

library(tidyverse)
filelist <- list.files(pattern = "\\.txt$") # list the report files

tmp_list <- list()
for (i in seq_along(filelist)) {
  # The reports are tab-separated with no header row, so supply the column names directly
  my_table <- read_tsv(filelist[i],
                       col_names = c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")) %>%
    filter(Rank == "D") %>%
    mutate(file_ID = trimws(filelist[i], whitespace = "\\..*")) %>% # use = here; <- does not create a column named file_ID
    select(file_ID, everything())
  tmp_list[[i]] <- my_table
}
out <- bind_rows(tmp_list)
out
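
The try()/warning() idea mentioned above could look roughly like the following. It is a base-R sketch under stated assumptions, not a tested pipeline for real Kraken output: read_domains() is a hypothetical helper, the toy files are written to a temporary directory, and an empty file stands in for a report that cannot be read.

```r
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")

# Hypothetical helper: read one report and keep only the domain-level rows
read_domains <- function(path) {
  report <- read.delim(path, header = FALSE, col.names = nms,
                       strip.white = TRUE, stringsAsFactors = FALSE)
  domains <- report[report$Rank %in% "D", c("Name", "Num_reads_root")]
  data.frame(file_ID = sub("\\..*$", "", basename(path)), domains,
             stringsAsFactors = FALSE)
}

# Toy input: one valid report and one empty file that read.delim() cannot parse
demo_dir <- tempfile("kraken_demo_")
dir.create(demo_dir)
writeLines(c("75.90\t60533\t28\tD\t2\tBacteria",
             "0.48\t386\t0\tD\t2759\tEukaryota"),
           file.path(demo_dir, "AA1230.report.txt"))
file.create(file.path(demo_dir, "AB0566.report.txt"))

files <- list.files(demo_dir, pattern = "\\.report\\.txt$", full.names = TRUE)
results <- list()
failed <- character(0)

for (i in seq_along(files)) {
  # On error, warn and return NULL so the loop carries on with the next file
  res <- tryCatch(
    read_domains(files[i]),
    error = function(e) {
      warning("Skipping ", basename(files[i]), ": ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
  if (is.null(res)) {
    failed <- c(failed, files[i])          # remember the file for later debugging
  } else {
    results[[length(results) + 1]] <- res  # keep everything read so far
  }
}

out <- do.call(rbind, results)
print(out)
print(basename(failed))
```

Collecting the failed paths separately means one bad file does not cost you the results already read, and you can re-run just those files once the problem is found.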


 