R for loop to extract info from a file and add it into a tibble?

I am not great with tidyverse, so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.

I want the row names to start with the file IDs, which I did manage to create:

filelist <- list.files(pattern=".txt") # Gives me the filenames in the current directory.
# The filenames are something like AA1230.report.txt for example

file_ID <- trimws(filelist, whitespace="\\..*") # Gives me the ID which is before the "report.txt"

metadata <- as_tibble(file_ID[1:181]) # create a tibble with the IDs as row names for the 181 files.
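
As a quick check of the ID step, here is a minimal sketch on a few made-up file names (the names are only examples, patterned on the one above). It assumes the ID is everything before the first dot. `sub()` is shown as an alternative for older R versions, since, as far as I know, the `whitespace` argument of `trimws()` only exists from R 3.6 on.

```r
# Hypothetical file names, patterned on "AA1230.report.txt" from the question
filelist <- c("AA1230.report.txt", "AB0566.report.txt", "BC1148.report.txt")

# Option 1: treat "the first dot and everything after it" as the whitespace to trim
file_ID <- trimws(filelist, whitespace = "\\..*")

# Option 2: delete the first dot and everything after it
file_ID_sub <- sub("\\..*$", "", filelist)

# Compare the two results, then print the IDs
identical(file_ID, file_ID_sub)
file_ID
```

Either way, the IDs end up in a plain character vector that can be passed to `as_tibble()` as in the line above.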

These report files contain information on species and abundance (they are Kraken report files, for those familiar with Kraken), and all I need is to extract the number of reads for each domain. For a single file, I can easily look up the domains and the number of reads that fall into each one using something like:

sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))

sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity

sample_data %>% filter(Rank=="D") # D for domain

This gives me a clear output such as:

  Percentage Num_reads_root Num_reads_taxon Rank  NCBI_ID Name
       <dbl>          <int>           <int> <fct>   <int> <fct>
1      75.9           60533              28 D           2 Bacteria
2       0.48            386               0 D        2759 Eukaryota
3       0.01              4               0 D        2157 Archaea
4       0.02             19               0 D       10239 Viruses

Now I just want to grab the info in the second column and the final column and save it into my tibble, so that I get something like:

> metadata
   value  Bacteria_Counts Eukaryota_Counts Viruses_Counts Archaea_Counts
   <chr>            <int>            <int>          <int>          <int>
 1 AA1230           60533              386             19              4
 2 AB0566
 3 AA1231
 4 AB0567
 5 BC1148
 6 AW0001
 7 AW0002
 8 BB1121
 9 BC0001
10 BC0002
# ... with 171 more rows

I'm just having trouble coming up with a for loop that creates these sample_data outputs, extracts the info from them, and places it into a tibble. I guess my first loop should create the sample_data outputs, so something like:

for (file in filelist) {
  # get_domains: read `file` as above, then keep the rows where Rank == "D"
}

Then another loop would extract that info from the above loop and insert it into my metadata tibble. Any suggestions? Thank you so much! PS: If regular data frames in R are better for this, let me know. I have only recently learned that tidyverse is a better way to organize data frames in R, but I still have to learn more about it.

You could also do:

library(tidyverse)
filelist <- list.files(pattern = "\\.txt$") # all report files in the current directory
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")

set_names(filelist, filelist) %>%
  map_dfr(read_tsv, col_names = nms, .id = 'file_ID') %>% # the reports are tab-separated; the file name goes into file_ID
  filter(Rank == 'D') %>% # keep the domain rows only
  select(file_ID, Name, Num_reads_root) %>%
  pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>% # one column per domain
  mutate(file_ID = str_remove(file_ID, '\\..*')) # drop ".report.txt" so that only the ID remains
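
To get column names like `Bacteria_Counts` and to attach the counts to the `metadata` tibble from the question, the same pipeline can be extended roughly as below. This is only a sketch under some assumptions: the two report files are made up and written to a temporary directory so the example runs on its own, the toy numbers are not real data, and the reports are assumed to be tab-separated with no header row.

```r
library(tidyverse)

# Two made-up, tab-separated report files in a temporary directory (toy numbers)
dir <- tempfile("kraken_demo_")
dir.create(dir)
write_lines(c(
  "75.9\t60533\t28\tD\t2\tBacteria",
  "50.1\t40000\t10\tP\t1224\t  Proteobacteria",
  "0.48\t386\t0\tD\t2759\tEukaryota"
), file.path(dir, "AA1230.report.txt"))
write_lines(c(
  "60.0\t1200\t5\tD\t2\tBacteria",
  "0.02\t19\t0\tD\t10239\tViruses"
), file.path(dir, "AB0566.report.txt"))

nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")
filelist <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

counts <- filelist %>%
  set_names(basename(filelist)) %>%                # names become the file_ID column
  map_dfr(read_tsv, col_names = nms, col_types = "diicic", .id = "file_ID") %>%
  filter(Rank == "D") %>%                          # domain rows only
  mutate(file_ID = str_remove(file_ID, "\\..*")) %>%
  pivot_wider(id_cols = file_ID, names_from = Name,
              values_from = Num_reads_root,
              names_glue = "{Name}_Counts")        # e.g. Bacteria_Counts

# Attach the counts to the ID tibble from the question (its ID column is `value`)
metadata <- tibble(value = c("AA1230", "AB0566"))
metadata <- left_join(metadata, counts, by = c("value" = "file_ID"))
metadata
```

A domain that does not appear in a given report comes out as `NA` in that row; `pivot_wider()` has a `values_fill` argument if zeros are preferred.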

I've found that using a for loop is nice sometimes because it saves all the progress along the way in case you hit an error. Then you can find the problem file and debug it, or use try() but throw a warning().

library(tidyverse)
filelist <- list.files(pattern = "\\.txt$") # list the report files
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")

tmp_list <- list()
for (i in seq_along(filelist)) {
  my_table <- read_tsv(filelist[i], col_names = nms) %>% # It looks like your files are all .tsv's with no header row
    filter(Rank == "D") %>% # keep the domain rows only
    mutate(file_ID = trimws(filelist[i], whitespace = "\\..*")) %>% # ID = everything before the first "."
    select(file_ID, everything())
  tmp_list[[i]] <- my_table # results so far survive if a later file fails
}
out <- bind_rows(tmp_list)
out
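
The try()-plus-warning() idea mentioned before the loop could look roughly like this. It is an illustration only, not part of the original answer: the helper name `read_domains` is made up, `tryCatch()` is used in place of `try()` so that the failing file name can go into the warning, and the demo files (one valid report with toy numbers, one deliberately missing) exist only to make the sketch runnable.

```r
library(tidyverse)

nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")

# Hypothetical helper: read one report and keep only the domain rows
read_domains <- function(path) {
  read_tsv(path, col_names = nms, col_types = "diicic") %>%
    filter(Rank == "D") %>%
    mutate(file_ID = str_remove(basename(path), "\\..*")) %>%
    select(file_ID, everything())
}

# Demo input: one valid made-up report and one path that does not exist
dir <- tempfile("kraken_demo_")
dir.create(dir)
good <- file.path(dir, "AA1230.report.txt")
write_lines(c("75.9\t60533\t28\tD\t2\tBacteria",
              "50.1\t40000\t10\tP\t1224\t  Proteobacteria"), good)
filelist <- c(good, file.path(dir, "missing.report.txt"))

tmp_list <- vector("list", length(filelist))
for (i in seq_along(filelist)) {
  res <- tryCatch(
    read_domains(filelist[i]),
    error = function(e) {
      # Name the problem file in a warning, then carry on with the next one
      warning("Skipping ", filelist[i], ": ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
  tmp_list[i] <- list(res)  # single brackets keep a NULL placeholder for failed files
}

out <- bind_rows(tmp_list)  # NULL entries (the failed files) are ignored
out
```

The loop finishes even when a file cannot be read, and the warnings list which files to inspect afterwards.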
