
Collapsing one list of dataframes and combining with another list of dataframes in R

Long-time lurker on the forum; this will be my first post. I appreciate your patience in advance, as I have limited formal training in computer science and am definitely a biologist by day.

My question is regarding how to handle processing two lists with multiple dataframes each in R. Please find example data below.

set.seed(1)
set1 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
                 SYMBOL = paste(c(sample(LETTERS, 10))),
                 SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
set2 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
               SYMBOL = paste(c(sample(LETTERS, 10))),
               SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
set3 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
                  SYMBOL = paste(c(sample(LETTERS, 10))),
                  SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
set4 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
                 SYMBOL = paste(c(sample(LETTERS, 10))),
                 SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
files <- list(set1, set2, set3, set4)
names(files) <- paste("Set", 1:4, sep = "")
reports <- list(data.frame(SETS = c("Set1", "Set3"),
                        STATISTIC = runif(2)),
             data.frame(SETS = c("Set2", "Set4"),
                        STATISTIC = runif(2)))
names(reports) <- c("Report1", "Report2")

files is a list containing many dataframes of metadata from an analysis.

> files$Set1
     NAME SYMBOL SIGNIFICANT
1   row_1      Y          no
2   row_2      D          no
3   row_3      G          no
4   row_4      A         yes
5   row_5      B         yes
6   row_6      K         yes
7   row_7      N         yes
8   row_8      R         yes
9   row_9      W         yes
10 row_10      J         yes

reports is also a list containing 2 dataframes with primary outputs from a two-way analysis and associated statistics.

> reports$Report1
  SETS STATISTIC
1 Set1 0.4100841
2 Set3 0.8108702

Note that the names of the dataframes within the files list correspond with the SETS column (column 1) of the dataframes within the reports list.

I wish to collapse the files metadata in a particular way. Wherever files$Set1$SIGNIFICANT == 'yes', I would like to append the corresponding SYMBOL to a comma-delimited string. Then I would like to append that string to the corresponding Set within reports. Thus, my desired output would be as follows:

> head(reports$Report1)
  SETS STATISTIC              SYMBOL
1 Set1 0.4100841 A, B, K, N, R, W, J
2 Set3 0.8108702          F, S, J, V

and likewise for Report2

Easy enough to do manually for this example, but in my actual project, length(files) is 600.

I am attempting to do this with a for loop but keep running into errors. Here is my current iteration:

output <- data.frame()
for(i in 1:length(files)){
  for(j in 1:nrow(files[[i]])){
    if(files[j, 3] == "Yes"){
      output[i, 1]=i;
      output[i, 2]=paste0(i[,2], collapse = ", ")
    }
  }
}

And my current error:

Error in i[[j, 3]] : incorrect number of subscripts

I have been working with R for ~4 years now, and if I know one thing, it's that people avoid loops like the plague more often than not. I know some variation of apply, lapply, etc. is likely going to make life easy. Despite that, after consulting the R literature and this forum, I am stumped.

Would appreciate some advice on this one. Thanks everybody!
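
For reference, the loop above has three visible problems: files[j, 3] indexes the list itself rather than the data frame files[[i]], the comparison uses "Yes" while the data contain "yes", and i[,2] tries to index the loop counter. A minimal sketch of a repaired loop is below. It uses a small made-up stand-in for files (the values are hypothetical, not the question's seeded data) and drops the inner loop in favour of a vectorised comparison.

```r
# Toy stand-in for the question's `files` list (hypothetical values)
files <- list(
  Set1 = data.frame(NAME = c("row_1", "row_2", "row_3"),
                    SYMBOL = c("A", "B", "C"),
                    SIGNIFICANT = c("yes", "no", "yes")),
  Set2 = data.frame(NAME = c("row_1", "row_2"),
                    SYMBOL = c("D", "E"),
                    SIGNIFICANT = c("no", "yes"))
)

# Preallocate one row per set instead of growing an empty data.frame
output <- data.frame(SETS = names(files),
                     SYMBOL = character(length(files)),
                     stringsAsFactors = FALSE)

for (i in seq_along(files)) {
  df <- files[[i]]                   # the i-th data frame, not the list
  keep <- df$SIGNIFICANT == "yes"    # vectorised, so no inner loop over rows
  output$SYMBOL[i] <- paste(df$SYMBOL[keep], collapse = ", ")
}

output
```

The loop works, but it only builds the lookup table; the table still has to be merged into each element of reports.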

I think you can do this in two steps. First, create a data.frame that holds the significant symbols for each set:

sigset <- stack(lapply(files, function(x) paste(x$SYMBOL[x$SIGNIFICANT=="yes"], collapse=", ")))
names(sigset) <- c("SYMBOL","SETS")

We use lapply() to iterate over the files list, extracting the significant symbols and combining them into one string per set. Then we stack the list into a data.frame to make it easier to work with, and rename the columns so that it merges on the shared column name. Finally, we merge this data.frame with each of the reports:

output <- lapply(reports, function(x) merge(x, sigset))
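
To make the two helper steps concrete, here is a small sketch of what stack() and merge() do on their own. The list sig and the data frame report are made-up stand-ins (hypothetical names and values), not the question's data.

```r
# Hypothetical named list of already-collapsed strings, one per set
sig <- list(Set1 = "A, C", Set2 = "E", Set3 = "F, G")

# stack() turns the named list into two columns:
# `values` (the strings) and `ind` (the list names, as a factor)
sigset <- stack(sig)
names(sigset) <- c("SYMBOL", "SETS")

# Hypothetical report; the statistics are made up
report <- data.frame(SETS = c("Set1", "Set3"), STATISTIC = c(0.1, 0.2))

# With no `by` argument, merge() joins on the columns the two
# data frames have in common, which here is only SETS
merged <- merge(report, sigset)
merged
```

Because merge() keeps only matching keys by default, Set2 drops out of the result for this report.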

You can use sapply to iterate over the files list, keep only the SYMBOL values where SIGNIFICANT == 'yes' in each dataframe, and collapse them into one string with toString.

data <- stack(sapply(files,function(x) toString(x$SYMBOL[x$SIGNIFICANT=='yes'])))

data
#               values  ind
#1 A, B, K, N, R, W, J Set1
#2                B, F Set2
#3          F, S, J, V Set3
#4    W, Z, H, Q, D, M Set4

You can then merge data with each dataframe in reports.

result <- lapply(reports, function(x) merge(x,data, by.x = 'SETS', by.y = 'ind'))
result

#$Report1
#  SETS STATISTIC              values
#1 Set1 0.4100841 A, B, K, N, R, W, J
#2 Set3 0.8108702          F, S, J, V

#$Report2
#  SETS STATISTIC           values
#1 Set2 0.6049333             B, F
#2 Set4 0.6547239 W, Z, H, Q, D, M

Here is a solution similar to MrFlick's but using subset, setNames, and different lapply calls, which may be slightly easier to read:

# get the characters with SIGNIFICANT equal to "yes"
all_symbs <- lapply(files, subset, SIGNIFICANT == "yes", SYMBOL, TRUE)
# create one data.frame with the above after concatenating
all_files <- setNames(stack(lapply(all_symbs, paste0, collapse = ", ")),
                      c("SYMBOL","SETS"))
# merge with reports
res <- lapply(reports, merge, y = all_files)
# the result
res
#R> $Report1
#R>   SETS STATISTIC              SYMBOL
#R> 1 Set1 0.4100841 A, B, K, N, R, W, J
#R> 2 Set3 0.8108702          F, S, J, V
#R> 
#R> $Report2
#R>   SETS STATISTIC           SYMBOL
#R> 1 Set2 0.6049333             B, F
#R> 2 Set4 0.6547239 W, Z, H, Q, D, M

You can get rid of one of the lapply calls by creating an anonymous function instead of the two lapply(files, ...) and lapply(all_symbs, ...) calls.
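
A minimal sketch of that single-lapply variant might look like the following. The files and reports lists here are small made-up stand-ins (hypothetical values) so that the block runs on its own; Set4 deliberately has no "yes" rows to show that it ends up with an empty string.

```r
# Toy stand-ins for the question's lists (hypothetical values)
files <- list(
  Set1 = data.frame(SYMBOL = c("A", "B", "C"), SIGNIFICANT = c("yes", "no", "yes")),
  Set2 = data.frame(SYMBOL = c("D", "E"),      SIGNIFICANT = c("no", "yes")),
  Set3 = data.frame(SYMBOL = c("F", "G"),      SIGNIFICANT = c("yes", "yes")),
  Set4 = data.frame(SYMBOL = c("H", "I"),      SIGNIFICANT = c("no", "no"))
)
reports <- list(
  Report1 = data.frame(SETS = c("Set1", "Set3"), STATISTIC = c(0.1, 0.2)),
  Report2 = data.frame(SETS = c("Set2", "Set4"), STATISTIC = c(0.3, 0.4))
)

# One anonymous function both filters on SIGNIFICANT and collapses SYMBOL
all_files <- setNames(
  stack(lapply(files, function(x)
    paste0(x$SYMBOL[x$SIGNIFICANT == "yes"], collapse = ", "))),
  c("SYMBOL", "SETS"))

# Merge the lookup table into every report on the shared SETS column
res <- lapply(reports, merge, y = all_files)
res
```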
