Long-time lurker on the forum; this is my first post. I appreciate your patience in advance: I have limited formal training in computer science and am definitely a biologist by day.
My question is regarding how to handle processing two lists with multiple dataframes each in R. Please find example data below.
set.seed(1)
set1 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
                   SYMBOL = paste(c(sample(LETTERS, 10))),
                   SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
set2 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
                   SYMBOL = paste(c(sample(LETTERS, 10))),
                   SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
set3 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
                   SYMBOL = paste(c(sample(LETTERS, 10))),
                   SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
set4 <- data.frame(NAME = paste("row_", 1:10, sep = ""),
                   SYMBOL = paste(c(sample(LETTERS, 10))),
                   SIGNIFICANT = sample(c("yes", "no"), 10, replace = TRUE))
files <- list(set1, set2, set3, set4)
names(files) <- paste("Set", 1:4, sep = "")
reports <- list(data.frame(SETS = c("Set1", "Set3"),
                           STATISTIC = runif(2)),
                data.frame(SETS = c("Set2", "Set4"),
                           STATISTIC = runif(2)))
names(reports) <- c("Report1", "Report2")
files is a list containing many dataframes of metadata from an analysis.
> files$Set1
NAME SYMBOL SIGNIFICANT
1 row_1 Y no
2 row_2 D no
3 row_3 G no
4 row_4 A yes
5 row_5 B yes
6 row_6 K yes
7 row_7 N yes
8 row_8 R yes
9 row_9 W yes
10 row_10 J yes
reports is also a list, containing 2 dataframes with primary outputs from a two-way analysis and associated statistics.
> reports$Report1
SETS STATISTIC
1 Set1 0.4100841
2 Set3 0.8108702
Note that the names of the dataframes within the files list correspond to the SETS column (the first column) of the dataframes within the reports list.
I wish to collapse the files metadata in a particular way. Where files$Set1$SIGNIFICANT == 'yes', I would like to append the corresponding SYMBOL to a comma-delimited string. Then, I would like to attach that string to the corresponding set within reports. Thus, my desired output would be as follows:
> head(reports$Report1)
SETS STATISTIC SYMBOL
1 Set1 0.4100841 A, B, K, N, R, W, J
2 Set3 0.8108702 F, S, J, V
and likewise for Report2.
Easy enough to do manually for this example, but in my actual project, length(files) = 600.
I am attempting to do this with a for loop but keep running into errors. Here is my current iteration:
output <- data.frame()
for(i in 1:length(files)){
for(j in 1:nrow(files[[i]])){
if(files[j, 3] == "Yes"){
output[i, 1]=i;
output[i, 2]=paste0(i[,2], collapse = ", ")
}
}
}
And my current error:
Error in i[[j, 3]] : incorrect number of subscripts
I have been working with R for ~4 years now, and if I know one thing, it's that people avoid loops like the plague more often than not. I know some variation of apply, lapply, etc. is likely going to make life easy. Despite that, after consulting the R literature and this forum, I am stumped.
Would appreciate some advice on this one. Thanks everybody!
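A note on the loop itself, offered as a sketch rather than a recommended approach. Three things stand out in the posted code: files is a plain list, so two-index subsetting such as files[j, 3] cannot work (each data frame has to be pulled out with files[[i]] first); the comparison uses "Yes" while the data contain "yes"; and i is an integer counter, so i[, 2] has nothing to index. The inner loop over rows is also unnecessary, because the comparison is vectorised. The toy files list below uses made-up values and only stands in for the real one.

```r
# Toy stand-in for the question's `files` list (hypothetical values).
files <- list(
  Set1 = data.frame(SYMBOL = c("A", "B", "C"),
                    SIGNIFICANT = c("yes", "no", "yes")),
  Set2 = data.frame(SYMBOL = c("D", "E"),
                    SIGNIFICANT = c("no", "yes"))
)

# Pre-allocate one output row per set instead of growing an empty data.frame.
output <- data.frame(SETS = names(files),
                     SYMBOL = character(length(files)),
                     stringsAsFactors = FALSE)

for (i in seq_along(files)) {
  df <- files[[i]]                    # [[ ]] extracts the data frame itself
  keep <- df$SIGNIFICANT == "yes"     # lowercase, matching the data
  # Collapse the significant symbols of this set into one string.
  output$SYMBOL[i] <- paste(df$SYMBOL[keep], collapse = ", ")
}

output
```

The resulting output data frame has one row per set and could then be merged into each report on the SETS column.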
I think you can do this in two steps. First, create a data.frame that holds the significant symbols for each set:
sigset <- stack(lapply(files, function(x) paste(x$SYMBOL[x$SIGNIFICANT=="yes"], collapse=", ")))
names(sigset) <- c("SYMBOL","SETS")
We use lapply() to iterate over the files list, extracting the significant symbols and combining them into one string per set; then we stack the list into a data.frame to make it easier to work with. We change the names so that it merges on the shared column name. Then we can merge this data.frame with each of the reports:
output <- lapply(reports, function(x) merge(x, sigset))
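To make the stack-and-rename step concrete, here is a small sketch on made-up values (per_set and report are hypothetical stand-ins, not the question's data). It shows that stack() turns a named list into a two-column data.frame named values and ind, and that after renaming, merge() joins on the one column name the two data frames share.

```r
# Hypothetical per-set strings, as lapply() would return them.
per_set <- list(Set1 = "A, C", Set2 = "E")

# stack() yields one row per list element: `values` holds the strings,
# `ind` holds the list names.
sigset <- stack(per_set)
stacked_names <- names(sigset)

# Rename so the key column matches the reports' SETS column.
names(sigset) <- c("SYMBOL", "SETS")

# Hypothetical report; merge() defaults to joining on shared column names.
report <- data.frame(SETS = c("Set2", "Set1"), STATISTIC = c(0.5, 0.9))
merged <- merge(report, sigset)
merged
```

Note that merge() sorts the result by the key column by default, so the rows come back ordered by SETS rather than in the report's original order.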
You can use sapply to iterate over the files list; from each dataframe, keep only the SYMBOL values where SIGNIFICANT == 'yes' and collapse them into one string.
data <- stack(sapply(files,function(x) toString(x$SYMBOL[x$SIGNIFICANT=='yes'])))
data
# values ind
#1 A, B, K, N, R, W, J Set1
#2 B, F Set2
#3 F, S, J, V Set3
#4 W, Z, H, Q, D, M Set4
You can then merge data with each dataframe in reports.
result <- lapply(reports, function(x) merge(x,data, by.x = 'SETS', by.y = 'ind'))
result
#$Report1
# SETS STATISTIC values
#1 Set1 0.4100841 A, B, K, N, R, W, J
#2 Set3 0.8108702 F, S, J, V
#$Report2
# SETS STATISTIC values
#1 Set2 0.6049333 B, F
#2 Set4 0.6547239 W, Z, H, Q, D, M
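One caveat worth checking when scaling this up: merge() performs an inner join by default, so a report row whose set name has no match in the stacked data is silently dropped. The sketch below uses made-up values (data, report, and "Set9" are hypothetical) and shows how all.x = TRUE keeps such rows with NA instead. Whether that is the desired behaviour depends on the analysis.

```r
# Hypothetical stacked symbols, shaped like the `data` object above.
data <- stack(list(Set1 = "A, C", Set2 = "E"))

# Hypothetical report that names a set ("Set9") absent from `data`.
report <- data.frame(SETS = c("Set1", "Set9"), STATISTIC = c(0.1, 0.2))

# Default inner join: the unmatched Set9 row disappears.
inner <- merge(report, data, by.x = "SETS", by.y = "ind")

# all.x = TRUE keeps every report row; unmatched rows get NA in `values`.
left <- merge(report, data, by.x = "SETS", by.y = "ind", all.x = TRUE)
left
```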
Here is a solution similar to MrFlick's but using subset, setNames, and different lapply calls, which may be slightly easier to read:
# get the characters with SIGNIFICANT equal to "yes"
all_symbs <- lapply(files, subset, SIGNIFICANT == "yes", SYMBOL, TRUE)
# create one data.frame with the above after concatenating
all_files <- setNames(stack(lapply(all_symbs, paste0, collapse = ", ")),
c("SYMBOL","SETS"))
# merge with reports
res <- lapply(reports, merge, y = all_files)
# the result
res
#R> $Report1
#R> SETS STATISTIC SYMBOL
#R> 1 Set1 0.4100841 A, B, K, N, R, W, J
#R> 2 Set3 0.8108702 F, S, J, V
#R>
#R> $Report2
#R> SETS STATISTIC SYMBOL
#R> 1 Set2 0.6049333 B, F
#R> 2 Set4 0.6547239 W, Z, H, Q, D, M
You can get rid of one of the lapply calls by using a single anonymous function in place of the two lapply(files, ...) and lapply(all_symbs, ...) calls.
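A minimal sketch of that single-lapply variant, run on made-up data (the toy files and reports below are hypothetical stand-ins for the question's objects): the anonymous function filters and collapses in one pass, so the intermediate all_symbs list is no longer needed.

```r
# Toy stand-ins for the question's lists (hypothetical values).
files <- list(
  Set1 = data.frame(SYMBOL = c("A", "B", "C"),
                    SIGNIFICANT = c("yes", "no", "yes")),
  Set2 = data.frame(SYMBOL = c("D", "E"),
                    SIGNIFICANT = c("no", "yes")),
  Set3 = data.frame(SYMBOL = c("F", "G"),
                    SIGNIFICANT = c("yes", "yes"))
)
reports <- list(
  Report1 = data.frame(SETS = c("Set1", "Set3"), STATISTIC = c(0.4, 0.8)),
  Report2 = data.frame(SETS = "Set2", STATISTIC = 0.6)
)

# One lapply(): subset the significant symbols and collapse them together.
all_files <- setNames(
  stack(lapply(files, function(x) {
    paste0(x$SYMBOL[x$SIGNIFICANT == "yes"], collapse = ", ")
  })),
  c("SYMBOL", "SETS")
)

# Merge with each report on the shared SETS column, as before.
res <- lapply(reports, merge, y = all_files)
res
```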