Reduce a list of data.frames to a non-redundant data.frame
I have a list of data.frames in which, for some data.frame elements, all columns and their values are contained in another data.frame element.
Here is an example list:
df.list <- list(df1 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
df2 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)),
df3 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
df4 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)))
So df.list$df1 is contained in df.list$df3 and df.list$df2 is contained in df.list$df4, the only difference being that df.list$df3 and df.list$df4 have an age column that df.list$df1 and df.list$df2 lack.
I'd like to reduce this list to a non-redundant data.frame.
list(unique(Reduce(rbind,df.list)))
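does not work because the data.frame elements have different columns (the age column in df.list$df3 and df.list$df4), so I'm looking for something that can detect that df.list$df1 and df.list$df2 are contained in df.list$df3 and df.list$df4, respectively (and are therefore redundant).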
For the example above, the resulting data.frame would be:
unique(rbind(df.list$df3, df.list$df4))
If your intent is to reduce the NA values in age, expecting them to be filled in automatically, then try this:
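Reduce(function(a, b) {
  # full outer join on the shared columns other than "age" and "y";
  # clashing columns coming from b get the ".z" suffix
  out <- merge(a, b,
               by = setdiff(intersect(names(a), names(b)), c("age", "y")),
               suffixes = c("", ".z"), all = TRUE)
  # pair each ".z" column with its unsuffixed counterpart
  dotz <- grep("\\.z$", names(out), value = TRUE)
  noz <- gsub("\\.z$", "", dotz)
  dotz <- dotz[noz %in% names(out)]
  noz <- noz[noz %in% names(out)]
  # fill NAs in the unsuffixed column from its ".z" twin, then drop the paired ".z" columns
  out[noz] <- Map(function(a, b) ifelse(is.na(a), b, a), out[noz], out[dotz])
  out[setdiff(names(out), dotz)]
}, df.list)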
# id name replicate condition y age
# 1 i1 b_1 1 b 0.20 10
# 2 i1 b_2 2 b 0.10 20
# 3 i1 b_3 3 b 0.05 30
# 4 i1 t1_1 1 t1 0.50 10
# 5 i1 t1_2 2 t1 0.60 20
# 6 i1 t1_3 3 t1 0.20 30
# 7 i1 t2_1 1 t2 0.80 10
# 8 i1 t2_2 2 t2 0.90 20
# 9 i1 t2_3 3 t2 0.70 30
Note that when both values of y (for example) are non-NA, this process silently drops the second (and any subsequent) non-NA value.
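To see whether this caveat affects your data, one option is to inspect the merged columns before they are coalesced. The following is a minimal sketch using two made-up data frames (a and b are illustrative only, not from the question) that disagree on y for one row:

```r
# Two hypothetical frames sharing the key columns but disagreeing on y for b_2
a <- data.frame(id = c("i1", "i1"), name = c("b_1", "b_2"), y = c(0.2, 0.1))
b <- data.frame(id = c("i1", "i1"), name = c("b_1", "b_2"), y = c(0.2, 0.9))

# Same kind of join as in the answer: the clashing y column from b becomes y.z
m <- merge(a, b, by = c("id", "name"), suffixes = c("", ".z"), all = TRUE)

# Rows where both values are present but differ; coalescing would keep y and discard y.z
conflict <- !is.na(m$y) & !is.na(m$y.z) & m$y != m$y.z
m[conflict, ]
```

Any rows printed by the last line are cases where the coalescing step would keep the first value and discard the second.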
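As r2evans mentioned in the comments, Reduce() isn't really the best way to go here. Instead, you can keep only the data frames that are not contained in any of the others and bind those together: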
df.list |>
seq_along() |>
# for each data frame, check if every element is also element in one of the others
lapply(\(x) lapply(df.list[-x], \(y) all(df.list[[x]] %in% y))) |>
sapply(\(x) unlist(x) |> Negate(any)()) |>
{\(.) df.list[.]}() |>
do.call(what = rbind)
#> id name replicate age condition y
#> df3.1 i1 t1_1 1 10 t1 0.50
#> df3.2 i1 t1_2 2 20 t1 0.60
#> df3.3 i1 t1_3 3 30 t1 0.20
#> df3.4 i1 b_1 1 10 b 0.20
#> df3.5 i1 b_2 2 20 b 0.10
#> df3.6 i1 b_3 3 30 b 0.05
#> df4.1 i1 t2_1 1 10 t2 0.80
#> df4.2 i1 t2_2 2 20 t2 0.90
#> df4.3 i1 t2_3 3 30 t2 0.70
#> df4.4 i1 b_1 1 10 b 0.20
#> df4.5 i1 b_2 2 20 b 0.10
#> df4.6 i1 b_3 3 30 b 0.05
Created on 2022-02-01 by the reprex package (v2.0.1)
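The output above still lists the b rows twice (once from df3 and once from df4). For comparison, the containment test can also be written out explicitly and the result passed through unique(). This is a minimal sketch, not taken from either answer: is_contained() is a hypothetical helper that treats a as redundant when all of its columns exist in b and every distinct row of a appears in b restricted to those columns. One known limitation: two identical data frames would each be flagged as contained in the other, so both would be dropped.

```r
# Example list from the question
df.list <- list(
  df1 = data.frame(id = rep("i1", 6), name = c("t1_1", "t1_2", "t1_3", "b_1", "b_2", "b_3"),
                   replicate = rep(1:3, 2), condition = c(rep("t1", 3), rep("b", 3)),
                   y = c(0.5, 0.6, 0.2, 0.2, 0.1, 0.05)),
  df2 = data.frame(id = rep("i1", 6), name = c("t2_1", "t2_2", "t2_3", "b_1", "b_2", "b_3"),
                   replicate = rep(1:3, 2), condition = c(rep("t2", 3), rep("b", 3)),
                   y = c(0.8, 0.9, 0.7, 0.2, 0.1, 0.05)),
  df3 = data.frame(id = rep("i1", 6), name = c("t1_1", "t1_2", "t1_3", "b_1", "b_2", "b_3"),
                   replicate = rep(1:3, 2), age = rep(c(10, 20, 30), 2),
                   condition = c(rep("t1", 3), rep("b", 3)),
                   y = c(0.5, 0.6, 0.2, 0.2, 0.1, 0.05)),
  df4 = data.frame(id = rep("i1", 6), name = c("t2_1", "t2_2", "t2_3", "b_1", "b_2", "b_3"),
                   replicate = rep(1:3, 2), age = rep(c(10, 20, 30), 2),
                   condition = c(rep("t2", 3), rep("b", 3)),
                   y = c(0.8, 0.9, 0.7, 0.2, 0.1, 0.05))
)

# Hypothetical helper: TRUE if every column of a exists in b and every
# distinct row of a also occurs in b (looking only at a's columns)
is_contained <- function(a, b) {
  if (!all(names(a) %in% names(b))) return(FALSE)
  ua <- unique(a)
  ub <- unique(b[names(a)])
  # inner join on all shared columns; a is contained if no row of ua is lost
  nrow(merge(ua, ub)) == nrow(ua)
}

# A frame is redundant if it is contained in at least one of the other frames
redundant <- vapply(seq_along(df.list), function(i) {
  any(vapply(seq_along(df.list)[-i],
             function(j) is_contained(df.list[[i]], df.list[[j]]),
             logical(1)))
}, logical(1))

# Bind the remaining frames and drop rows that occur in more than one of them
result <- unique(do.call(rbind, unname(df.list[!redundant])))
result
```

With the example data this flags df1 and df2 as redundant and leaves nine distinct rows from df3 and df4, which corresponds to the unique(rbind(df.list$df3, df.list$df4)) result the question asks for.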