
Reduce a list of data.frames to a non-redundant data.frame

I have a list of data.frames in which, for some data.frame elements, all columns and their values are contained in another data.frame element.

Here is an example list:

df.list <- list(df1 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
                df2 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)),
                df3 = data.frame(id = rep("i1",6), name = c("t1_1","t1_2","t1_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t1",3),rep("b",3)), y = c(0.5,0.6,0.2,0.2,0.1,0.05)),
                df4 = data.frame(id = rep("i1",6), name = c("t2_1","t2_2","t2_3","b_1","b_2","b_3"), replicate = rep(1:3,2), age = rep(c(10,20,30),2), condition = c(rep("t2",3),rep("b",3)), y = c(0.8,0.9,0.7,0.2,0.1,0.05)))

So df.list$df1 is contained in df.list$df3, and df.list$df2 is contained in df.list$df4: the only difference is that df.list$df3 and df.list$df4 have an age column, which df.list$df1 and df.list$df2 lack.
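The column difference described above can be checked directly. This is a minimal sketch that uses only the first row of df1 and df3 from the example (retyped here so the snippet runs on its own); `extra` and `same_shared` are names chosen for illustration.

```r
# First row of df1 and df3 from the example list, retyped for a standalone run.
df1 <- data.frame(id = "i1", name = "t1_1", replicate = 1L,
                  condition = "t1", y = 0.5)
df3 <- data.frame(id = "i1", name = "t1_1", replicate = 1L, age = 10,
                  condition = "t1", y = 0.5)

# Columns present in df3 but missing from df1.
extra <- setdiff(names(df3), names(df1))
print(extra)

# On the shared columns the two data.frames hold the same values.
same_shared <- isTRUE(all.equal(df1, df3[names(df1)]))
print(same_shared)
```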

I want to Reduce this list to a non-redundant data.frame.

list(unique(Reduce(rbind,df.list)))

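does not work because the data.frame elements have different columns (the age column in df.list$df3 and df.list$df4), so I am looking for something equivalent that can detect that df.list$df1 and df.list$df2 are contained in df.list$df3 and df.list$df4, respectively (and are therefore redundant).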
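To make the failure concrete, here is a small sketch with two toy data.frames (`a` and `b` are made up for illustration and are not part of the example list). `rbind()` on data.frames with a different number of columns raises an error, which `tryCatch()` captures here instead of stopping the script.

```r
# Toy data.frames; `b` has an extra column, just as df3/df4 have `age`.
a <- data.frame(id = "i1", y = 0.5)
b <- data.frame(id = "i1", age = 10, y = 0.5)

# rbind() on data.frames needs matching columns, so this call fails.
# The error message is stored; NA would mean no error occurred.
msg <- tryCatch(
  {
    rbind(a, b)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```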

For the example above, the resulting data.frame would be:

unique(rbind(df.list$df3, df.list$df4))
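One way to arrive at that result without naming df3 and df4 by hand is to test containment explicitly. The following is only a sketch under stated assumptions: `mk`, `row_keys`, `is_contained` and `is_redundant` are hypothetical helpers, rows are compared through pasted string keys (so numeric values must print identically), and the example data is rebuilt compactly with `age` as the last column.

```r
# Rebuild the example list compactly (hypothetical helper `mk`).
mk <- function(trt, y, with_age) {
  d <- data.frame(id = "i1",
                  name = c(paste0(trt, "_", 1:3), paste0("b_", 1:3)),
                  replicate = rep(1:3, 2),
                  condition = rep(c(trt, "b"), each = 3),
                  y = y)
  if (with_age) d$age <- rep(c(10, 20, 30), 2)
  d
}
y1 <- c(0.5, 0.6, 0.2, 0.2, 0.1, 0.05)
y2 <- c(0.8, 0.9, 0.7, 0.2, 0.1, 0.05)
dfs <- list(df1 = mk("t1", y1, FALSE), df2 = mk("t2", y2, FALSE),
            df3 = mk("t1", y1, TRUE),  df4 = mk("t2", y2, TRUE))

# One string key per row, built from the given columns.
row_keys <- function(df, cols) {
  do.call(paste, c(unname(as.list(df[cols])), sep = "\r"))
}

# TRUE if all columns of `a` exist in `b` and every row of `a`
# (on those columns) also occurs in `b`.
is_contained <- function(a, b) {
  all(names(a) %in% names(b)) &&
    all(row_keys(a, names(a)) %in% row_keys(b, names(a)))
}

# Element i is redundant if another element contains it. When two elements
# contain each other (exact duplicates), only the later one is dropped.
is_redundant <- function(i) {
  any(vapply(seq_along(dfs)[-i], function(j) {
    is_contained(dfs[[i]], dfs[[j]]) &&
      (!is_contained(dfs[[j]], dfs[[i]]) || j < i)
  }, logical(1)))
}

keep <- !vapply(seq_along(dfs), is_redundant, logical(1))
result <- unique(do.call(rbind, dfs[keep]))
rownames(result) <- NULL
print(result)
```

The final `unique()` matters because the retained data.frames can still share rows (here the three `b` rows appear in both df3 and df4).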

If your intent is to fill in the NA values in age, expecting them to be updated automatically, then try this:

Reduce(function(a, b) {
  out <- merge(a, b,
    by = setdiff(intersect(names(a), names(b)), c("age","y")), suffixes = c("", ".z"),
    all = TRUE)
  dotz <- grep("\\.z$", names(out), value = TRUE)
  noz <- gsub("\\.z$", "", dotz)
  dotz <- dotz[noz %in% names(out)]
  noz <- noz[noz %in% names(out)]
  out[noz] <- Map(function(a, b) ifelse(is.na(a), b, a), out[noz], out[dotz])
  out[setdiff(names(out), dotz)]
}, df.list)
#   id name replicate condition    y age
# 1 i1  b_1         1         b 0.20  10
# 2 i1  b_2         2         b 0.10  20
# 3 i1  b_3         3         b 0.05  30
# 4 i1 t1_1         1        t1 0.50  10
# 5 i1 t1_2         2        t1 0.60  20
# 6 i1 t1_3         3        t1 0.20  30
# 7 i1 t2_1         1        t2 0.80  10
# 8 i1 t2_2         2        t2 0.90  20
# 9 i1 t2_3         3        t2 0.70  30

Note that when both values of y (for example) are non-NA, this process silently drops the second (and any subsequent) non-NA value.
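The silent drop comes from the `ifelse(is.na(a), b, a)` step, which always prefers the first value when it is not NA. A minimal sketch with made-up vectors (`first`, `second` and `conflict` are illustrative names) shows the effect and one way to flag the affected positions before coalescing:

```r
# Made-up values: position 1 is non-NA in both vectors and the values differ.
first  <- c(0.5, NA, 0.2)
second <- c(0.9, 0.6, NA)

# Same coalescing rule as in the Reduce() solution: keep `first` unless it is NA.
combined <- ifelse(is.na(first), second, first)
print(combined)

# Positions where both values are present but disagree; the value from
# `second` at these positions is discarded without any warning.
conflict <- !is.na(first) & !is.na(second) & first != second
print(which(conflict))
```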

As r2evans mentioned in the comments, Reduce() is not really the best way to go.

df.list |> 
  seq_along() |>
  # for each data frame, check if every element is also element in one of the others
  lapply(\(x) lapply(df.list[-x], \(y) all(df.list[[x]] %in% y))) |> 
  sapply(\(x) unlist(x) |> Negate(any)()) |> 
  {\(.) df.list[.]}() |> 
  do.call(what = rbind)
#>       id name replicate age condition    y
#> df3.1 i1 t1_1         1  10        t1 0.50
#> df3.2 i1 t1_2         2  20        t1 0.60
#> df3.3 i1 t1_3         3  30        t1 0.20
#> df3.4 i1  b_1         1  10         b 0.20
#> df3.5 i1  b_2         2  20         b 0.10
#> df3.6 i1  b_3         3  30         b 0.05
#> df4.1 i1 t2_1         1  10        t2 0.80
#> df4.2 i1 t2_2         2  20        t2 0.90
#> df4.3 i1 t2_3         3  30        t2 0.70
#> df4.4 i1  b_1         1  10         b 0.20
#> df4.5 i1  b_2         2  20         b 0.10
#> df4.6 i1  b_3         3  30         b 0.05

Created on 2022-02-01 by the reprex package (v2.0.1)

