Creating all combinations of sampling from two groups of columns in R

I have the data frame below, which contains two groups of columns: A and B, and D and E. I would like to build every combination of filters on these columns, choosing only one column from each group at a time, and then summarise the data for each combination. I don't know the right way to do this, and the real problem is much bigger.

df =

     Size    A     B     D     E
       1     1     1     0     0
       5     0     0     1     0
       10    1     1     1     0
       3     1     0     0     0
       2     1     1     1     1
       55    0     0     0     1
       5     1     0     1     1
       2     0     0     1     1
       1     1     1     1     1
       4     1     1     1     0

So the filter combinations should be:

Filter 1: A=1 AND D=1

Filter 2: A=1 AND D=0

Filter 3: A=1 AND E=1

Filter 4: A=1 AND E=0

Filter 5: A=0 AND D=1

Filter 6: A=0 AND D=0

Filter 7: A=0 AND E=1

Filter 8: A=0 AND E=0

Filter 9: B=1 AND D=1

Filter 10: B=1 AND D=0

Filter 11: B=1 AND E=1

Filter 12: B=1 AND E=0

Filter 13: B=0 AND D=1

Filter 14: B=0 AND D=0

Filter 15: B=0 AND E=1

Filter 16: B=0 AND E=0

I want to find a way to efficiently create these filter groups (always drawing one filter from columns A and B and one from columns D and E) and then to find the average and count of the Size column for each filter setting. So far I have only managed to do this without separate groups to draw the filters from.

What I tried looks like this:

library(dplyr)

groupNames <- names(df)[2:5]

# every subset of the four columns, of size 1 to 4
myGroups <- Map(combn, list(groupNames), seq_along(groupNames), simplify = FALSE) %>% unlist(recursive = FALSE)

# group by each subset of columns and summarise Size
results = lapply(myGroups, FUN = function(x) {df %>% group_by(across(all_of(x))) %>% summarise(n = length(Size), avgVar1 = mean(Size))})

It treats the four columns equally and does not account for sampling from the two groups. What could I change in the code to make this work?

Thank you very much.

library(tidyverse)
df <- tribble(~Size, ~A, ~B, ~D, ~E,
              1, "1", "1", "0", "0",
              5, "0", "0", "1", "0",
              10, "1", "1", "1", "0",
              3, "1", "0", "0", "0",
              2, "1", "1", "1", "1",
              55, "0", "0", "0", "1",
              5, "1", "0", "1", "1",
              2, "0", "0", "1", "1",
              1, "1", "1", "1", "1",
              4, "1", "1", "1", "0")
p <- function(...) paste0(...) # for legibility, should rather use glue

all_filtering_groups <- list(c("A", "B"), c("D", "E")) # assuming these are known
all_combns <- map(seq_along(all_filtering_groups), ~ combn(all_filtering_groups, .))
res <- vector("list", length(all_combns)) # one slot per combination length

#microbenchmark::microbenchmark({
for(comb_length in seq_along(all_combns)){
  res[[comb_length]] <- vector("list", ncol(all_combns[[comb_length]]))
  for(col_i in seq_len(ncol(all_combns[[comb_length]]))){
    
    filtering_groups <- all_combns[[comb_length]][,col_i]
    group_names <- as.character(seq_along(filtering_groups))
    
    
    # prepare grid of all combinations
    filtering_combs <- c(filtering_groups, rep(list(0:1), length(filtering_groups)))
    names(filtering_combs) <- c(p("vars_", group_names), p("vals_", group_names))
    full_grid <- expand.grid(filtering_combs, stringsAsFactors = FALSE) # keep column names as character
    
    for(ll in seq_len(nrow(full_grid))){ # for each line in the full_grid
      # find df lines that correspond
      cond <- rep(TRUE, nrow(df))
      for(grp in group_names){
        cond <- cond & df[[full_grid[p("vars_", grp)][ll,]]] == full_grid[p("vals_", grp)][ll,]
      }
      # and compute whatever
      full_grid$lines[ll] <- paste(which(cond), collapse = ", ") #for visual verification
      full_grid$n[ll] <- length(df$Size[cond])
      full_grid$sum[ll] <- sum(df$Size[cond])
      full_grid$mean[ll] <- mean(df$Size[cond])
    }
    res[[comb_length]][[col_i]] <- full_grid
    
  }
}
#}, times = 10) #microbenchmark

bind_rows(res) %>% relocate(starts_with("vars") | starts_with("vals"))
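
For the two-group case in the question, the same idea can be sketched more compactly in base R. This is only an illustration, not a replacement for the loop above: it assumes numeric 0/1 columns, exactly one column drawn from each group, and the column names var1, val1, var2 and val2 are made up for this example.

```r
# Example data from the question, with numeric 0/1 indicator columns
df <- data.frame(
  Size = c(1, 5, 10, 3, 2, 55, 5, 2, 1, 4),
  A = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1),
  B = c(1, 0, 1, 0, 1, 0, 0, 0, 1, 1),
  D = c(0, 1, 1, 0, 1, 0, 1, 1, 1, 1),
  E = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)
)

# One row per filter: a column from group 1 with its value,
# and a column from group 2 with its value (2 * 2 * 2 * 2 = 16 rows)
grid <- expand.grid(
  var1 = c("A", "B"), val1 = 0:1,
  var2 = c("D", "E"), val2 = 0:1,
  stringsAsFactors = FALSE
)

grid$n <- NA_integer_
grid$mean <- NA_real_
for (i in seq_len(nrow(grid))) {
  # rows of df that satisfy both conditions of filter i
  keep <- df[[grid$var1[i]]] == grid$val1[i] &
    df[[grid$var2[i]]] == grid$val2[i]
  grid$n[i] <- sum(keep)
  grid$mean[i] <- mean(df$Size[keep]) # NaN if no row matches
}

grid
```

Each row of grid is one of the 16 filters listed in the question, with the count and mean of Size for the matching rows.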

Following the discussion in the comments, I think we can treat the groups as variables. So we need to reshape the data frame to have one column per factor, and then we can use standard tidyverse approaches. I'm assuming the groups are defined by the column names (A1...Ak, B1...Bk, ...).

library(tidyverse)
df <- tribble(~Size, ~A1, ~A2, ~B1, ~B2,
              1, "1", "1", "0", "0",
              5, "0", "0", "1", "0",
              10, "1", "1", "1", "0",
              3, "1", "0", "0", "0",
              2, "1", "1", "1", "1",
              55, "0", "0", "0", "1",
              5, "1", "0", "1", "1",
              2, "0", "0", "1", "1",
              1, "1", "1", "1", "1",
              4, "1", "1", "1", "0")

get_levels <- function(col){
  paste(names(col)[col == "1"], collapse = ",")
}
# Rewrite with groups as factors
df_factors <- df %>%
  mutate(id = row_number()) %>%  #to avoid aggregating same Size
  nest(A = starts_with("A"), B = starts_with("B")) %>%
  mutate(A = factor(map_chr(A, get_levels)),
         B = factor(map_chr(B, get_levels)))

# Now look at factor combinations
df_factors %>%
  group_by(A, B) %>%
  summarize(n = n(),
            mean = mean(Size))

# A tibble: 8 x 4
# Groups:   A [3]
#   A       B           n  mean
#   <fct>   <fct>   <int> <dbl>
# 1 ""      "B1"        1   5  
# 2 ""      "B1,B2"     1   2  
# 3 ""      "B2"        1  55  
# 4 "A1"    ""          1   3  
# 5 "A1"    "B1,B2"     1   5  
# 6 "A1,A2" ""          1   1  
# 7 "A1,A2" "B1"        2   7  
# 8 "A1,A2" "B1,B2"     2   1.5

I named "A" and "B" explicitly. That still seems doable with 6 groups. If you have more, it would become necessary to automate this, but I'm not sure how to do that easily.
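
One possible way to automate the grouping, sketched here in base R rather than with nest(), is to derive the group prefixes from the column names and build one label column per prefix. This assumes every indicator column is named as a prefix followed by digits (A1, A2, B1, ...) and holds "0"/"1" strings as in the example above; label_group is a helper name invented for this sketch.

```r
# Same example data as above, with character "0"/"1" indicators
df <- data.frame(
  Size = c(1, 5, 10, 3, 2, 55, 5, 2, 1, 4),
  A1 = c("1", "0", "1", "1", "1", "0", "1", "0", "1", "1"),
  A2 = c("1", "0", "1", "0", "1", "0", "0", "0", "1", "1"),
  B1 = c("0", "1", "1", "0", "1", "0", "1", "1", "1", "1"),
  B2 = c("0", "0", "0", "0", "1", "1", "1", "1", "1", "0")
)

# Group prefixes derived from the column names: "A", "B", ...
prefixes <- unique(sub("[0-9]+$", "", setdiff(names(df), "Size")))

# For one prefix, label each row with the names of its columns equal to "1"
label_group <- function(prefix) {
  cols <- grep(paste0("^", prefix, "[0-9]+$"), names(df), value = TRUE)
  apply(df[cols] == "1", 1, function(active) paste(cols[active], collapse = ","))
}

# One label column per group, named after its prefix
labels <- as.data.frame(sapply(prefixes, label_group))

# Count and mean of Size for every observed combination of labels
counts <- aggregate(df$Size, by = labels, FUN = length)
means <- aggregate(df$Size, by = labels, FUN = mean)
result <- cbind(counts[prefixes], n = counts$x, mean = means$x)

result
```

With the example data this gives the same eight combinations as the table above, and adding more groups only requires more prefixed columns.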
