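Creating all combinations of sampling from two groups of columns in R
I have the dataframe below, with two "groups" of columns: A & B and D & E. I want to find all combinations of filters applied to columns A & B and D & E, choosing only one column from each group at a time, and then group by each of those combinations. I don't know the right formulation for this, and the real problem is much larger.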
df=
Size A B D E
1 1 1 0 0
5 0 0 1 0
10 1 1 1 0
3 1 0 0 0
2 1 1 1 1
55 0 0 0 1
5 1 0 1 1
2 0 0 1 1
1 1 1 1 1
4 1 1 1 0
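So the filter combinations should be:
Filter 1: A=1 AND D=1
Filter 2: A=1 AND D=0
Filter 3: A=1 AND E=1
Filter 4: A=1 AND E=0
Filter 5: A=0 AND D=1
Filter 6: A=0 AND D=0
Filter 7: A=0 AND E=1
Filter 8: A=0 AND E=0
Filter 9: B=1 AND D=1
Filter 10: B=1 AND D=0
Filter 11: B=1 AND E=1
Filter 12: B=1 AND E=0
Filter 13: B=0 AND D=1
Filter 14: B=0 AND D=0
Filter 15: B=0 AND E=1
Filter 16: B=0 AND E=0
I would like to find a way to create these filter groups efficiently (always drawing one filter from columns A & B and one from columns D & E), and then find the mean and count of the Size column for each filter setting. So far I have only managed to do this without separate groups to sample the filters from.
What I tried looks like this: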
groupNames <- names(df)[2:5]
myGroups <- Map(combn,list(groupNames),seq_along(groupNames),simplify = FALSE) %>% unlist(recursive = FALSE)
results = lapply(myGroups, FUN = function(x) {do.call(what = group_by_, args = c(list(df), x)) %>% summarise( n = length(Size), avgVar1 = mean(Size))})
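It treats the four columns equally and does not take sampling from the 2 groups into account. What can I change in the code to make this work?
Many thanks.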
library(tidyverse)
df <- tribble(~Size, ~A, ~B, ~D, ~E,
1, "1", "1", "0", "0",
5, "0", "0", "1", "0",
10, "1", "1", "1", "0",
3, "1", "0", "0", "0",
2, "1", "1", "1", "1",
55, "0", "0", "0", "1",
5, "1", "0", "1", "1",
2, "0", "0", "1", "1",
1, "1", "1", "1", "1",
4, "1", "1", "1", "0")
p <- function(...) paste0(...) # for legibility; glue would be a cleaner choice
all_filtering_groups <- list(c("A", "B"), c("D", "E")) # assuming these are known
# all ways of choosing 1, 2, ... of the groups to filter on
all_combns <- map(seq_along(all_filtering_groups), ~ combn(all_filtering_groups, .))
res <- vector("list", length(all_combns))
#microbenchmark::microbenchmark({
for(comb_length in seq_along(all_combns)){
  res[[comb_length]] <- vector("list", ncol(all_combns[[comb_length]]))
  for(col_i in seq_len(ncol(all_combns[[comb_length]]))){
    filtering_groups <- all_combns[[comb_length]][,col_i]
    group_names <- as.character(seq_along(filtering_groups))
    # prepare grid of all combinations (one column and one value per group)
    filtering_combs <- c(filtering_groups, rep(list(0:1), length(filtering_groups)))
    names(filtering_combs) <- c(p("vars_", group_names), p("vals_", group_names))
    # keep column names as character so they can index df directly
    full_grid <- expand.grid(filtering_combs, stringsAsFactors = FALSE)
    for(ll in seq_len(nrow(full_grid))){ # for each line in the full_grid
      # find df lines that correspond
      cond <- rep(TRUE, nrow(df))
      for(grp in group_names){
        cond <- cond & df[[full_grid[[p("vars_", grp)]][ll]]] == full_grid[[p("vals_", grp)]][ll]
      }
      # and compute whatever
      full_grid$lines[ll] <- paste(which(cond), collapse = ", ") # for visual verification
      full_grid$n[ll] <- length(df$Size[cond])
      full_grid$sum[ll] <- sum(df$Size[cond])
      full_grid$mean[ll] <- mean(df$Size[cond])
    }
    res[[comb_length]][[col_i]] <- full_grid
  }
}
#}, times = 10) #microbenchmark
bind_rows(res) %>% relocate(starts_with("vars") | starts_with("vals"))
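For the specific two-group case in the question, the 16 filters can also be written out directly as a grid. The following is only a minimal base-R sketch under that assumption (exactly two groups, numeric 0/1 indicators); the helper name `summarise_filter` and the output column names are made up for illustration and are not part of the answer above.

```r
# Same data as in the question, with numeric 0/1 indicators
df <- data.frame(
  Size = c(1, 5, 10, 3, 2, 55, 5, 2, 1, 4),
  A = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1),
  B = c(1, 0, 1, 0, 1, 0, 0, 0, 1, 1),
  D = c(0, 1, 1, 0, 1, 0, 1, 1, 1, 1),
  E = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)
)

# One (column, value) pair from each group: 2 * 2 * 2 * 2 = 16 filters
filters <- expand.grid(
  var1 = c("A", "B"), val1 = 0:1,
  var2 = c("D", "E"), val2 = 0:1,
  stringsAsFactors = FALSE
)

# Hypothetical helper: count and mean of Size for one filter
summarise_filter <- function(var1, val1, var2, val2) {
  keep <- df[[var1]] == val1 & df[[var2]] == val2
  data.frame(var1, val1, var2, val2,
             n = sum(keep),
             mean = if (any(keep)) mean(df$Size[keep]) else NA_real_)
}

result <- do.call(rbind, Map(summarise_filter,
                             filters$var1, filters$val1,
                             filters$var2, filters$val2))
rownames(result) <- NULL
result
```

Each row of `result` corresponds to one of the filters listed in the question; filters that match no rows get `n = 0` and a missing mean.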
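Following the discussion in the comments, I think we can treat the groups as variables. We therefore need to reshape the dataframe to have one column per factor, and then we can use standard tidyverse methods. I am assuming the groups are defined by the column names (A1...Ak, B1...Bk, ...).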
library(tidyverse)
df <- tribble(~Size, ~A1, ~A2, ~B1, ~B2,
1, "1", "1", "0", "0",
5, "0", "0", "1", "0",
10, "1", "1", "1", "0",
3, "1", "0", "0", "0",
2, "1", "1", "1", "1",
55, "0", "0", "0", "1",
5, "1", "0", "1", "1",
2, "0", "0", "1", "1",
1, "1", "1", "1", "1",
4, "1", "1", "1", "0")
get_levels <- function(col){
  paste(names(col)[col == "1"], collapse = ",")
}
# Rewrite with groups as factors
df_factors <- df %>%
  mutate(id = row_number()) %>% #to avoid aggregating same Size
  nest(A = starts_with("A"), B = starts_with("B")) %>%
  mutate(A = factor(map_chr(A, get_levels)),
         B = factor(map_chr(B, get_levels)))
# Now look at factor combinations
df_factors %>%
  group_by(A, B) %>%
  summarize(n = n(),
            mean = mean(Size))
# A tibble: 8 x 4
# Groups: A [3]
# A B n mean
# <fct> <fct> <int> <dbl>
# 1 "" "B1" 1 5
# 2 "" "B1,B2" 1 2
# 3 "" "B2" 1 55
# 4 "A1" "" 1 3
# 5 "A1" "B1,B2" 1 5
# 6 "A1,A2" "" 1 1
# 7 "A1,A2" "B1" 2 7
# 8 "A1,A2" "B1,B2" 2 1.5
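I named "A" and "B" explicitly. With 6 groups that still seems feasible. If you have more, it would be necessary to automate this, but I'm not sure how to do that easily.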
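One possible way to automate the naming is to derive the groups from the column names rather than spelling them out. The sketch below is only an illustration in base R: it assumes that every indicator column is named as a group prefix followed by digits (A1, A2, B1, ...) and that indicators are numeric 0/1. The helper `label_rows` and the `Size_n` / `Size_mean` column names are hypothetical.

```r
df <- data.frame(
  Size = c(1, 5, 10, 3, 2, 55, 5, 2, 1, 4),
  A1 = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1),
  A2 = c(1, 0, 1, 0, 1, 0, 0, 0, 1, 1),
  B1 = c(0, 1, 1, 0, 1, 0, 1, 1, 1, 1),
  B2 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0)
)

# Assumed convention: group name = column name with trailing digits removed
indicator_cols <- setdiff(names(df), "Size")
groups <- split(indicator_cols, sub("[0-9]+$", "", indicator_cols))

# Hypothetical helper: for each row, list the columns of one group set to 1
label_rows <- function(cols) {
  apply(df[cols] == 1, 1, function(on) paste(cols[on], collapse = ","))
}

# One label column per group, however many groups there are
labels <- as.data.frame(lapply(groups, label_rows), stringsAsFactors = FALSE)
labels$Size <- df$Size

# Count and mean of Size for every observed combination of group labels
counts <- aggregate(Size ~ ., data = labels, FUN = length)
means <- aggregate(Size ~ ., data = labels, FUN = mean)
result <- merge(counts, means, by = names(groups), suffixes = c("_n", "_mean"))
result
```

Because `groups` is built from the column names, adding columns such as C1, C2 would add a third label column without changing the rest of the code.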