简体   繁体   中英

All possible combinations in data set with nested groups that are ordered sequencially in R

I posted this question earlier, but after a few comments I realized a better explanation of the df and result are needed. Additionally, I realized a simplified data structure can be used to achieve my goals.

I need to find all combinations of values within a specific sequence.

In the example df you can see the nested grouping structure, with GROUP nested within SAMPLE and SITE . My actual data set has ~1000 SAMPLE across 30 SITE . Note that each GROUP is in sequence (1 to 3) and this sequence needs to be retained in the final result as this is a critical part of future analyses. In other words, combinations that put GROUP 2 or 3 before GROUP 1 are not needed.

Within each GROUP are 2 or 3 text or numeric values in the ASSIGN_1 , ASSIGN_2 , and ASSIGN_3 columns. For example, F1 has 3 groups and each group has 2 possibilities: GROUP 1 has 4 and unk_palmer_trib; GROUP 2 has 6.1 and 5.2; and 'GROUP' 3 has 10.1 and 6.1. You can see SAMPLE B2 also has three groups, with GROUP 1 and 3 having two possibilities and GROUP 2 having three possibilities.

So…I need to find all the possible combinations of the values in the ASSIGN columns while retaining the sequence of the GROUP . Note that OPTION columns in the result are the various combinations of values from the ASSIGN columns in the df and you will notice that these combinations retain the GROUP sequence (ie, 1 to 3). Also note that the number of OPTION columns with data (ie, no “NAs”) in the result correspond to all the possible combinations of the ASSIGN columns that retain the GROUP sequence. SAMPLE F1 has 3 groups each with 2 possibilities and thus there are 8 possible outcomes (ie, 2 x 2 x 2; see OPTION columns 1 to 8). SAMPLE B2 has 3 groups (2 groups with two possibilities and 1 group with three possibilities) and thus there are 12 possible outcomes (ie, 2 x 3 x 2; see OPTION columns 1 to 12).

To really push this point home, lets look at SAMPLE F1 in the df and the result to illustrate how these combinations need to be structured. OPTION_1 for F1 is just the ASSIGN_1 column from df . OPTION_2 is row 1 from ASSIGN_1 (ie, GROUP_1 = 4) and then rows 2 and 3 from ASSIGN_2 (ie, GROUP_2 = 5.2 and GROUP_3 = 6.1). 'OPTION_3' is row 1 from ASSIGN_1 (ie, GROUP_1 = 4), row 2 from ASSIGN_2 (ie, GROUP_2 = 5.2) and row 3 from ASSIGN_1 (ie, GROUP_3 = 10.1). Repeat at nauseum….

In my actual data set, some of the GROUP can have up to 5 ASSIGN columns…so the number of possible combinations can be large for some SAMPLE .

I have tried using expand.grid() within ave() , but have not been able to get the code to work. I am open to all solutions, but prefer base package solutions as I want to avoid having to load packages. I also suspect that the df likely needs to be restructured in some way…which is perfectly fine so long as the GROUP sequence is retained in the final product.

Let me know if I need to clarify.

Thanks in advance for your help.

df <- read.table(text = "SITE   SAMPLE  GROUP   ASSIGN_1    ASSIGN_2    ASSIGN_3
A1  F1  1   4   unk_palmer_trib NA
A1  F1  2   6.1 5.2 NA
A1  F1  3   10.1    6.1 NA
M15 B2  1   6.2 6.4 NA
M15 B2  2   10.1    6.1 5.2
M15 B2  3   10.1    6.1 NA
", header = TRUE)

result <- read.table(text = "SITE   SAMPLE  GROUP   OPTION_1    OPTION_2    OPTION_3    OPTION_4    OPTION_5    OPTION_6    OPTION_7    OPTION_8    OPTION_9    OPTION_10   OPTION_11   OPTION_12
A1  F1  1   4   4   4   4   unk_palmer_trib unk_palmer_trib unk_palmer_trib unk_palmer_trib NA  NA  NA  NA
A1  F1  2   6.1 5.2 5.2 6.1 6.1 5.2 5.2 6.1 NA  NA  NA  NA
A1  F1  3   10.1    6.1 10.1    6.1 10.1    6.1 10.1    6.1 NA  NA  NA  NA
M15 B2  1   6.2 6.2 6.2 6.2 6.2 6.2 6.4 6.4 6.4 6.4 6.4 6.4
M15 B2  2   10.1    6.1 10.1    6.1 5.2 5.2 10.1    6.1 10.1    6.1 5.2 5.2
M15 B2  3   10.1    6.1 6.1 10.1    10.1    6.1 10.1    6.1 6.1 10.1    10.1    6.1
", header = TRUE)

Here's a stab:-)

func <- function(x, keep = integer(0)) {
  if (length(keep)) {
    saved <- x[,keep,drop=FALSE]
    x <- x[,-keep]
  } else {
    saved <- x[,0] # empty column
  }
  out <- t(do.call(expand.grid, asplit(t(x), 2)))
  out <- as.data.frame(out[, colSums(is.na(out)) == 0])
  colnames(out) <- paste0("OPTION_", seq_along(out))
  cbind(saved, out)
}

The keep= argument is an integer vector of the fields that need to be preserved and not expanded in the combinations.

Demonstration:

LOF <- Filter(length, by(df, df[,c("SITE","SAMPLE")], FUN = func, keep = 1:3))
allnames <- unique(unlist(lapply(LOF, colnames)))
LOF <- lapply(LOF, function(z) { z[setdiff(allnames, colnames(z))] <- NA; z; })
do.call(rbind, LOF)
#   SITE SAMPLE GROUP OPTION_1        OPTION_2 OPTION_3        OPTION_4 OPTION_5        OPTION_6 OPTION_7        OPTION_8 OPTION_9 OPTION_10 OPTION_11 OPTION_12
# 4  M15     B2     1      6.2             6.4      6.2             6.4      6.2             6.4      6.2             6.4      6.2       6.4       6.2       6.4
# 5  M15     B2     2     10.1            10.1      6.1             6.1      5.2             5.2     10.1            10.1      6.1       6.1       5.2       5.2
# 6  M15     B2     3     10.1            10.1     10.1            10.1     10.1            10.1      6.1             6.1      6.1       6.1       6.1       6.1
# 1   A1     F1     1      4.0 unk_palmer_trib      4.0 unk_palmer_trib      4.0 unk_palmer_trib      4.0 unk_palmer_trib     <NA>      <NA>      <NA>      <NA>
# 2   A1     F1     2      6.1             6.1      5.2             5.2      6.1             6.1      5.2             5.2     <NA>      <NA>      <NA>      <NA>
# 3   A1     F1     3     10.1            10.1     10.1            10.1      6.1             6.1      6.1             6.1     <NA>      <NA>      <NA>      <NA>

If you are already using data.table , then this can be simplified to the first plus either of the second or third:

LOF <- Filter(length, by(df, df[,c("SITE","SAMPLE")], FUN = func, keep = 1:3))
data.table::rbindlist(LOF, fill = TRUE, use.names = TRUE)
dplyr::bind_rows(LOF)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM