
Algorithm to minimize sample pooling to reach minimum mass

I am trying to work out how to address this problem in R.

Brief description of the problem: There is a minimum mass required to run an analysis of samples. Previously collected samples often weigh less than this, so multiple samples within an experimental treatment must be pooled to reach the minimum. However, samples should be pooled as little as possible, to maximize the number of biological replicates.

For example, samples within Treatment A may have these masses: 8g, 7g, 5g, and 10g. Another Treatment, B, has samples with masses of 20g, 21g, 24g, and 29g.

If the minimum mass required for the analysis is 15g, then each sample in Treatment B can be analyzed without pooling. However, in Treatment A, samples would need to be pooled to reach this minimum.

It would be best to combine the 5g and 10g samples and the 8g and 7g samples, because keeping each pooled mass as close to the minimum as possible maximizes the number of pooled samples (i.e., if I combined the 5g and 8g samples and also the 10g and 7g samples, the first pool would weigh only 13g, so only one pooled sample would meet the minimum).
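The arithmetic in this example can be checked with a short base R sketch. The names `pairing_good`, `pairing_bad`, and `count_valid` are made up for illustration and are not part of the original post.

```r
min_mass <- 15

# Two ways of splitting the four Treatment A samples into two pools
pairing_good <- list(c(5, 10), c(8, 7))
pairing_bad  <- list(c(5, 8), c(10, 7))

# Count how many pools in a pairing reach the minimum mass
count_valid <- function(pairing, min_mass) {
  sum(vapply(pairing, function(p) sum(p) >= min_mass, logical(1)))
}

count_valid(pairing_good, min_mass)  # 2 pools reach 15 g
count_valid(pairing_bad, min_mass)   # only 1 pool reaches 15 g
```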

Data and R

The data are structured as in the following example:

sample_id = c(1:24)
treatments = c(rep("A",8),rep("B",8),rep("C",8))
mass = round(c(runif(8,4,10),runif(8,5,13),runif(8,15,18)),1)
# Build the data frame directly; cbind() would coerce every column to character
df = data.frame(sample_id, treatments, mass)

> df
   sample_id treatments mass
1          1          A  8.6
2          2          A  8.9
3          3          A  7.5
4          4          A  4.5
5          5          A  7.9
6          6          A  4.5
7          7          A  7.7
8          8          A  6.6
9          9          B  5.0
10        10          B 12.0
11        11          B  7.4
12        12          B  8.4
13        13          B 12.2
14        14          B 10.0
15        15          B  6.5
16        16          B 12.1
17        17          C 15.6
18        18          C 16.5
19        19          C 16.8
20        20          C 17.5
21        21          C 15.6
22        22          C 17.6
23        23          C 18.0
24        24          C 15.8

So far my strategy has been:

library(dplyr)

# Step 1: separate out all samples that do not need to be pooled, for ease IRL
bigenough = df %>%
  filter(mass >= 15)

#Keep df with all the samples that will need to be pooled
poolneeded = df %>%
  filter(!(sample_id %in% bigenough$sample_id))

However, I am at a loss as to how best to pool the samples algorithmically. Any suggestions would be helpful. I usually use the tidyverse, if that helps.

Here is a first attempt. It consists of a function that takes one piece of the data to be pooled, split by treatment. Inside the function, a new data frame df2 is created that contains all pairwise combinations of sample_id. df2 is then left-joined with data twice (once for each sample in the pair), the sum of the two masses is calculated, and the rows are filtered for sums >= 15 and sorted.

The function is then applied with map after group_split. The result is every allowed pair of samples within each treatment.

library(tidyverse)

# Return every pair of samples in `data` (one treatment) whose combined
# mass reaches the 15 g minimum.
fff <- function(data) {
  # All pairwise combinations of the sample ids actually present in `data`
  mm <- combn(data$sample_id, 2) |> t()
  df2 <- data.frame(mm) |> setNames(c("sample_id", "sample_id_2"))
  ddf <- df2 |>
    left_join(data) |>   # treatment and mass of the first sample
    left_join(data, by = c("sample_id_2" = "sample_id", "treatments")) |>  # mass of the second sample
    mutate(sum = mass.x + mass.y) |>
    filter(sum >= 15) |>
    arrange(sample_id, sum)
  return(ddf)
}
  
  
poolneeded |>
  group_split(treatments) |> 
  map(fff)
#> Joining, by = "sample_id"
#> Joining, by = "sample_id"
#> [[1]]
#>   sample_id sample_id_2 treatments mass.x mass.y  sum
#> 1         1           2          A    9.4    6.2 15.6
#> 2         1           7          A    9.4    6.5 15.9
#> 3         1           4          A    9.4    6.8 16.2
#> 4         1           3          A    9.4    7.6 17.0
#> 5         1           8          A    9.4    8.9 18.3
#> 6         2           8          A    6.2    8.9 15.1
#> 7         3           8          A    7.6    8.9 16.5
#> 8         4           8          A    6.8    8.9 15.7
#> 9         7           8          A    6.5    8.9 15.4
#> 
#> [[2]]
#>    sample_id sample_id_2 treatments mass.x mass.y  sum
#> 1          9          10          B   10.9    7.0 17.9
#> 2          9          14          B   10.9    7.2 18.1
#> 3          9          11          B   10.9    7.9 18.8
#> 4          9          16          B   10.9    8.5 19.4
#> 5          9          13          B   10.9   11.2 22.1
#> 6          9          12          B   10.9   11.7 22.6
#> 7          9          15          B   10.9   11.7 22.6
#> 8         10          16          B    7.0    8.5 15.5
#> 9         10          13          B    7.0   11.2 18.2
#> 10        10          12          B    7.0   11.7 18.7
#> 11        10          15          B    7.0   11.7 18.7
#> 12        11          14          B    7.9    7.2 15.1
#> 13        11          16          B    7.9    8.5 16.4
#> 14        11          13          B    7.9   11.2 19.1
#> 15        11          12          B    7.9   11.7 19.6
#> 16        11          15          B    7.9   11.7 19.6
#> 17        12          14          B   11.7    7.2 18.9
#> 18        12          16          B   11.7    8.5 20.2
#> 19        12          13          B   11.7   11.2 22.9
#> 20        12          15          B   11.7   11.7 23.4
#> 21        13          14          B   11.2    7.2 18.4
#> 22        13          16          B   11.2    8.5 19.7
#> 23        13          15          B   11.2   11.7 22.9
#> 24        14          16          B    7.2    8.5 15.7
#> 25        14          15          B    7.2   11.7 18.9
#> 26        15          16          B   11.7    8.5 20.2
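
The tables above list every admissible pair but do not choose among them. If each pool is assumed to hold exactly two samples, one possible way to pick non-overlapping pairs is a two-pointer pass over the sorted masses: try the lightest remaining sample with the heaviest remaining one, and skip the lightest if even that partner is not enough. The sketch below is base R; `pair_samples` and its arguments are hypothetical names, not part of the answer's code. Under the two-samples-per-pool assumption this should give the largest possible number of pairs, but it never builds pools of three or more.

```r
# Hypothetical helper: choose non-overlapping pairs whose combined mass
# reaches min_mass. Assumes every pool holds exactly two samples.
pair_samples <- function(ids, mass, min_mass = 15) {
  ord <- order(mass)            # indices from lightest to heaviest
  lo <- 1L
  hi <- length(ord)
  pairs <- list()
  while (lo < hi) {
    i <- ord[lo]
    j <- ord[hi]
    if (mass[i] + mass[j] >= min_mass) {
      # Lightest remaining sample reaches the minimum with the heaviest: pool them
      pairs[[length(pairs) + 1L]] <- data.frame(
        sample_id = ids[i], sample_id_2 = ids[j], sum = mass[i] + mass[j]
      )
      lo <- lo + 1L
      hi <- hi - 1L
    } else {
      # Lightest sample cannot reach the minimum with any single partner: skip it
      lo <- lo + 1L
    }
  }
  do.call(rbind, pairs)         # NULL if no pair qualifies
}

# Treatment A from the question: pairs 5 g with 10 g and 7 g with 8 g
pair_samples(ids = 1:4, mass = c(8, 7, 5, 10))
```

To apply this per treatment, the function could be called on each piece of a split data frame, in the same way fff is used with group_split and map.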

Another way

This uses the same function fff as above, but it must be called with a subset of poolneeded for a single treatment; in the example below, the subset where treatments == "B". You are shown a data frame of all allowed combinations for pooling and can choose a first pair. The choices that remain for a second pooling are then shown.

sel2 <- function(data) {
  ddf <- fff(data)
  cat(paste("\n", "These are your possibilities for the FIRST pooling", "\n"))
  print(ddf)
  ask <- askYesNo("Do You want to make first choice?")
  # askYesNo() returns NA on cancel, so test with isTRUE()
  if (!isTRUE(ask)) {
    return(invisible(NULL))
  }
  # readline() returns character; convert to numeric to match sample_id
  s_1 <- as.numeric(readline(prompt = "Enter sample 1: "))
  s_2 <- as.numeric(readline(prompt = "Enter sample 2: "))
  # Drop every pair that reuses either of the chosen samples
  ddf2 <- ddf |> filter(
    !sample_id %in% c(s_1, s_2),
    !sample_id_2 %in% c(s_1, s_2)
  )
  cat(paste0("\n", "These are your possibilities for the SECOND pooling", "\n"))
  print(ddf2)
}

poolneeded_b <- poolneeded |> filter(treatments == "B")
sel2(poolneeded_b)


#> r$> sel2(poolneeded_b)
#> Joining, by = "sample_id"
#> 
#>  These are your possibilities for the FIRST pooling 
#>    sample_id sample_id_2 treatments mass.x mass.y  sum
#> 1          9          14          B    7.6    8.1 15.7
#> 2          9          12          B    7.6    8.7 16.3
#> 3          9          15          B    7.6    9.6 17.2
#> 4          9          10          B    7.6   10.3 17.9
#> 5          9          16          B    7.6   10.9 18.5
#> 6          9          13          B    7.6   12.3 19.9
#> 7         10          11          B   10.3    5.6 15.9
#> 8         10          14          B   10.3    8.1 18.4
#> 9         10          12          B   10.3    8.7 19.0
#> 10        10          15          B   10.3    9.6 19.9
#> 11        10          16          B   10.3   10.9 21.2
#> 12        10          13          B   10.3   12.3 22.6
#> 13        11          15          B    5.6    9.6 15.2
#> 14        11          16          B    5.6   10.9 16.5
#> 15        11          13          B    5.6   12.3 17.9
#> 16        12          14          B    8.7    8.1 16.8
#> 17        12          15          B    8.7    9.6 18.3
#> 18        12          16          B    8.7   10.9 19.6
#> 19        12          13          B    8.7   12.3 21.0
#> 20        13          14          B   12.3    8.1 20.4
#> 21        13          15          B   12.3    9.6 21.9
#> 22        13          16          B   12.3   10.9 23.2
#> 23        14          15          B    8.1    9.6 17.7
#> 24        14          16          B    8.1   10.9 19.0
#> 25        15          16          B    9.6   10.9 20.5
#> 
#> Do You want to make first choice? (Yes/no/cancel) y
#> Enter sample 1: 9
#> Enter sample 2: 14
#> 
#> These are your possibilities for the SECOND pooling
#>    sample_id sample_id_2 treatments mass.x mass.y  sum
#> 1         10          11          B   10.3    5.6 15.9
#> 2         10          12          B   10.3    8.7 19.0
#> 3         10          15          B   10.3    9.6 19.9
#> 4         10          16          B   10.3   10.9 21.2
#> 5         10          13          B   10.3   12.3 22.6
#> 6         11          15          B    5.6    9.6 15.2
#> 7         11          16          B    5.6   10.9 16.5
#> 8         11          13          B    5.6   12.3 17.9
#> 9         12          15          B    8.7    9.6 18.3
#> 10        12          16          B    8.7   10.9 19.6
#> 11        12          13          B    8.7   12.3 21.0
#> 12        13          15          B   12.3    9.6 21.9
#> 13        13          16          B   12.3   10.9 23.2
#> 14        15          16          B    9.6   10.9 20.5
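
The interactive selection above handles one pair at a time and only ever pools two samples. When samples are light enough that three or more are needed, a greedy heuristic is one option: seed each pool with the heaviest remaining sample and top it up with the lightest remaining samples until the minimum is reached. The base R sketch below is only a heuristic and is not guaranteed to find the maximum number of pools; `greedy_pool` and its arguments are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: greedily assign samples to pools of any size.
# Heuristic only; it may not find the maximum possible number of pools.
greedy_pool <- function(ids, mass, min_mass = 15) {
  ord <- order(mass)                      # indices from lightest to heaviest
  pool <- rep(NA_integer_, length(ids))   # pool number per sample, NA = unused
  lo <- 1L
  hi <- length(ord)
  current <- 0L
  while (lo <= hi) {
    members <- ord[hi]                    # seed with the heaviest sample left
    hi <- hi - 1L
    # Top up with the lightest remaining samples until the minimum is reached
    while (sum(mass[members]) < min_mass && lo <= hi) {
      members <- c(members, ord[lo])
      lo <- lo + 1L
    }
    if (sum(mass[members]) >= min_mass) {
      current <- current + 1L
      pool[members] <- current
    }
  }
  data.frame(sample_id = ids, mass = mass, pool = pool)
}

# Treatment A from the question: 10 g + 5 g form pool 1, 8 g + 7 g form pool 2
greedy_pool(ids = 1:4, mass = c(8, 7, 5, 10))
```

Samples left with pool = NA could not be brought up to the minimum; in practice they might be added to an existing pool rather than discarded.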
