Splitting a dataframe into a specific number

Question

My task at hand is to figure out how to split a data frame based on the cumulative sum of a column.

As an example, here is a data frame:

df <- data.frame(a1 = c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"), a2 = rnorm(20, mean=5, sd=2) )

df2 <- df[order(df$a2),] # df sorted by a2, from smallest to largest

How do I split df2 into a list of data frames so that the a2 column sums to 10 in each one, without any rows repeating?

Answer 1

The group each row falls into depends on the values in the previous rows, so this requires an iterative approach. It may be easiest to wrap this in a function:

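split_sum <- function(data, column, split_by = 10) {

  # Look up the column from its unquoted name, e.g. split_sum(df, a2)
  x <- data[[as.character(match.call()$column)]]

  # group: group id per row; value: running sum within the current group
  group <- current_group <- value <- 0
  
  # seq_len() also behaves correctly when data has zero rows
  for(i in seq_len(nrow(data))) {
    group[i] <- current_group
    value <- x[i] + value
    # Once the running sum exceeds the threshold, start a new group
    if(value > split_by) {
      current_group <- current_group + 1
      value <- 0
    }
  }
  
  # Split by group id and drop the list names
  setNames(split(data, group), NULL)
}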
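
To see why the running total has to restart, it may help to compare it with simply bucketing one global cumulative sum. This is only a minimal sketch on made-up values (`x`, `threshold`, `naive_group` and `reset_group` are illustrative names, not part of the question):

```r
# Hypothetical toy values, chosen only for illustration
x <- c(6, 7, 2, 3, 9, 4)
threshold <- 10

# Naive attempt: bucket the global running total by the threshold
naive_group <- cumsum(x) %/% threshold

# Iterative version: restart the running total after each split
reset_group <- integer(length(x))
running <- 0
current <- 0L
for (i in seq_along(x)) {
  reset_group[i] <- current
  running <- running + x[i]
  if (running > threshold) {
    current <- current + 1L
    running <- 0
  }
}

# Per-group sums under each grouping
naive_sums <- tapply(x, naive_group, sum)
reset_sums <- tapply(x, reset_group, sum)
```

With the naive grouping, some groups end up below the threshold because the overshoot from earlier groups carries over; with the restarting total, every group except possibly the last exceeds it.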

Testing this out, we have:

df <- data.frame(a1 = c("X1", "X2", "X3", "X4", "X5",
                        "X6", "X7", "X8", "X9", "X10"), 
                 a2 = sort(rnorm(20, mean = 10, sd = 2)))

split_sum(df, a2, 10)
#> [[1]]
#>   a1       a2
#> 1 X1 6.232476
#> 2 X2 7.466659
#> 
#> [[2]]
#>   a1       a2
#> 3 X3 7.674123
#> 4 X4 7.946872
#> 
#> [[3]]
#>   a1       a2
#> 5 X5 8.202970
#> 6 X6 8.340281
#> 
#> [[4]]
#>   a1       a2
#> 7 X7 9.046769
#> 8 X8 9.323847
#> 
#> [[5]]
#>     a1       a2
#> 9   X9 9.569602
#> 10 X10 9.955981
#> 
#> [[6]]
#>    a1       a2
#> 11 X1 10.58773
#> 
#> [[7]]
#>    a1       a2
#> 12 X2 10.69006
#> 
#> [[8]]
#>    a1      a2
#> 13 X3 10.7256
#> 
#> [[9]]
#>    a1       a2
#> 14 X4 11.61281
#> 
#> [[10]]
#>    a1       a2
#> 15 X5 11.80578
#> 
#> [[11]]
#>    a1       a2
#> 16 X6 11.91369
#> 
#> [[12]]
#>    a1       a2
#> 17 X7 12.40917
#> 
#> [[13]]
#>    a1       a2
#> 18 X8 13.11021
#> 
#> [[14]]
#>    a1       a2
#> 19 X9 14.00975
#> 
#> [[15]]
#>     a1       a2
#> 20 X10 14.11547

The threshold is a parameter, so the same function can instead start a new group only once the running sum exceeds 50, for example:

split_sum(df, a2, 50)
#> [[1]]
#>   a1       a2
#> 1 X1 6.232476
#> 2 X2 7.466659
#> 3 X3 7.674123
#> 4 X4 7.946872
#> 5 X5 8.202970
#> 6 X6 8.340281
#> 7 X7 9.046769
#> 
#> [[2]]
#>     a1        a2
#> 8   X8  9.323847
#> 9   X9  9.569602
#> 10 X10  9.955981
#> 11  X1 10.587733
#> 12  X2 10.690059
#> 
#> [[3]]
#>    a1       a2
#> 13 X3 10.72560
#> 14 X4 11.61281
#> 15 X5 11.80578
#> 16 X6 11.91369
#> 17 X7 12.40917
#> 
#> [[4]]
#>     a1       a2
#> 18  X8 13.11021
#> 19  X9 14.00975
#> 20 X10 14.11547

Created on 2022-06-17 by the reprex package (v2.0.1)
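
If you want to confirm that no rows repeat and that each piece crosses the threshold, a check along these lines could be used. It is a sketch under assumptions: `split_sum_chr()` is a hypothetical variant that takes the column name as a string so the block is self-contained, and the seed is arbitrary.

```r
# Hypothetical variant of split_sum() taking the column name as a string
split_sum_chr <- function(data, column, split_by = 10) {
  x <- data[[column]]
  group <- integer(nrow(data))
  current_group <- 0L
  value <- 0
  for (i in seq_len(nrow(data))) {
    group[i] <- current_group
    value <- value + x[i]
    if (value > split_by) {
      current_group <- current_group + 1L
      value <- 0
    }
  }
  unname(split(data, group))
}

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(a1 = paste0("X", 1:10),
                 a2 = sort(rnorm(20, mean = 10, sd = 2)))
parts <- split_sum_chr(df, "a2", 50)

# Every original row should appear exactly once across the pieces
all_rows <- unlist(lapply(parts, rownames))
no_repeats <- !anyDuplicated(all_rows) && length(all_rows) == nrow(df)

# Every piece except possibly the last should exceed the threshold
sums <- vapply(parts, function(p) sum(p$a2), numeric(1))
complete <- all(head(sums, -1) > 50)
```

Note that the pieces sum to more than the threshold rather than exactly to it, since a row is only ever assigned whole.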

1 answers

solution1
1 2022-06-17 17:16:37