
Splitting a dataframe into a specific number

My task at hand is to figure out how to split a data frame based on the cumulative sum of a column.

As an example, here is a data frame

df <- data.frame(a1 = c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"), a2 = rnorm(20, mean=5, sd=2) )

df2 <- df[order(df$a2),] # df sorted by a2, from smallest to largest

How do I create a list of data frames from df2 in which the a2 column sums to 10 in each piece, without any rows repeating?

The group that each row falls into depends on the values of the previous rows, so this requires an iterative approach. It may be easiest to wrap this in a function:
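To see why a single vectorised cumsum() is not enough, here is a small sketch on made-up values (the vector x and the limit of 10 are assumptions for illustration, not the question's data). Binning the overall cumulative sum puts the cuts in different places from a running total that restarts after each cut:

```r
# Toy values, chosen only for illustration
x <- c(4, 7, 3, 9, 2)
limit <- 10

# Vectorised attempt: bin the overall cumulative sum
naive_group <- cumsum(x) %/% limit

# Iterative version: the running total restarts after each cut
running <- 0
current <- 0
iter_group <- numeric(length(x))
for (i in seq_along(x)) {
  iter_group[i] <- current
  running <- running + x[i]
  if (running > limit) {
    current <- current + 1
    running <- 0
  }
}

naive_group
#> [1] 0 1 1 2 2
iter_group
#> [1] 0 0 1 1 2
```

The two groupings differ because the overshoot past the limit is carried forward in the first version and discarded in the second.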

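split_sum <- function(data, column, split_by = 10) {

  # Capture the unquoted column name and extract that column
  x <- data[[as.character(match.call()$column)]]

  group <- current_group <- value <- 0

  for(i in seq_len(nrow(data))) {
    group[i] <- current_group   # assign the row to the current group
    value <- x[i] + value       # running total within the group
    if(value > split_by) {      # total exceeds the threshold:
      current_group <- current_group + 1   # start a new group
      value <- 0                           # and reset the total
    }
  }

  # Split the rows by group and drop the list names
  setNames(split(data, group), NULL)
}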
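Because match.call() captures the column argument as it was typed, passing a variable that holds the column name will not work. As a minimal sketch, and not part of the original answer, a variant could take the name as a string instead. The name split_sum_chr, its signature and the toy data are all assumptions made for illustration:

```r
# Hypothetical variant of split_sum(); the name and signature are assumptions
split_sum_chr <- function(data, column, split_by = 10) {
  x <- data[[column]]          # column is a string such as "a2"
  stopifnot(is.numeric(x))     # also fails if the column does not exist

  group <- integer(nrow(data))
  current_group <- 0L
  running <- 0

  for (i in seq_len(nrow(data))) {
    group[i] <- current_group
    running <- running + x[i]
    if (running > split_by) {  # close the group once the total exceeds the limit
      current_group <- current_group + 1L
      running <- 0
    }
  }

  unname(split(data, group))
}

set.seed(1)  # toy data, not the question's
toy <- data.frame(a1 = paste0("X", 1:8), a2 = sort(runif(8, 1, 6)))

col <- "a2"  # the column name can now live in a variable
pieces <- split_sum_chr(toy, col, split_by = 10)

# Total of a2 in each piece: every piece except possibly the last exceeds 10
sapply(pieces, function(p) sum(p$a2))
```

Summing a2 per piece is also a quick way to check that each group closes on the first row that pushes its total past split_by.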

Testing this out, we have:

df <- data.frame(a1 = c("X1", "X2", "X3", "X4", "X5",
                        "X6", "X7", "X8", "X9", "X10"), 
                 a2 = sort(rnorm(20, mean = 10, sd = 2)))

split_sum(df, a2, 10)
#> [[1]]
#>   a1       a2
#> 1 X1 6.232476
#> 2 X2 7.466659
#> 
#> [[2]]
#>   a1       a2
#> 3 X3 7.674123
#> 4 X4 7.946872
#> 
#> [[3]]
#>   a1       a2
#> 5 X5 8.202970
#> 6 X6 8.340281
#> 
#> [[4]]
#>   a1       a2
#> 7 X7 9.046769
#> 8 X8 9.323847
#> 
#> [[5]]
#>     a1       a2
#> 9   X9 9.569602
#> 10 X10 9.955981
#> 
#> [[6]]
#>    a1       a2
#> 11 X1 10.58773
#> 
#> [[7]]
#>    a1       a2
#> 12 X2 10.69006
#> 
#> [[8]]
#>    a1      a2
#> 13 X3 10.7256
#> 
#> [[9]]
#>    a1       a2
#> 14 X4 11.61281
#> 
#> [[10]]
#>    a1       a2
#> 15 X5 11.80578
#> 
#> [[11]]
#>    a1       a2
#> 16 X6 11.91369
#> 
#> [[12]]
#>    a1       a2
#> 17 X7 12.40917
#> 
#> [[13]]
#>    a1       a2
#> 18 X8 13.11021
#> 
#> [[14]]
#>    a1       a2
#> 19 X9 14.00975
#> 
#> [[15]]
#>     a1       a2
#> 20 X10 14.11547

The threshold is an argument, so the same function can, for example, start a new group only once the running sum exceeds 50:

split_sum(df, a2, 50)
#> [[1]]
#>   a1       a2
#> 1 X1 6.232476
#> 2 X2 7.466659
#> 3 X3 7.674123
#> 4 X4 7.946872
#> 5 X5 8.202970
#> 6 X6 8.340281
#> 7 X7 9.046769
#> 
#> [[2]]
#>     a1        a2
#> 8   X8  9.323847
#> 9   X9  9.569602
#> 10 X10  9.955981
#> 11  X1 10.587733
#> 12  X2 10.690059
#> 
#> [[3]]
#>    a1       a2
#> 13 X3 10.72560
#> 14 X4 11.61281
#> 15 X5 11.80578
#> 16 X6 11.91369
#> 17 X7 12.40917
#> 
#> [[4]]
#>     a1       a2
#> 18  X8 13.11021
#> 19  X9 14.00975
#> 20 X10 14.11547

Created on 2022-06-17 by the reprex package (v2.0.1)
