
Splitting data into batches conditional on cumulative sum

I'm trying to split my data into batches, each defined by a start_date and an end_date, on the condition that the cumulative sum of num_books within each batch is <= 500000. Say I have a simple data frame with two columns:

index  Date            num_books
0      2021-01-01       200000
1      2021-01-02       240000
2      2021-01-03       55000
3      2021-01-04       400000
4      2021-01-05       80000
5      2021-01-06       100000

I need to accumulate the values in num_books while the running total stays <= 500000, then record the start date, end date and the cumulative sum for that batch before starting the next one. This is an example of what I'm trying to achieve:

start_date  end_date   cumsum_books
2021-01-01  2021-01-03  495000
2021-01-04  2021-01-05  480000
2021-01-06  2021-01-06  100000

Is there an efficient way/function to achieve this? Thank you!

Here's one way:

from io import StringIO as sio

import pandas as pd

# Sample data from the question.
d = sio("""
index  Date            num_books
0      2021-01-01       200000
1      2021-01-02       240000
2      2021-01-03       55000
3      2021-01-04       400000
4      2021-01-05       80000
5      2021-01-06       100000
""")
df = pd.read_csv(d, sep=r'\s+')

# Label each row with a batch number: integer-dividing the running total by
# the threshold bumps the label every time the cumulative sum crosses a
# multiple of 500 000.
batch_size = 5 * 10**5
df['batch_num'] = df['num_books'].cumsum() // batch_size

# Aggregate each batch: first date, last date and total number of books.
result = df.groupby('batch_num').agg(
    start_date=('Date', 'min'),
    end_date=('Date', 'max'),
    cumsum_books=('num_books', 'sum'),
)
print(result)

#           start_date    end_date  cumsum_books
#batch_num                                      
#0          2021-01-01  2021-01-03        495000
#1          2021-01-04  2021-01-05        480000
#2          2021-01-06  2021-01-06        100000

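To see why the integer division produces those groups, it can help to print the running total next to the batch label for each row (running_total below is just a throwaway column added for illustration; the printout looks roughly like the comments):

check = df.assign(running_total=df['num_books'].cumsum())
print(check[['Date', 'num_books', 'running_total', 'batch_num']])

#          Date  num_books  running_total  batch_num
# 0  2021-01-01     200000         200000          0
# 1  2021-01-02     240000         440000          0
# 2  2021-01-03      55000         495000          0
# 3  2021-01-04     400000         895000          1
# 4  2021-01-05      80000         975000          1
# 5  2021-01-06     100000        1075000          2
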
Note that in general the result dataframe can still contain a batch whose total exceeds 500_000 (for instance when a single row is already larger than the threshold), but it's trivial to drop/filter such batches out.
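
For example, a filter along these lines (result_filtered is just an illustrative name) keeps only the batches that respect the limit; with this particular sample nothing is dropped:

# Keep only the batches whose total stays within the threshold.
result_filtered = result[result['cumsum_books'] <= batch_size]
print(result_filtered)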
