简体   繁体   中英

reset cumulative sum base on condition pandas and return other cumulative sum

I have this dataframe -

    counter  duration amount
0         1      0.08  1,235
1         2      0.36  1,170
2         3      1.04  1,222
3         4      0.81  1,207
4         5      3.99  1,109
5         6      1.20  1,261
6         7      4.24  1,068
7         8      3.07  1,098
8         9      2.08  1,215
9        10      4.09  1,043
10       11      2.95  1,176
11       12      3.96  1,038
12       13      3.95  1,119
13       14      3.92  1,074
14       15      3.91  1,076
15       16      1.50  1,224
16       17      3.65    962
17       18      3.85  1,039
18       19      3.82  1,062
19       20      3.34    917

I would like to create another column based on the following logic:

For each row, I want to calculate a running sum of 'duration' but it should be a running sum for the rows that are below the current row (lead and not lag). I would like to stop the calculation when the running sum reaches 5 -> when it reaches 5, I want to return the running sum for 'amount' (with the same logic).

For instance, for 'counter' 1 it should take the first 4 rows (0.08+0.36+1.04+0.81<5) and then to return 1,235+1,170+1,222+1,207=4834

for 'counter' 2 it should take only 0.36 + 1.04 + 0.81<5 and to return 1,170+1,222+1,207=3599

Will appreciate any help!

Let us try build your own logic with loop

c = df.duration.values
v=df.amount.values
result = []
lim=5
check = []
for i in range(len(c)):
    total = 0
    value = 0
    for x, y in zip(v[i:],c[i:]):
        total += y
        value += x
        if total >= lim:
            result.append(value-x)
            print(total)
            break
#result
#[4834, 3599, 2429, 2316, 1109, 1261, 1068, 1098, 1215, 1043, 1176, 1038, 1119, 1074, 1076, 1224, 962, 1039, 1062]

I would first go through the 2 columns once for their cumulative sums.

cum_amount = df['amount'].cumsum()
cum_duration = df['duration'].cumsum()

Get a list ready for the results

results = []

Then loop through each index (equivalent to counter)

for idx in cum_duration.index:
    # keep only rows within `5` and the max. index is where the required numbers are located 
    wanted_idx = (cum_duration[cum_duration<5]).index.max()

    # read those numbers with the wanted index
    results.append({'idx': idx, 'cum_duration': cum_duration[wanted_idx], 'cum_amount': cum_amount[wanted_idx]})

    # subtract the lag (we need only the leads not the lags)
    cum_amount -= cum_amount[idx]
    cum_duration -= cum_duration[idx]

Finally the result in a DataFrame.

pd.DataFrame(results)

    idx cum_duration    cum_amount
0   0   2.29    4834.0
1   1   2.21    3599.0
2   2   1.85    2429.0
3   3   4.80    2316.0
4   4   3.99    1109.0
5   5   1.20    1261.0
6   6   4.24    1068.0
7   7   3.07    1098.0
8   8   2.08    1215.0
9   9   4.09    1043.0
10  10  2.95    1176.0
11  11  3.96    1038.0
12  12  3.95    1119.0
13  13  3.92    1074.0
14  14  3.91    1076.0
15  15  1.50    1224.0
16  16  3.65    962.0
17  17  3.85    1039.0
18  18  3.82    1062.0
19  19  3.34    917.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM