Pandas - Cumsum, skip rows if a condition based on the running total is met

Question

How can I compute a cumulative sum that skips any row whose value would push the running total above a threshold?

threshold = 120
Col1
---
100
5
90
5
8

Expected output:
Acumm_with_condition
---
100
105     (100+5)
NaN     (105+90 > threshold, skip)
110     (105+5)
118     (110+8)

Answer 1

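Although it is not fully vectorized, you can use a loop that computes the cumsum, checks whether it exceeds the threshold, and if it does, sets the first value that breaks the threshold to 0 and restarts the loop.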

import numpy as np

def thresholded_cumsum(df, column, threshold=np.inf, dropped_value_fill=None):
    s = df[column].copy().to_numpy()
    dropped_value_mask = np.zeros_like(s, dtype=bool)
    
    cur_cumsum = s.cumsum()
    cur_mask = cur_cumsum > threshold
    
    while cur_mask.any():
        first_above_thresh_idx = np.nonzero(cur_mask)[0][0]
        
        # Drop the value out of s, note the position of this value within the mask
        s[first_above_thresh_idx] = 0
        dropped_value_mask[first_above_thresh_idx] = True

        # Recalculate the cumsum & threshold mask now that we've dropped the value
        cur_cumsum = s.cumsum()
        cur_mask = cur_cumsum > threshold
            
    if dropped_value_fill is not None:
        cur_cumsum[dropped_value_mask] = dropped_value_fill
        
    return cur_cumsum

Usage:

df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120)

print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90            105
3     5            110
4     8            118

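I added an extra parameter here, dropped_value_fill, which is essentially a value you can use to annotate the output so you know which rows were deliberately dropped for breaking the threshold.

With dropped_value_fill=-1: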

df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120, dropped_value_fill=-1)

print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90             -1
3     5            110
4     8            118
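
The question's expected output marks the skipped row with NaN rather than -1. Passing dropped_value_fill=np.nan to the function above fails on an integer column, because a NumPy integer array cannot store NaN. A minimal sketch of one workaround, assuming a non-negative threshold: run the same drop-and-recompute loop on a float copy of the data. The helper name thresholded_cumsum_nan is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def thresholded_cumsum_nan(values, threshold):
    """Hypothetical helper: drop-and-recompute cumsum that marks skipped rows with NaN."""
    # Work on a float copy so NaN can be stored and the input is left untouched
    s = np.asarray(values, dtype=float).copy()
    dropped = np.zeros(s.shape, dtype=bool)

    cumsum = s.cumsum()
    while (cumsum > threshold).any():
        # argmax on a boolean array returns the index of the first True
        idx = np.argmax(cumsum > threshold)
        s[idx] = 0
        dropped[idx] = True
        cumsum = s.cumsum()

    # Annotate the rows that were skipped
    cumsum[dropped] = np.nan
    return cumsum


df = pd.DataFrame({"Col1": [100, 5, 90, 5, 8]})
df["Acumm_with_condition"] = thresholded_cumsum_nan(df["Col1"], threshold=120)
print(df)
```

The result column is float, which is the usual cost of representing missing values with NaN in NumPy-backed columns.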

Answer 2

I ended up using:

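    import math

    import numpy as np

    def accumulate_under_threshold(values, threshold, skipped_row_value):
        output = []
        accumulated = 0
        for i, val in enumerate(values):
            if val + accumulated <= threshold:
                # The value fits under the threshold: add it to the running total
                accumulated = val + accumulated
                output.append(accumulated)
            else:
                # Adding this value would exceed the threshold: skip the row
                output.append(skipped_row_value)
                # Early exit: if even the smallest remaining value no longer fits,
                # every remaining row is skipped as well
                if values[i:].min() > (threshold - accumulated):
                    output.extend([skipped_row_value] * (len(values) - 1 - i))
                    break
        return np.array(output)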
    
    df['acumm_with_condition'] = accumulate_under_threshold(df['Col1'].values, 120, math.nan)
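
The early exit in the function above only saves work; it does not change the result, since a row is skipped whenever its value exceeds the remaining room under the threshold. A stripped-down sketch of the same single pass without the early exit, which also works on plain Python lists. The name accumulate_simple and the second sample input are hypothetical, chosen only for illustration.

```python
import math


def accumulate_simple(values, threshold, skipped_row_value=math.nan):
    """Hypothetical simplified variant: one pass over the values, no early exit."""
    output = []
    accumulated = 0
    for val in values:
        if accumulated + val <= threshold:
            # The value fits: extend the running total
            accumulated += val
            output.append(accumulated)
        else:
            # The value would exceed the threshold: mark the row as skipped
            output.append(skipped_row_value)
    return output


print(accumulate_simple([100, 5, 90, 5, 8], 120))
# [100, 105, nan, 110, 118]

# Once nothing else fits, every remaining row is skipped
print(accumulate_simple([100, 50, 60, 70], 120))
# [100, nan, nan, nan]
```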

2 answers

Answer 1
0 2021-01-28 01:08:10

Answer 2
0 Accepted 2021-01-28 02:36:26