
Pandas - Cumsum, skip row if condition based on the resulting accumulated value

How can I accumulate values while skipping any row whose addition would push the accumulated result above a certain threshold?

threshold = 120
Col1
---
100
5
90
5
8

Expected output:
Acumm_with_condition
---
100
105     (100+5)
NaN     (105+90 > threshold, skip )
110     (105+5)
118     (110+8)
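
For reference, the example data can be built as follows. This is only a sketch of the setup: the column name `Col1` and the threshold come from the tables above, while the variable names `df` and `plain` are my own. It also shows that a plain `cumsum` crosses the threshold at the third row, which is the row that has to be skipped.

```python
import pandas as pd

# Example data from the question (column name "Col1" as in the table above).
df = pd.DataFrame({"Col1": [100, 5, 90, 5, 8]})
threshold = 120

# A plain cumulative sum keeps adding every row, so it exceeds the
# threshold from the third row onward.
plain = df["Col1"].cumsum()
print(plain.tolist())

# Rows where the plain running total is above the threshold.
print((plain > threshold).tolist())
```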

Though it's not entirely vectorized, you can use a loop: calculate the cumsum, check whether it has exceeded the threshold, and if it has, set the value at the first position that breaks the threshold to 0 and recalculate. Repeat until no value in the cumsum exceeds the threshold.

import numpy as np


def thresholded_cumsum(df, column, threshold=np.inf, dropped_value_fill=None):
    # Work on a copy so the original column is left untouched
    s = df[column].copy().to_numpy()
    dropped_value_mask = np.zeros_like(s, dtype=bool)

    cur_cumsum = s.cumsum()
    cur_mask = cur_cumsum > threshold

    while cur_mask.any():
        first_above_thresh_idx = np.nonzero(cur_mask)[0][0]

        # Drop the value out of s, note the position of this value within the mask
        s[first_above_thresh_idx] = 0
        dropped_value_mask[first_above_thresh_idx] = True

        # Recalculate the cumsum & threshold mask now that we've dropped the value
        cur_cumsum = s.cumsum()
        cur_mask = cur_cumsum > threshold

    # Optionally mark the dropped rows in the output
    if dropped_value_fill is not None:
        cur_cumsum[dropped_value_mask] = dropped_value_fill

    return cur_cumsum

Usage:

df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120)

print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90            105
3     5            110
4     8            118

I've included an extra parameter here, dropped_value_fill. It is a value you can use to annotate your output so you can tell which rows were intentionally dropped for breaking the threshold.

With dropped_value_fill=-1:

df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120, dropped_value_fill=-1)

print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90             -1
3     5            110
4     8            118
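
To get the NaN shown in the question's expected output, the fill value would have to be `np.nan`. One caveat: the cumsum of an integer column is an integer array, and NumPy raises an error when NaN is assigned into an integer array, so the values need to be cast to float first. Below is a minimal sketch of that idea. The helper name `thresholded_cumsum_nan` is hypothetical, and it inlines the same loop as above rather than calling the original function.

```python
import numpy as np
import pandas as pd


def thresholded_cumsum_nan(values, threshold):
    """Hypothetical float variant: rows skipped for breaking the threshold become NaN.

    Assumes non-negative values, as in the example data.
    """
    # Cast to float so that NaN can be stored in the result.
    s = np.array(values, dtype=float)
    dropped = np.zeros(s.shape, dtype=bool)

    cumsum = s.cumsum()
    while (cumsum > threshold).any():
        # argmax on a boolean array returns the index of the first True.
        idx = np.argmax(cumsum > threshold)
        s[idx] = 0.0
        dropped[idx] = True
        cumsum = s.cumsum()

    cumsum[dropped] = np.nan
    return cumsum


df = pd.DataFrame({"col1": [100, 5, 90, 5, 8]})
df["thresh_cumsum"] = thresholded_cumsum_nan(df["col1"], threshold=120)
result = df["thresh_cumsum"].tolist()
print(df)
```

The resulting column is float, which is the usual trade-off when NaN marks missing entries in a numeric column.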

Ended up using:

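import math

import numpy as np


def accumulate_under_threshold(values, threshold, skipped_row_value):
    # values is expected to be a NumPy array (e.g. df['Col1'].values)
    output = []
    accumulated = 0
    for i, val in enumerate(values):
        if val + accumulated <= threshold:
            accumulated = val + accumulated
            output.append(accumulated)
        else:
            # This row would break the threshold, so mark it as skipped
            output.append(skipped_row_value)
            # Early exit: if no remaining value fits, every later row is skipped too
            if values[i:].min() > (threshold - accumulated):
                output.extend([skipped_row_value] * (len(values) - 1 - i))
                break
    return np.array(output)


df['acumm_with_condition'] = accumulate_under_threshold(df['Col1'].values, 120, math.nan)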
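
The early-exit check in the function above is an optimization: once no remaining value fits under the threshold, the rest of the output can be filled in one step. The price is that it rescans the tail with `min()` each time a row is skipped. For comparison, here is a simpler single-pass sketch without that check. The name `accumulate_simple` is hypothetical, and the default of NaN for skipped rows is my assumption.

```python
import math


def accumulate_simple(values, threshold, skipped_row_value=math.nan):
    """Hypothetical single-pass variant without the early-exit check."""
    output = []
    accumulated = 0
    for val in values:
        if accumulated + val <= threshold:
            # The row fits under the threshold: add it to the running total.
            accumulated += val
            output.append(accumulated)
        else:
            # The row would break the threshold: skip it, keep the running total.
            output.append(skipped_row_value)
    return output


result = accumulate_simple([100, 5, 90, 5, 8], threshold=120)
print(result)
```

Each value is visited exactly once, so the cost grows linearly with the number of rows regardless of how many are skipped.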
