Pandas - Cumsum, skip row if condition based on the resulting accumulated value
How can I accumulate values while skipping any row that would push the accumulated result over a certain threshold?
threshold = 120
Col1
---
100
5
90
5
8
Expected output:
Acumm_with_condition
---
100
105 (100+5)
NaN (105+90 > threshold, skip)
110 (105+5)
118 (110+8)
Though it's not entirely vectorized, you can use a loop: calculate the cumsum, check whether it has exceeded the threshold, and if it has, set the value at the first position that breaks the threshold to 0, then repeat until no position exceeds the threshold.
import numpy as np

def thresholded_cumsum(df, column, threshold=np.inf, dropped_value_fill=None):
    s = df[column].copy().to_numpy()
    dropped_value_mask = np.zeros_like(s, dtype=bool)
    cur_cumsum = s.cumsum()
    cur_mask = cur_cumsum > threshold
    while cur_mask.any():
        first_above_thresh_idx = np.nonzero(cur_mask)[0][0]
        # Drop the value out of s, note the position of this value within the mask
        s[first_above_thresh_idx] = 0
        dropped_value_mask[first_above_thresh_idx] = True
        # Recalculate the cumsum & threshold mask now that we've dropped the value
        cur_cumsum = s.cumsum()
        cur_mask = cur_cumsum > threshold
    # Optionally overwrite the dropped positions with a marker value
    if dropped_value_fill is not None:
        cur_cumsum[dropped_value_mask] = dropped_value_fill
    return cur_cumsum
Usage:
df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120)
print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90            105
3     5            110
4     8            118
I've included an extra parameter here, dropped_value_fill. This is a value you can use to annotate the output so you can see which values were intentionally dropped for violating the threshold.
With dropped_value_fill=-1:
df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120, dropped_value_fill=-1)
print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90             -1
3     5            110
4     8            118
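The expected output in the question marks the skipped row with NaN rather than -1. An integer NumPy array cannot hold NaN, so assigning NaN into the integer cumsum of an integer column fails. One way around this, shown here only as a sketch, is to run the same loop on a float copy of the column. The inline loop and the variable names below are illustrative and not part of the answer's function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [100, 5, 90, 5, 8]})
threshold = 120

# Work on a float copy so that NaN can be stored in the result later.
s = df["col1"].to_numpy(dtype=float)
dropped = np.zeros(s.shape, dtype=bool)

cumsum = s.cumsum()
while (cumsum > threshold).any():
    # argmax on a boolean array returns the first True position
    idx = np.argmax(cumsum > threshold)
    s[idx] = 0.0          # drop the offending value
    dropped[idx] = True   # remember that this row was dropped
    cumsum = s.cumsum()

# Mark the dropped rows with NaN instead of repeating the previous total.
cumsum[dropped] = np.nan
df["thresh_cumsum"] = cumsum
print(df)
```

With the sample data, only the row holding 90 is dropped, so that row ends up as NaN and the remaining rows carry the running totals 100, 105, 110 and 118 as floats.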
Ended up using:
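import math
import numpy as np

def accumulate_under_threshold(values, threshold, skipped_row_value):
    output = []
    accumulated = 0
    for i, val in enumerate(values):
        if val + accumulated <= threshold:
            accumulated = val + accumulated
            output.append(accumulated)
        else:
            # This row would exceed the threshold, so skip it
            output.append(skipped_row_value)
            # Early exit: if even the smallest remaining value does not fit,
            # every remaining row is skipped as well
            if values[i:].min() > (threshold - accumulated):
                output.extend([skipped_row_value] * (len(values) - 1 - i))
                break
    return np.array(output)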
df['acumm_with_condition'] = accumulate_under_threshold(df['Col1'].values, 120, math.nan)
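As a quick check against the expected output in the question, here is a minimal self-contained sketch of the same single-pass idea without the early-exit optimization. The function name skip_cumsum is hypothetical, and the sample frame is rebuilt from the question's data.

```python
import math
import pandas as pd

def skip_cumsum(values, threshold):
    """Running total that skips any value that would push it past threshold."""
    out = []
    total = 0
    for v in values:
        if total + v <= threshold:
            total += v
            out.append(total)
        else:
            out.append(math.nan)  # skipped row, running total unchanged
    return out

df = pd.DataFrame({"Col1": [100, 5, 90, 5, 8]})
result = skip_cumsum(df["Col1"].tolist(), 120)
df["acumm_with_condition"] = result
print(df)
```

Because the running total only depends on the rows already accepted, a single pass is enough here; the early exit in the version above only saves work when no remaining value can fit.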