Pandas - Cumsum, skip row if condition based on the resulting accumulated value
How can I compute a cumulative sum that skips any row whose value would push the accumulated result over a threshold?
threshold = 120
Col1
---
100
5
90
5
8
Expected output:
Acumm_with_condition
---
100
105 (100+5)
NaN (105+90 > threshold, skip)
110 (105+5)
118 (110+8)
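It isn't fully vectorized, but you can use a loop: compute the cumsum, check whether it exceeds the threshold, and if it does, set the first value that breaks the threshold to 0 and run the loop again.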
import numpy as np

def thresholded_cumsum(df, column, threshold=np.inf, dropped_value_fill=None):
    # Work on a writable copy so the DataFrame's own data is never modified
    s = df[column].to_numpy(copy=True)
    dropped_value_mask = np.zeros_like(s, dtype=bool)

    cur_cumsum = s.cumsum()
    cur_mask = cur_cumsum > threshold
    while cur_mask.any():
        first_above_thresh_idx = np.nonzero(cur_mask)[0][0]

        # Drop the value out of s, note the position of this value within the mask
        s[first_above_thresh_idx] = 0
        dropped_value_mask[first_above_thresh_idx] = True

        # Recalculate the cumsum & threshold mask now that we've dropped the value
        cur_cumsum = s.cumsum()
        cur_mask = cur_cumsum > threshold

    # Optionally mark the dropped positions in the output
    if dropped_value_fill is not None:
        cur_cumsum[dropped_value_mask] = dropped_value_fill
    return cur_cumsum
Usage:
df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120)
print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90            105
3     5            110
4     8            118
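To make the loop's behavior concrete, here is a small trace on the sample data. It is only a sketch of the same idea, not the function above: the names `values`, `s`, and `steps` are illustrative, and it records the cumsum before the loop and after each dropped value.

```python
import numpy as np

values = np.array([100, 5, 90, 5, 8])
threshold = 120

s = values.copy()
steps = [s.cumsum().tolist()]  # cumsum before anything is dropped

while (s.cumsum() > threshold).any():
    # Position of the first running total that exceeds the threshold
    first = int(np.argmax(s.cumsum() > threshold))
    s[first] = 0  # drop that value from the sum
    steps.append(s.cumsum().tolist())

for step in steps:
    print(step)
```

Only one value (the 90) has to be dropped here, so the loop body runs once. Each dropped value triggers a full recomputation of the cumsum, so the cost grows with the number of dropped rows.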
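I've added an extra parameter here, dropped_value_fill, which is essentially a value you can use to annotate the output so you know which values were deliberately dropped for violating the threshold.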
With dropped_value_fill=-1:
df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120, dropped_value_fill=-1)
print(df)
   col1  thresh_cumsum
0   100            100
1     5            105
2    90             -1
3     5            110
4     8            118
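The question's expected output marks skipped rows with NaN rather than -1. An integer NumPy array cannot store NaN, so with integer input the fill would most likely need a float array. A minimal sketch of that step, where the `cumsum` and `dropped` arrays are hard-coded stand-ins for what thresholded_cumsum tracks internally on the sample data:

```python
import numpy as np

# Running totals for the sample data (integer dtype)
cumsum = np.array([100, 105, 105, 110, 118])
# Positions whose value was dropped for breaking the threshold
dropped = np.array([False, False, True, False, False])

# Cast to float first, since an integer array cannot hold NaN
marked = cumsum.astype(float)
marked[dropped] = np.nan
print(marked)
```

Using NaN also avoids a sentinel such as -1 colliding with a legitimate running total when the column contains negative numbers.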
What I ended up using:
import math

import numpy as np

def accumulate_under_threshold(values, threshold, skipped_row_value):
    output = []
    accumulated = 0
    for i, val in enumerate(values):
        if val + accumulated <= threshold:
            accumulated = val + accumulated
            output.append(accumulated)
        else:
            # This row would break the threshold, so skip it
            output.append(skipped_row_value)
            # Early exit: if even the smallest remaining value no longer fits,
            # every remaining row is skipped too
            if values[i:].min() > (threshold - accumulated):
                output.extend([skipped_row_value] * (len(values) - 1 - i))
                break
    return np.array(output)

df['acumm_with_condition'] = accumulate_under_threshold(df['Col1'].values, 120, math.nan)
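As a cross-check of the single-pass idea, here is a stripped-down sketch without the early exit, run on the question's sample data. The name `skip_cumsum` is hypothetical and not part of the original post.

```python
import math

import pandas as pd


def skip_cumsum(values, threshold):
    """Running total that skips any value that would push it past threshold."""
    total = 0
    out = []
    for v in values:
        if total + v <= threshold:
            total += v
            out.append(total)
        else:
            out.append(math.nan)  # skipped row
    return out


df = pd.DataFrame({"Col1": [100, 5, 90, 5, 8]})
df["acumm_with_condition"] = skip_cumsum(df["Col1"], threshold=120)
print(df)
```

Because the list mixes integers and NaN, pandas stores the new column as floats. The 90 is the only row skipped, which matches the expected output in the question.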