Pandas Count zeros in time series
I have a daily time series [1980 - present] where I need to check each daily timestep for zeros and systematically drop records. I would ultimately like to vectorize this solution, so I can pre-process these operations before proceeding with my analysis. If I have the dataframe df:
date name elev_exact swe
0 1990-10-30 COTTONWOOD_CREEK 2337.816 0.01524
1 1990-10-30 EMIGRANT_SUMMIT 2252.472 0.00000
2 1990-10-30 PHILLIPS_BENCH 2499.360 0.05334
3 1990-10-30 PINE_CREEK_PASS 2048.256 0.00000
4 1990-10-30 SALT_RIVER_SUMMIT 2328.672 0.00000
5 1990-10-30 SEDGWICK_PEAK 2392.680 0.00000
6 1990-10-30 SHEEP_MTN 2026.920 0.00000
7 1990-10-30 SLUG_CREEK_DIVIDE 2202.180 0.00000
8 1990-10-30 SOMSEN_RANCH 2072.640 0.00000
9 1990-10-30 WILDHORSE_DIVIDE 1978.152 0.00000
10 1990-10-30 WILLOW_CREEK 2462.784 0.01778
11 1991-03-15 COTTONWOOD_CREEK 2337.816 0.41910
12 1991-03-15 EMIGRANT_SUMMIT 2252.472 0.42418
13 1991-03-15 PHILLIPS_BENCH 2499.360 0.52832
14 1991-03-15 PINE_CREEK_PASS 2048.256 0.32258
15 1991-03-15 SALT_RIVER_SUMMIT 2328.672 0.23876
16 1991-03-15 SEDGWICK_PEAK 2392.680 0.39878
17 1991-03-15 SHEEP_MTN 2026.920 0.31242
18 1991-03-15 SLUG_CREEK_DIVIDE 2202.180 0.29464
19 1991-03-15 SOMSEN_RANCH 2072.640 0.29972
20 1991-03-15 WILDHORSE_DIVIDE 1978.152 0.35052
21 1991-03-15 WILLOW_CREEK 2462.784 0.60706
22 1991-10-25 COTTONWOOD_CREEK 2337.816 0.01270
23 1991-10-25 EMIGRANT_SUMMIT 2252.472 0.01016
24 1991-10-25 PHILLIPS_BENCH 2499.360 0.02286
25 1991-10-25 PINE_CREEK_PASS 2048.256 0.00508
26 1991-10-25 SALT_RIVER_SUMMIT 2328.672 0.01016
27 1991-10-25 SEDGWICK_PEAK 2392.680 0.00254
28 1991-10-25 SHEEP_MTN 2026.920 0.00000
29 1991-10-25 SLUG_CREEK_DIVIDE 2202.180 0.00762
30 1991-10-25 SOMSEN_RANCH 2072.640 0.00000
31 1991-10-25 WILDHORSE_DIVIDE 1978.152 0.00508
32 1991-10-25 WILLOW_CREEK 2462.784 0.02032
The problem is that I want to find days with more than one zero swe measurement and keep only the zero observation with the largest elev_exact. I then need to merge the desired zero record back into df.
Here is a groupby loop that would achieve what I want:
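import pandas as pd

result = pd.DataFrame()
for name, group in df.groupby('date'):
    # rows with a non-zero swe reading on this date
    non_zero = group.where(group.swe > 0).dropna()
    if not group.equals(non_zero):
        # the date has zeros: keep only the zero row at the highest elevation
        zeros = group.where(group.swe == 0).dropna()
        zero_kept = zeros.loc[[zeros.elev_exact.idxmax()]]
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        out = pd.concat([non_zero, zero_kept])
        # drop every row below the kept zero station
        out = out[out.elev_exact >= zero_kept.elev_exact.iloc[0]]
        result = pd.concat([result, out])
    else:
        result = pd.concat([result, non_zero])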
I don't mind using groupby, but I would like to use it a little more methodically so that I don't need the inner if-else branch inside the loop.
Here is how I am thinking about the problem:
Step 1: count the zeros on each date
import numpy as np
zero_count = df.groupby('date').apply(lambda x: np.count_nonzero(x==0))
zero_count = zero_count.where(zero_count >1).dropna()
Step 2: separate dates using zero_count > 1
zero_fix = zero_count.where(zero_count >1).dropna()
fixes = df[df.date.isin(zero_fix.index)].dropna()
fixes = fixes.loc[fixes[fixes.swe==0].groupby('date')['elev_exact'].idxmax().to_list()]
Step 3: apply the found elevation threshold back to df
# lu_dict is not defined in the post; the next line is an assumed definition (date -> elev_exact of the kept zero row)
lu_dict = fixes.set_index('date')['elev_exact'].to_dict()
df.loc[:,'threshold'] = df.date.map(lu_dict)
df = df.replace(np.nan, 0)
df = df[df.elev_exact >= df.threshold].drop('threshold', axis=1)
This also works, but the lambda function in step 1 is pretty slow. Is there another way to count zeros?
Expected output:
date name elev_exact swe
2 1990-10-30 PHILLIPS_BENCH 2499.360 0.05334
5 1990-10-30 SEDGWICK_PEAK 2392.680 0.00000
10 1990-10-30 WILLOW_CREEK 2462.784 0.01778
11 1991-03-15 COTTONWOOD_CREEK 2337.816 0.41910
12 1991-03-15 EMIGRANT_SUMMIT 2252.472 0.42418
13 1991-03-15 PHILLIPS_BENCH 2499.360 0.52832
14 1991-03-15 PINE_CREEK_PASS 2048.256 0.32258
15 1991-03-15 SALT_RIVER_SUMMIT 2328.672 0.23876
16 1991-03-15 SEDGWICK_PEAK 2392.680 0.39878
17 1991-03-15 SHEEP_MTN 2026.920 0.31242
18 1991-03-15 SLUG_CREEK_DIVIDE 2202.180 0.29464
19 1991-03-15 SOMSEN_RANCH 2072.640 0.29972
20 1991-03-15 WILDHORSE_DIVIDE 1978.152 0.35052
21 1991-03-15 WILLOW_CREEK 2462.784 0.60706
22 1991-10-25 COTTONWOOD_CREEK 2337.816 0.01270
23 1991-10-25 EMIGRANT_SUMMIT 2252.472 0.01016
24 1991-10-25 PHILLIPS_BENCH 2499.360 0.02286
26 1991-10-25 SALT_RIVER_SUMMIT 2328.672 0.01016
27 1991-10-25 SEDGWICK_PEAK 2392.680 0.00254
29 1991-10-25 SLUG_CREEK_DIVIDE 2202.180 0.00762
30 1991-10-25 SOMSEN_RANCH 2072.640 0.00000
32 1991-10-25 WILLOW_CREEK 2462.784 0.02032
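On the narrower question of counting zeros without the lambda, one option is to build a boolean mask with eq(0) and sum it per date, since True counts as 1. The sketch below is only a minimal illustration. It assumes that only the swe column needs counting (the lambda in the question counts zeros across every column of each group), it uses a small subset of the rows shown above, and the names zero_count and multi_zero_dates are illustrative.

```python
import pandas as pd

# Small subset of the rows from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "date": ["1990-10-30"] * 3 + ["1991-03-15"] * 2 + ["1991-10-25"] * 2,
    "name": ["PHILLIPS_BENCH", "PINE_CREEK_PASS", "SEDGWICK_PEAK",
             "COTTONWOOD_CREEK", "WILLOW_CREEK",
             "SOMSEN_RANCH", "WILLOW_CREEK"],
    "elev_exact": [2499.360, 2048.256, 2392.680,
                   2337.816, 2462.784,
                   2072.640, 2462.784],
    "swe": [0.05334, 0.0, 0.0, 0.41910, 0.60706, 0.0, 0.02032],
})

# Boolean mask of zero swe readings, summed per date (True counts as 1).
zero_count = df["swe"].eq(0).groupby(df["date"]).sum()

# Dates that have more than one zero reading.
multi_zero_dates = zero_count[zero_count > 1].index

print(zero_count)
print(list(multi_zero_dates))
```

Because the mask and the sum are both computed column-wise, no Python function is called per group, which is usually where the apply-with-lambda version spends its time.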
You can try this: split the dataframe into non-zero and zero rows, then sort the zero rows by elev_exact in descending order and use drop_duplicates with subset on the date column, which keeps only the highest-elevation zero row for each date. Lastly, use pd.concat to join the two dataframes back together and sort by index:
df_nonzeroes = df[df['swe'].ne(0)]
df_zeroes = df[df['swe'].eq(0)].sort_values('elev_exact', ascending=False).drop_duplicates(subset=['date'])
df_out = pd.concat([df_nonzeroes, df_zeroes]).sort_index()
print(df_out)
Output:
date name elev_exact swe
0 1990-10-30 COTTONWOOD_CREEK 2337.816 0.01524
2 1990-10-30 PHILLIPS_BENCH 2499.360 0.05334
5 1990-10-30 SEDGWICK_PEAK 2392.680 0.00000
10 1990-10-30 WILLOW_CREEK 2462.784 0.01778
11 1991-03-15 COTTONWOOD_CREEK 2337.816 0.41910
12 1991-03-15 EMIGRANT_SUMMIT 2252.472 0.42418
13 1991-03-15 PHILLIPS_BENCH 2499.360 0.52832
14 1991-03-15 PINE_CREEK_PASS 2048.256 0.32258
15 1991-03-15 SALT_RIVER_SUMMIT 2328.672 0.23876
16 1991-03-15 SEDGWICK_PEAK 2392.680 0.39878
17 1991-03-15 SHEEP_MTN 2026.920 0.31242
18 1991-03-15 SLUG_CREEK_DIVIDE 2202.180 0.29464
19 1991-03-15 SOMSEN_RANCH 2072.640 0.29972
20 1991-03-15 WILDHORSE_DIVIDE 1978.152 0.35052
21 1991-03-15 WILLOW_CREEK 2462.784 0.60706
22 1991-10-25 COTTONWOOD_CREEK 2337.816 0.01270
23 1991-10-25 EMIGRANT_SUMMIT 2252.472 0.01016
24 1991-10-25 PHILLIPS_BENCH 2499.360 0.02286
25 1991-10-25 PINE_CREEK_PASS 2048.256 0.00508
26 1991-10-25 SALT_RIVER_SUMMIT 2328.672 0.01016
27 1991-10-25 SEDGWICK_PEAK 2392.680 0.00254
29 1991-10-25 SLUG_CREEK_DIVIDE 2202.180 0.00762
30 1991-10-25 SOMSEN_RANCH 2072.640 0.00000
31 1991-10-25 WILDHORSE_DIVIDE 1978.152 0.00508
32 1991-10-25 WILLOW_CREEK 2462.784 0.02032
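Note that this output still contains rows such as COTTONWOOD_CREEK on 1990-10-30, which are absent from the expected output in the question. The loop in the question also drops non-zero rows that sit below the kept zero station. If that elevation filter is wanted, one possible vectorized sketch is to compute the highest zero-swe elevation per date and compare every row against it. This is an interpretation of the question's loop, not part of the answer above; the frame is a hand-built subset of the question's rows, and threshold, row_threshold and df_out are illustrative names.

```python
import numpy as np
import pandas as pd

# Hand-built subset of the question's rows, with a default 0..9 index.
df = pd.DataFrame({
    "date": ["1990-10-30"] * 4 + ["1991-03-15"] * 2 + ["1991-10-25"] * 4,
    "name": ["COTTONWOOD_CREEK", "PHILLIPS_BENCH", "PINE_CREEK_PASS", "SEDGWICK_PEAK",
             "SHEEP_MTN", "WILLOW_CREEK",
             "SHEEP_MTN", "SOMSEN_RANCH", "WILDHORSE_DIVIDE", "WILLOW_CREEK"],
    "elev_exact": [2337.816, 2499.360, 2048.256, 2392.680,
                   2026.920, 2462.784,
                   2026.920, 2072.640, 1978.152, 2462.784],
    "swe": [0.01524, 0.05334, 0.0, 0.0,
            0.31242, 0.60706,
            0.0, 0.0, 0.00508, 0.02032],
})

# Elevation of the highest zero-swe station on each date that has zeros.
threshold = df.loc[df["swe"].eq(0)].groupby("date")["elev_exact"].max()

# Broadcast the per-date threshold to every row; dates without zeros
# get -inf so that all of their rows pass the comparison.
row_threshold = df["date"].map(threshold).fillna(-np.inf)

# Keep rows at or above the threshold. Lower zero rows and lower
# non-zero rows are dropped; the highest zero row equals the threshold.
df_out = df[df["elev_exact"].ge(row_threshold)]

print(df_out)
```

One caveat: if two zero rows on the same date share the maximum elevation, both are kept here, whereas idxmax in the loop keeps a single row.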