Pandas: counting zeros in a time series

Question

I have a daily time series (1980 to present) in which I need to check each daily timestep for zeros and systematically drop records. I would ultimately like to vectorize this solution so that I can run these pre-processing steps before proceeding with my analysis. Suppose I have the dataframe df:

         date               name  elev_exact      swe
0  1990-10-30   COTTONWOOD_CREEK    2337.816  0.01524
1  1990-10-30    EMIGRANT_SUMMIT    2252.472  0.00000
2  1990-10-30     PHILLIPS_BENCH    2499.360  0.05334
3  1990-10-30    PINE_CREEK_PASS    2048.256  0.00000
4  1990-10-30  SALT_RIVER_SUMMIT    2328.672  0.00000
5  1990-10-30      SEDGWICK_PEAK    2392.680  0.00000
6  1990-10-30          SHEEP_MTN    2026.920  0.00000
7  1990-10-30  SLUG_CREEK_DIVIDE    2202.180  0.00000
8  1990-10-30       SOMSEN_RANCH    2072.640  0.00000
9  1990-10-30   WILDHORSE_DIVIDE    1978.152  0.00000
10 1990-10-30       WILLOW_CREEK    2462.784  0.01778
11 1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910
12 1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418
13 1991-03-15     PHILLIPS_BENCH    2499.360  0.52832
14 1991-03-15    PINE_CREEK_PASS    2048.256  0.32258
15 1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876
16 1991-03-15      SEDGWICK_PEAK    2392.680  0.39878
17 1991-03-15          SHEEP_MTN    2026.920  0.31242
18 1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464
19 1991-03-15       SOMSEN_RANCH    2072.640  0.29972
20 1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052
21 1991-03-15       WILLOW_CREEK    2462.784  0.60706
22 1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270
23 1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016
24 1991-10-25     PHILLIPS_BENCH    2499.360  0.02286
25 1991-10-25    PINE_CREEK_PASS    2048.256  0.00508
26 1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016
27 1991-10-25      SEDGWICK_PEAK    2392.680  0.00254
28 1991-10-25          SHEEP_MTN    2026.920  0.00000
29 1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762
30 1991-10-25       SOMSEN_RANCH    2072.640  0.00000
31 1991-10-25   WILDHORSE_DIVIDE    1978.152  0.00508
32 1991-10-25       WILLOW_CREEK    2462.784  0.02032

The problem is that I want to find days with more than one zero swe measurement and keep only the zero observation with the largest elev_exact. I then need to merge that zero record back into df.

Here is a groupby loop that achieves what I want:

import pandas as pd

pieces = []
for date, group in df.groupby('date'):
    non_zero = group[group.swe > 0]
    zeros = group[group.swe == 0]

    if not zeros.empty:
        # keep only the zero record with the highest elevation for this date
        zero_kept = zeros.loc[[zeros.elev_exact.idxmax()]]
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        out = pd.concat([non_zero, zero_kept])
        # drop every record below the kept zero record's elevation
        out = out[out.elev_exact >= zero_kept.elev_exact.iloc[0]]
        pieces.append(out)
    else:
        pieces.append(non_zero)

result = pd.concat(pieces)

I don't mind using groupby, but I would like to use it a little more methodically so that I don't need the inner if-else branch.

Here is how I am thinking about the problem:

For each daily timestep, I want to find where there is more than one zero measurement:

import numpy as np

zero_count = df.groupby('date').apply(lambda x: np.count_nonzero(x == 0))

Select the dates where zero_count > 1:

zero_fix = zero_count.where(zero_count > 1).dropna()

Find the maximum zero-swe elevation for each day with multiple zeros:

fixes = df[df.date.isin(zero_fix.index)].dropna()
fixes = fixes.loc[fixes[fixes.swe==0].groupby('date')['elev_exact'].idxmax().to_list()]

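Apply the found elevation thresholds back to df:

# lu_dict is not defined in the original post; it is assumed here to map
# each date to the elevation of the zero record kept for that date
lu_dict = fixes.set_index('date')['elev_exact'].to_dict()
# dates without a threshold get 0, so all of their records pass the filter
df.loc[:, 'threshold'] = df.date.map(lu_dict).fillna(0)
df = df[df.elev_exact >= df.threshold].drop('threshold', axis=1)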

This also works, but the lambda function in step 1 is pretty slow. Is there another way to count zeros?

Expected output:
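On the counting step specifically, a boolean mask summed per group avoids calling a Python lambda for every date. The following is a minimal sketch, assuming the zeros of interest are only in the swe column; the small frame is made up for illustration and is not the real data:

```python
import pandas as pd

# Small made-up frame with the same columns as df (values are illustrative only).
df = pd.DataFrame({
    "date": ["1990-10-30"] * 3 + ["1991-03-15"] * 2 + ["1991-10-25"] * 3,
    "name": ["A", "B", "C", "A", "B", "A", "B", "C"],
    "elev_exact": [2337.8, 2252.5, 2048.3, 2337.8, 2252.5, 2337.8, 2252.5, 2048.3],
    "swe": [0.015, 0.0, 0.0, 0.419, 0.424, 0.013, 0.0, 0.005],
})

# True where swe is exactly zero; summing the booleans per date counts the zeros.
zero_count = df["swe"].eq(0).groupby(df["date"]).sum()

# Dates with more than one zero measurement.
zero_fix = zero_count[zero_count > 1]
print(zero_fix)
```

Because the comparison and the grouped sum both run inside pandas, this should scale better than apply with a lambda, although the actual speedup depends on the data.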

          date               name  elev_exact      swe
2   1990-10-30     PHILLIPS_BENCH    2499.360  0.05334
5   1990-10-30      SEDGWICK_PEAK    2392.680  0.00000
10  1990-10-30       WILLOW_CREEK    2462.784  0.01778
11  1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910
12  1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418
13  1991-03-15     PHILLIPS_BENCH    2499.360  0.52832
14  1991-03-15    PINE_CREEK_PASS    2048.256  0.32258
15  1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876
16  1991-03-15      SEDGWICK_PEAK    2392.680  0.39878
17  1991-03-15          SHEEP_MTN    2026.920  0.31242
18  1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464
19  1991-03-15       SOMSEN_RANCH    2072.640  0.29972
20  1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052
21  1991-03-15       WILLOW_CREEK    2462.784  0.60706
22  1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270
23  1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016
24  1991-10-25     PHILLIPS_BENCH    2499.360  0.02286
26  1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016
27  1991-10-25      SEDGWICK_PEAK    2392.680  0.00254
29  1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762
30  1991-10-25       SOMSEN_RANCH    2072.640  0.00000
32  1991-10-25       WILLOW_CREEK    2462.784  0.02032

Answer 1

You can try this: split the dataframe into non-zero and zero rows, sort the zero rows by elev_exact in descending order, and use drop_duplicates with subset on the date column to keep only the highest zero record per date. Lastly, use pd.concat to join the two dataframes back together and sort by index:

df_nonzeroes = df[df['swe'].ne(0)]
df_zeroes = df[df['swe'].eq(0)].sort_values('elev_exact', ascending=False).drop_duplicates(subset=['date'])

df_out = pd.concat([df_nonzeroes, df_zeroes]).sort_index()
print(df_out)

Output:

          date               name  elev_exact      swe
0   1990-10-30   COTTONWOOD_CREEK    2337.816  0.01524
2   1990-10-30     PHILLIPS_BENCH    2499.360  0.05334
5   1990-10-30      SEDGWICK_PEAK    2392.680  0.00000
10  1990-10-30       WILLOW_CREEK    2462.784  0.01778
11  1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910
12  1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418
13  1991-03-15     PHILLIPS_BENCH    2499.360  0.52832
14  1991-03-15    PINE_CREEK_PASS    2048.256  0.32258
15  1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876
16  1991-03-15      SEDGWICK_PEAK    2392.680  0.39878
17  1991-03-15          SHEEP_MTN    2026.920  0.31242
18  1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464
19  1991-03-15       SOMSEN_RANCH    2072.640  0.29972
20  1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052
21  1991-03-15       WILLOW_CREEK    2462.784  0.60706
22  1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270
23  1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016
24  1991-10-25     PHILLIPS_BENCH    2499.360  0.02286
25  1991-10-25    PINE_CREEK_PASS    2048.256  0.00508
26  1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016
27  1991-10-25      SEDGWICK_PEAK    2392.680  0.00254
29  1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762
30  1991-10-25       SOMSEN_RANCH    2072.640  0.00000
31  1991-10-25   WILDHORSE_DIVIDE    1978.152  0.00508
32  1991-10-25       WILLOW_CREEK    2462.784  0.02032
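
Note that this output still contains rows 0, 25 and 31, which the question's expected output drops because they sit below the elevation of the zero record kept for their date. If that elevation cut is also wanted, one possible vectorized sketch is to compute, per date, the highest elevation with swe == 0 and keep only rows at or above it. The helper name keep_above_highest_zero is made up for illustration, and the sample frame reuses a few rows from the question. Unlike the loop, ties at the highest zero elevation would all be kept, and a date with a single zero also gets a threshold.

```python
import pandas as pd

def keep_above_highest_zero(df):
    """Hypothetical helper: per date, keep rows at or above the highest zero-swe elevation."""
    # Elevation where swe == 0, NaN elsewhere.
    zero_elev = df["elev_exact"].where(df["swe"].eq(0))
    # Highest zero-swe elevation per date, broadcast back to every row of that date.
    threshold = zero_elev.groupby(df["date"]).transform("max")
    # Dates without any zero have a NaN threshold and are kept in full.
    return df[threshold.isna() | df["elev_exact"].ge(threshold)]

# A few rows taken from the question's sample data.
df = pd.DataFrame({
    "date": ["1990-10-30"] * 5 + ["1991-03-15"] * 2,
    "name": ["COTTONWOOD_CREEK", "EMIGRANT_SUMMIT", "PHILLIPS_BENCH",
             "SEDGWICK_PEAK", "WILLOW_CREEK",
             "COTTONWOOD_CREEK", "EMIGRANT_SUMMIT"],
    "elev_exact": [2337.816, 2252.472, 2499.360, 2392.680, 2462.784,
                   2337.816, 2252.472],
    "swe": [0.01524, 0.0, 0.05334, 0.0, 0.01778, 0.41910, 0.42418],
})

df_out = keep_above_highest_zero(df)
print(df_out)
```

On 1990-10-30 the highest zero record is SEDGWICK_PEAK, so COTTONWOOD_CREEK and EMIGRANT_SUMMIT fall below the cut, while 1991-03-15 has no zeros and is left untouched.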

Answer 1 (score: 2, accepted, 2019-12-11 16:05:58)
