Days since the last occurrence in a Pandas DataFrame?

Question

Let's say I have a Pandas DataFrame df:

Date      Value
01/01/17  0
01/02/17  0
01/03/17  1
01/04/17  0
01/05/17  0
01/06/17  0
01/07/17  1
01/08/17  0
01/09/17  0

For each row, I want to efficiently calculate the number of days since the last occurrence of Value=1.

So that df becomes:

Date      Value    Last_Occurence
01/01/17  0        NaN
01/02/17  0        NaN
01/03/17  1        0
01/04/17  0        1
01/05/17  0        2
01/06/17  0        3
01/07/17  1        0
01/08/17  0        1
01/09/17  0        2

I could do a loop:

for i in range(0, len(df)):
    last = np.where(df.loc[0:i,'Value']==1)
    df.loc[i, 'Last_Occurence'] = i-last

But it seems very inefficient for extremely large data sets, and it probably isn't right anyway.

Answer 1

Here's a NumPy approach:

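import numpy as np

def intervaled_cumsum(a, trigger_val=1, start_val=0, invalid_specifier=-1):
    # Start with a step of 1 everywhere, so a cumulative sum counts upward.
    out = np.ones(a.size, dtype=int)
    idx = np.flatnonzero(a == trigger_val)
    if len(idx) == 0:
        # No trigger anywhere: every position is invalid.
        return np.full(a.size, invalid_specifier)
    else:
        # At each trigger, store a negative step that resets the running count.
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        # Positions before the first trigger have no previous occurrence.
        out[:idx[0]] = invalid_specifier
        return out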

A few sample runs on array data showcase the usage across various combinations of trigger and start values:

In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])

In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
     ...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
     ...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
     ...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
     ...: 

In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]: 
array([[ 0,  1,  1,  1,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,  0],
       [-1,  0,  0,  0,  1,  2,  0,  1,  2,  0,  0,  0,  0,  0,  1],
       [-1,  1,  1,  1,  2,  3,  1,  2,  3,  1,  1,  1,  1,  1,  2],
       [ 0,  1,  2,  3,  0,  0,  1,  0,  0,  1,  2,  3,  4,  5,  0],
       [ 1,  2,  3,  4,  1,  1,  2,  1,  1,  2,  3,  4,  5,  6,  1]])

Using it to solve our case:

df['Last_Occurence'] = intervaled_cumsum(df.Value.values)

Sample output:

In [181]: df
Out[181]: 
       Date  Value  Last_Occurence
0  01/01/17      0              -1
1  01/02/17      0              -1
2  01/03/17      1               0
3  01/04/17      0               1
4  01/05/17      0               2
5  01/06/17      0               3
6  01/07/17      1               0
7  01/08/17      0               1
8  01/09/17      0               2
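If NaN is preferred over -1 for the leading rows, as in the question's expected output, a running maximum is another way to get there. The sketch below is not the function from this answer: rows_since_last is a hypothetical name, and it swaps the reset-cumsum trick for np.maximum.accumulate, which carries the position of the most recent trigger forward.

```python
import numpy as np

def rows_since_last(a, trigger_val=1):
    """Rows elapsed since the most recent trigger; NaN before the first one."""
    a = np.asarray(a)
    pos = np.arange(a.size)
    # Record the position of each trigger, and -1 everywhere else.
    marked = np.where(a == trigger_val, pos, -1)
    # The running maximum carries the latest trigger position forward.
    last = np.maximum.accumulate(marked)
    out = (pos - last).astype(float)
    # A running maximum of -1 means no trigger has been seen yet.
    out[last < 0] = np.nan
    return out

values = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0])
result = rows_since_last(values)
print(result)
```

The result is a float array because NaN cannot be stored in an integer array.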

Runtime test

Approaches:

# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0,False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).\
                                    cumsum()).cumcount().where(mask))

# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)

Timings:

In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=['Value'])

In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop

In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop

In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=['Value'])

In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop

In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop

Answer 2

Let's try this using cumsum, cumcount, and groupby:

mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)

Output:

       Date  Value  Last_Occurance
0  01/01/17      0             NaN
1  01/02/17      0             NaN
2  01/03/17      1             0.0
3  01/04/17      0             1.0
4  01/05/17      0             2.0
5  01/06/17      0             3.0
6  01/07/17      1             0.0
7  01/08/17      0             1.0
8  01/09/17      0             2.0
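The one-liner above packs three steps together. A minimal sketch that spells them out on the question's sample data may make it easier to follow; the intermediate names group_id and position are only illustrative, and the sample frame is rebuilt here without the Date column so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})

# Every 1 opens a new group, so the running count of 1s labels the rows
# 0, 0, 1, 1, 1, 1, 2, 2, 2.
group_id = df["Value"].eq(1).cumsum()

# cumcount numbers the rows inside each group starting from 0.
position = df.groupby(group_id).cumcount()

# Group 0 holds the rows before the first 1, which have no previous
# occurrence; where() replaces them with NaN.
df["Last_Occurence"] = position.where(group_id > 0)
print(df)
```

Because where() introduces NaN, the new column is float, which matches the 0.0, 1.0, ... values in the output above.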

Answer 3

You don't have to recompute last at every step of the for loop. Initialize a variable outside the loop

last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last

and update it only when a 1 occurs in the Value column.

Note that no matter which method you select, iterating over the whole table once is inevitable.
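Row-by-row df.loc access tends to be slow on large frames, so the same single pass can be run over a plain array and assigned back in one step. This is only a sketch of that idea: rows_since_last_loop is a hypothetical helper name, and it counts rows, which equals days only when the dates are consecutive as in the question.

```python
import numpy as np
import pandas as pd

def rows_since_last_loop(values, trigger_val=1):
    """Single pass: remember where the last trigger was and subtract."""
    out = np.full(len(values), np.nan)
    last = None  # position of the most recent trigger, if any
    for i, v in enumerate(values):
        if v == trigger_val:
            last = i
        if last is not None:
            out[i] = i - last
    return out

df = pd.DataFrame({"Value": [0, 0, 1, 0, 0, 0, 1, 0, 0]})
df["Last_Occurence"] = rows_since_last_loop(df["Value"].to_numpy())
print(df)
```

Unlike the loop above, this version does not depend on the frame having the default 0..n-1 index, because it never looks rows up by label.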

Answer 4

You can use argmax:

df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]: 
0    0
1    0
2    0
3    1
4    2
5    3
6    0
7    1
8    2
dtype: int64

If you need NaN for the first two rows, use:

df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
                   if 1 in df.iloc[x.name::-1].Value.values \
                   else np.nan,axis=1)
Out[86]: 
0    NaN
1    NaN
2    0.0
3    1.0
4    2.0
5    3.0
6    0.0
7    1.0
8    2.0
dtype: float64
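To see why this works, it may help to look at a single row in isolation. The sketch below uses a plain list in place of the Value column (the variable names are illustrative): slicing backwards from row i puts the current row first, and argmax returns the index of the first maximum, which is the number of steps back to the nearest 1.

```python
import numpy as np

values = [0, 0, 1, 0, 0, 0, 1, 0, 0]

i = 5
# Walk backwards from row i to row 0: rows 5, 4, 3, 2, 1, 0.
reversed_prefix = values[i::-1]
# argmax returns the first position holding the maximum (the nearest 1).
steps_back = int(np.argmax(reversed_prefix))
print(reversed_prefix, steps_back)

# With no 1 in the prefix, the maximum is 0 and argmax returns position 0,
# which is why the unguarded version reports 0 for the first two rows.
no_hit = int(np.argmax(values[1::-1]))
print(no_hit)
```

Since every row re-slices everything before it, the work grows quadratically with the number of rows, so this approach is best kept for small frames.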
