
Code takes too long to run when looping through a dataframe

signal = pd.DataFrame([[0, 0, 0],
                       [-1, -1, -1],
                       [1, 0, 0],
                       [0, 0, 0],
                       [1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, -1, 1],
                       [-1, 0, 0],
                       [0, 0, 0]], columns=['TKV', 'SWP', 'BWN'], index=date_index)

def remove_duplicate(df, lookahead_days):
    df = df.copy()
    df.index = pd.to_datetime(df.index)
    # step through the frame in windows of lookahead_days rows
    for i in range(0, df.shape[0], lookahead_days - 1):
        date_range = df.index[i:i + lookahead_days]
        for col in df.columns:
            duplicates = df[col][date_range].duplicated(keep="first")
            duplicates_index = df[col][date_range][duplicates].index
            df.loc[duplicates_index, col] = 0
    df.index = df.index.date
    return df

My objective is to loop through the signal dataframe within a window of days (lookahead_days), check whether duplicates exist, and set the later ones to zero, keeping only the first.

I have done that with the function above; the problem is that it takes too long to run when I pass it the real dataframe, which has a shape of about 1000×500.

I'm wondering if there is a better way I should have done this.

Setup:

from pandas import Timestamp
signal = pd.DataFrame({'TKV': {Timestamp('2018-01-01 00:00:00'): 0, Timestamp('2018-01-02 00:00:00'): -1, Timestamp('2018-01-03 00:00:00'): 1, Timestamp('2018-01-04 00:00:00'): 0, Timestamp('2018-01-05 00:00:00'): 1, Timestamp('2018-01-06 00:00:00'): 0, Timestamp('2018-01-07 00:00:00'): 0, Timestamp('2018-01-08 00:00:00'): 0, Timestamp('2018-01-09 00:00:00'): -1, Timestamp('2018-01-10 00:00:00'): 0}, 'SWP': {Timestamp('2018-01-01 00:00:00'): 0, Timestamp('2018-01-02 00:00:00'): -1, Timestamp('2018-01-03 00:00:00'): 0, Timestamp('2018-01-04 00:00:00'): 0, Timestamp('2018-01-05 00:00:00'): 0, Timestamp('2018-01-06 00:00:00'): 1, Timestamp('2018-01-07 00:00:00'): 0, Timestamp('2018-01-08 00:00:00'): -1, Timestamp('2018-01-09 00:00:00'): 0, Timestamp('2018-01-10 00:00:00'): 0}, 'BWN': {Timestamp('2018-01-01 00:00:00'): 0, Timestamp('2018-01-02 00:00:00'): -1, Timestamp('2018-01-03 00:00:00'): 0, Timestamp('2018-01-04 00:00:00'): 0, Timestamp('2018-01-05 00:00:00'): 0, Timestamp('2018-01-06 00:00:00'): 0, Timestamp('2018-01-07 00:00:00'): 1, Timestamp('2018-01-08 00:00:00'): 1, Timestamp('2018-01-09 00:00:00'): 0, Timestamp('2018-01-10 00:00:00'): 0}})

You can use drop_duplicates here; the tricky part is that you need to create a column that will never produce duplicates outside of each n-day period (or whatever time grouping you decide on). Say you want to drop duplicates that appear within a 5-day period: we need to create a column that is repeated within each of these periods, which we can then use as a key for drop_duplicates:

s = (signal.reset_index()
        .groupby(pd.Grouper(freq='5d', key='index'))
        ['index'].transform('first')
    )

0   2018-01-01
1   2018-01-01
2   2018-01-01
3   2018-01-01
4   2018-01-01
5   2018-01-06
6   2018-01-06
7   2018-01-06
8   2018-01-06
9   2018-01-06
Name: index, dtype: datetime64[ns]

This gives us a column that is always the same within each 5-day period, but distinguishes rows from different periods when checking for duplicates. Now all we have to do is drop duplicates based on our "flag" column together with the other columns we are checking:

signal.assign(flag=s.values).drop_duplicates(['flag', 'TKV', 'SWP', 'BWN']).drop('flag', axis=1)

            TKV  SWP  BWN
2018-01-01    0    0    0
2018-01-02   -1   -1   -1
2018-01-03    1    0    0
2018-01-06    0    1    0
2018-01-07    0    0    1
2018-01-08    0   -1    1
2018-01-09   -1    0    0
2018-01-10    0    0    0

If instead of dropping duplicates you'd like to simply replace them with 0, you can make use of duplicated here:

tmp = signal.assign(flag=s.values)
tmp[tmp.duplicated()] = 0
tmp = tmp.drop('flag', axis=1)

            TKV  SWP  BWN
2018-01-01    0    0    0
2018-01-02   -1   -1   -1
2018-01-03    1    0    0
2018-01-04    0    0    0
2018-01-05    0    0    0
2018-01-06    0    1    0
2018-01-07    0    0    1
2018-01-08    0   -1    1
2018-01-09   -1    0    0
2018-01-10    0    0    0

This results in the last two entries of the first group being dropped (or zeroed), as they were duplicates within that period, but not the rows in the second group, even though the same values appeared in the first.
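Note that both snippets above treat a row as a duplicate only when the whole (TKV, SWP, BWN) tuple repeats within a window, while the original loop zeroes repeats independently in each column. If you need that per-column behaviour, one vectorised sketch (assuming a DatetimeIndex and the fixed 5-day windows used above; the function name is my own) could be:

```python
import pandas as pd

def zero_duplicates_per_column(df, lookahead_days=5):
    """Zero out values that repeat within each fixed window, per column."""
    out = df.copy()
    grouper = pd.Grouper(freq=f'{lookahead_days}d')
    for col in out.columns:
        # True for every value already seen earlier in the same window
        dup = out.groupby(grouper)[col].transform(
            lambda s: s.duplicated(keep='first')
        ).astype(bool)
        out.loc[dup, col] = 0
    return out

signal = pd.DataFrame(
    [[0, 0, 0], [-1, -1, -1], [1, 0, 0], [0, 0, 0], [1, 0, 0],
     [0, 1, 0], [0, 0, 1], [0, -1, 1], [-1, 0, 0], [0, 0, 0]],
    columns=['TKV', 'SWP', 'BWN'],
    index=pd.date_range('2018-01-01', periods=10),
)
print(zero_duplicates_per_column(signal))
```

This still loops over columns, but each column is handled with a single vectorised groupby rather than a Python loop over windows, so it scales far better on a 1000×500 frame.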

This should be much more performant than your loop-based option:

signal = pd.concat([signal]*2000)
signal = signal.reset_index(drop=True).set_index(pd.date_range(start='1995-01-01', periods=20000))

In [445]: %%timeit
     ...: s = (signal.reset_index().groupby(pd.Grouper(freq='5d', key='index'))['index'].transform('first'))
     ...: signal.assign(flag=s.values).drop_duplicates(['flag', 'TKV', 'SWP', 'BWN']).drop('flag', axis=1)
     ...:
9.5 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [482]: %%timeit
     ...: s = (signal.reset_index().groupby(pd.Grouper(freq='5d', key='index'))['index'].transform('first'))
     ...: tmp = signal.assign(flag=s.values)
     ...: tmp[tmp.duplicated()] = 0
     ...: tmp = tmp.drop('flag', axis=1)
56.4 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
