
Code takes too long to run when looping through a dataframe

import pandas as pd

date_index = pd.date_range('2018-01-01', periods=10)  # daily index used below

signal = pd.DataFrame([[ 0,  0,  0],
                       [-1, -1, -1],
                       [ 1,  0,  0],
                       [ 0,  0,  0],
                       [ 1,  0,  0],
                       [ 0,  1,  0],
                       [ 0,  0,  1],
                       [ 0, -1,  1],
                       [-1,  0,  0],
                       [ 0,  0,  0]], columns=['TKV', 'SWP', 'BWN'], index=date_index)

def remove_duplicate(df, lookahead_days):
    df = df.copy()
    df.index = pd.to_datetime(df.index)
    # slide a window of lookahead_days rows over the frame
    for i in range(0, df.shape[0], lookahead_days - 1):
        date_range = df.index[i:i + lookahead_days]
        for col in df.columns:
            # zero out values that repeat an earlier value in the
            # same column within the current window
            duplicates = df[col][date_range].duplicated(keep="first")
            duplicates_index = df[col][date_range][duplicates].index
            df.loc[duplicates_index, col] = 0
    df.index = df.index.date
    return df

My objective is to loop through the signal dataframe within a window of days (lookahead_days), check whether duplicates exist, and turn the later ones to zero, keeping only the first.

I have done that with the function above; the problem is that it takes too long to run when I pass in the real dataframe, which has a shape of about 1000×500.

I'm wondering if there's a better way I could have done this.

Setup:

import pandas as pd
from pandas import Timestamp
signal = pd.DataFrame({'TKV': {Timestamp('2018-01-01 00:00:00'): 0, Timestamp('2018-01-02 00:00:00'): -1, Timestamp('2018-01-03 00:00:00'): 1, Timestamp('2018-01-04 00:00:00'): 0, Timestamp('2018-01-05 00:00:00'): 1, Timestamp('2018-01-06 00:00:00'): 0, Timestamp('2018-01-07 00:00:00'): 0, Timestamp('2018-01-08 00:00:00'): 0, Timestamp('2018-01-09 00:00:00'): -1, Timestamp('2018-01-10 00:00:00'): 0}, 'SWP': {Timestamp('2018-01-01 00:00:00'): 0, Timestamp('2018-01-02 00:00:00'): -1, Timestamp('2018-01-03 00:00:00'): 0, Timestamp('2018-01-04 00:00:00'): 0, Timestamp('2018-01-05 00:00:00'): 0, Timestamp('2018-01-06 00:00:00'): 1, Timestamp('2018-01-07 00:00:00'): 0, Timestamp('2018-01-08 00:00:00'): -1, Timestamp('2018-01-09 00:00:00'): 0, Timestamp('2018-01-10 00:00:00'): 0}, 'BWN': {Timestamp('2018-01-01 00:00:00'): 0, Timestamp('2018-01-02 00:00:00'): -1, Timestamp('2018-01-03 00:00:00'): 0, Timestamp('2018-01-04 00:00:00'): 0, Timestamp('2018-01-05 00:00:00'): 0, Timestamp('2018-01-06 00:00:00'): 0, Timestamp('2018-01-07 00:00:00'): 1, Timestamp('2018-01-08 00:00:00'): 1, Timestamp('2018-01-09 00:00:00'): 0, Timestamp('2018-01-10 00:00:00'): 0}})

You can use drop_duplicates here. The tricky part is that you need a key column that is constant within each n-day period (or whatever time grouping you decide on) but never shared between periods. Say you want to drop duplicates that appear within a 5-day period: we create a column whose value repeats for each of these periods, and use it as a key for drop_duplicates:

s = (signal.reset_index()
        .groupby(pd.Grouper(freq='5d', key='index'))
        ['index'].transform('first')
    )

0   2018-01-01
1   2018-01-01
2   2018-01-01
3   2018-01-01
4   2018-01-01
5   2018-01-06
6   2018-01-06
7   2018-01-06
8   2018-01-06
9   2018-01-06
Name: index, dtype: datetime64[ns]

This gives us a column that is identical within each 5-day period but differs between periods, so it distinguishes the periods when checking for duplicates. Now all we have to do is drop duplicates based on our "flag" column together with the signal columns:

signal.assign(flag=s.values).drop_duplicates(['flag', 'TKV', 'SWP', 'BWN']).drop(columns='flag')

            TKV  SWP  BWN
2018-01-01    0    0    0
2018-01-02   -1   -1   -1
2018-01-03    1    0    0
2018-01-06    0    1    0
2018-01-07    0    0    1
2018-01-08    0   -1    1
2018-01-09   -1    0    0
2018-01-10    0    0    0

If, instead of dropping duplicates, you'd like to simply replace them with 0, you can make use of duplicated here:

tmp = signal.assign(flag=s.values)
tmp[tmp.duplicated()] = 0
tmp = tmp.drop(columns='flag')

            TKV  SWP  BWN
2018-01-01    0    0    0
2018-01-02   -1   -1   -1
2018-01-03    1    0    0
2018-01-04    0    0    0
2018-01-05    0    0    0
2018-01-06    0    1    0
2018-01-07    0    0    1
2018-01-08    0   -1    1
2018-01-09   -1    0    0
2018-01-10    0    0    0

Here the duplicated rows from the first group are zeroed out rather than dropped (compare 2018-01-05 with the previous result), while rows in the second group are left untouched even if the same values already appeared in the first group.
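Putting it together, here is a minimal sketch of a drop-in replacement for the function in the question, built from the same grouped-flag trick (remove_duplicate_fast is a hypothetical name; it assumes fixed, non-overlapping n-day bins and, like the examples above, treats a whole row as the unit of duplication):

import pandas as pd

def remove_duplicate_fast(df, lookahead_days):
    df = df.copy()
    df.index = pd.to_datetime(df.index)
    # one key per bin: the first timestamp that falls in each n-day period
    flag = (df.reset_index()
              .groupby(pd.Grouper(freq=f'{lookahead_days}d', key='index'))
              ['index'].transform('first'))
    # a row is a duplicate if the same values already appeared in its bin
    dup = df.assign(flag=flag.values).duplicated()
    df[dup.values] = 0
    return df

On the setup data, remove_duplicate_fast(signal, 5) should reproduce the zero-replacement output above.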

Either way, this approach should be much more performant than your looping version:

signal = pd.concat([signal]*2000)
signal = signal.reset_index(drop=True).set_index(pd.date_range(start='1995-01-01', periods=20000))

In [445]: %%timeit
     ...: s = (signal.reset_index().groupby(pd.Grouper(freq='5d', key='index'))['index'].transform('first'))
     ...: signal.assign(flag=s.values).drop_duplicates(['flag', 'TKV', 'SWP', 'BWN']).drop('flag', 1)
     ...:
9.5 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [482]: %%timeit
     ...: s = (signal.reset_index().groupby(pd.Grouper(freq='5d', key='index'))['index'].transform('first'))
     ...: tmp = signal.assign(flag=s.values)
     ...: tmp[tmp.duplicated()] = 0
     ...: tmp = tmp.drop('flag', 1)
56.4 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
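
One caveat: duplicated and drop_duplicates above compare entire rows, whereas the function in the question zeroes duplicates independently in each column. If the per-column semantics are what you actually need, the same grouping idea still applies; a minimal sketch under that assumption (remove_duplicate_per_column is a hypothetical name):

import pandas as pd

def remove_duplicate_per_column(df, lookahead_days):
    df = df.copy()
    df.index = pd.to_datetime(df.index)
    # duplicated() is evaluated per column within each n-day bin
    dup = (df.groupby(pd.Grouper(freq=f'{lookahead_days}d'))
             .transform(lambda s: s.duplicated(keep='first')))
    return df.mask(dup.astype(bool), 0)

The lambda keeps this short of fully vectorized, but it replaces the Python-level double loop with one vectorized duplicated call per column per bin, so it should still scale far better on a 1000×500 frame.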
