Improve performance of a for loop comparing pandas dataframe rows

Question

I'm facing a performance problem with Python/Pandas. I have a for loop comparing consecutive rows in a Pandas DataFrame:

for i in range(1, N):
    if df.column_A.iloc[i] == df.column_A.iloc[i-1]:
        if df.column_B.iloc[i] == 'START' and df.column_B.iloc[i-1] == 'STOP':
            df.time.iloc[i] = df.time.iloc[i] - df.time.iloc[i-1]

This works properly but is extremely slow. My dataframe has around 1M rows, and I'm wondering if there is some way to improve performance. I've read about vectorization, but I can't figure out where to start.

Answer 1

I think you can use shift and a mask:

mask = ((df.column_A == df.column_A.shift()) 
        & (df.column_B == 'START') & (df.column_B.shift() == 'STOP'))
df.loc[mask, 'time'] -= df.time.shift().loc[mask]

The mask selects the rows where the value in 'column_A' equals the value in the previous row (obtained with shift), and where 'column_B' is 'START' while the previous row's 'column_B' is 'STOP'. Using loc lets you update 'time' for all the selected rows at once: the previous row's time (shift again, filtered by the same mask) is subtracted from each selected row's time.
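
To make the role of shift concrete, here is a minimal sketch on a toy Series (the values and the names s, prev and is_start_after_stop are made up for illustration). shift() moves every value down one row, so comparing a column with its shifted copy compares each row with the previous one:

```python
import pandas as pd

# Toy data, invented for illustration only.
s = pd.Series(['STOP', 'START', 'START', 'STOP'])

# shift() moves values down by one position; the first row has no
# predecessor, so it becomes a missing value.
prev = s.shift()

# Row-wise comparison with the previous row, without a Python loop.
is_start_after_stop = (s == 'START') & (prev == 'STOP')
print(is_start_after_stop.tolist())  # [False, True, False, False]
```

Comparisons against the missing first value evaluate to False, which is why row 0 can never be selected by the mask.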

EDIT: with an example:

import pandas as pd
df = pd.DataFrame({'column_A': [0,1,1,2,1,2,2], 'column_B': ['START', 'STOP', 'START','STOP', 'START','STOP', 'START'], 'time':range(7)})
   column_A column_B  time
0         0    START     0
1         1     STOP     1
2         1    START     2
3         2     STOP     3
4         1    START     4
5         2     STOP     5
6         2    START     6

Here rows 2 and 6 meet your condition: each has the same value in column_A as the previous row, and has 'START' in column_B while the previous row has 'STOP'.

After running the code, df is:

   column_A column_B  time
0         0    START   0.0
1         1     STOP   1.0
2         1    START   1.0
3         2     STOP   3.0
4         1    START   4.0
5         2     STOP   5.0
6         2    START   1.0

The value of time at row 2 is now 1 (originally 2, minus the previous row's value of 1), and likewise for row 6 (6 - 5 = 1).
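
If it helps to see which condition selects which row, the example can be broken into its three parts. This is only a sketch: the intermediate names same_A, start_now and stop_before are mine, not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({
    'column_A': [0, 1, 1, 2, 1, 2, 2],
    'column_B': ['START', 'STOP', 'START', 'STOP', 'START', 'STOP', 'START'],
    'time': list(range(7)),
})

# Each condition is a boolean Series with one entry per row.
same_A = df.column_A == df.column_A.shift()     # same column_A as previous row
start_now = df.column_B == 'START'              # this row is a START
stop_before = df.column_B.shift() == 'STOP'     # previous row is a STOP
mask = same_A & start_now & stop_before

print(df.index[mask].tolist())  # [2, 6]

# Subtract the previous row's time only where the mask is True.
df.loc[mask, 'time'] -= df.time.shift().loc[mask]
print(df.time.tolist())
```

Row 4 is a 'START' that follows a 'STOP', but its column_A differs from the previous row's, so same_A excludes it.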

EDIT: for a time comparison, let's create a df with 3000 rows:

df = pd.DataFrame( [['A', 'START', 3], ['A', 'STOP', 6], ['B', 'STOP', 2], 
                    ['C', 'STOP', 1], ['C', 'START', 9], ['C', 'STOP', 7]],
                   columns=['column_A', 'column_B', 'time'] )
df = pd.concat([df]*500)
df.shape
Out[16]: (3000, 3)

Now create two functions for the two methods:

# original method
def espogian (df):
    N = df.shape[0]
    for i in range(1, N):
        if df.column_A.iloc[i] == df.column_A.iloc[i-1]:
            if df.column_B.iloc[i] == 'START' and df.column_B.iloc[i-1] == 'STOP':
                df.time.iloc[i] = df.time.iloc[i] - df.time.iloc[i-1]
    return df
# mine
def ben(df):
    mask = ((df.column_A == df.column_A.shift()) 
        & (df.column_B == 'START') & (df.column_B.shift() == 'STOP'))
    df.loc[mask, 'time'] -= df.time.shift().loc[mask]
    return df

and run timeit:

%timeit espogian (df)
1 loop, best of 3: 8.71 s per loop

%timeit ben (df)
100 loops, best of 3: 4.79 ms per loop

# verify they are equal (each function gets its own copy, because both modify df in place)
df1 = espogian (df.copy())
df2 = ben (df.copy())
(df1==df2).all()
Out[24]: 
column_A    True
column_B    True
time        True
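
Both functions modify the DataFrame they receive and return that same object, so a comparison or timing is only meaningful when each call gets its own copy. Below is a rough sketch of such a comparison outside IPython, using the standard timeit module. The names loop_version and mask_version are mine, the frame is deliberately smaller than the 3000 rows above so the loop finishes quickly, and the loop is written with a single .iloc call per assignment instead of chained indexing; treat it as an illustration, not a benchmark.

```python
import timeit

import pandas as pd

base = pd.DataFrame(
    [['A', 'START', 3], ['A', 'STOP', 6], ['B', 'STOP', 2],
     ['C', 'STOP', 1], ['C', 'START', 9], ['C', 'STOP', 7]],
    columns=['column_A', 'column_B', 'time'],
)
# 300 rows; ignore_index=True gives a clean 0..n-1 index.
df = pd.concat([base] * 50, ignore_index=True)


def loop_version(df):
    # Same logic as the original loop, one .iloc call per read/write.
    t = df.columns.get_loc('time')
    for i in range(1, len(df)):
        if df.column_A.iloc[i] == df.column_A.iloc[i - 1]:
            if df.column_B.iloc[i] == 'START' and df.column_B.iloc[i - 1] == 'STOP':
                df.iloc[i, t] = df.iloc[i, t] - df.iloc[i - 1, t]
    return df


def mask_version(df):
    mask = ((df.column_A == df.column_A.shift())
            & (df.column_B == 'START') & (df.column_B.shift() == 'STOP'))
    df.loc[mask, 'time'] -= df.time.shift().loc[mask]
    return df


# Give each function its own copy so neither sees the other's changes.
df1 = loop_version(df.copy())
df2 = mask_version(df.copy())
pd.testing.assert_frame_equal(df1, df2, check_dtype=False)

# Time each call on a fresh copy (the copy is included in the timing).
t_loop = timeit.timeit(lambda: loop_version(df.copy()), number=3)
t_mask = timeit.timeit(lambda: mask_version(df.copy()), number=3)
print(f'loop: {t_loop:.4f} s, mask: {t_mask:.4f} s')
```

check_dtype=False is used because the masked version may leave the time column as float on some pandas versions while the loop keeps integers; the values are what matter here.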

solution1
3 ACCEPTED 2018-06-27 16:05:47
