
Improve performance of a for loop comparing pandas dataframe rows

I'm facing a performance problem with Python/Pandas. I have a for loop comparing consecutive rows in a Pandas DataFrame:

for i in range(1, N):
    if df.column_A.iloc[i] == df.column_A.iloc[i-1]:
        if df.column_B.iloc[i] == 'START' and df.column_B.iloc[i-1] == 'STOP':
            df.time.iloc[i] = df.time.iloc[i] - df.time.iloc[i-1]

This works properly but is extremely slow. My DataFrame has around 1M rows, and I'm wondering if there is some way to improve performance. I've read about vectorization, but I can't figure out where to start.

I think you can use shift and a mask:

mask = ((df.column_A == df.column_A.shift()) 
        & (df.column_B == 'START') & (df.column_B.shift() == 'STOP'))
df.loc[mask, 'time'] -= df.time.shift().loc[mask]

The mask selects the rows where the value in 'column_A' equals the value in the previous row (obtained with shift), 'column_B' is 'START', and 'column_B' in the previous row is 'STOP'. Using loc with this mask lets you update 'time' for all selected rows at once, subtracting the 'time' value of the previous row (shift again, filtered by the same mask).
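To see how shift lines each row up with the one before it, here is a minimal sketch on a toy Series. The names s, side_by_side and is_start_after_stop are only illustrative and do not come from the question.

```python
import pandas as pd

# Toy column of START/STOP markers (illustrative data, not from the question)
s = pd.Series(["STOP", "START", "STOP", "START"])

# shift() moves every value down one row, so row i sees the value of row i-1.
# Row 0 has no predecessor and gets a missing value.
side_by_side = pd.DataFrame({"current": s, "previous": s.shift()})

# Comparing against a missing value yields False, so row 0 is never selected.
is_start_after_stop = (side_by_side["current"] == "START") & (
    side_by_side["previous"] == "STOP"
)
print(side_by_side.assign(selected=is_start_after_stop))
```

The same pattern, combined with the 'column_A' comparison, gives the full mask used in the answer.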

EDIT: with an example:

import pandas as pd

df = pd.DataFrame({'column_A': [0,1,1,2,1,2,2],
                   'column_B': ['START', 'STOP', 'START','STOP', 'START','STOP', 'START'],
                   'time':range(7)})
   column_A column_B  time
0         0    START     0
1         1     STOP     1
2         1    START     2
3         2     STOP     3
4         1    START     4
5         2     STOP     5
6         2    START     6

Here rows 2 and 6 meet your condition: each has the same value in column_A as the previous row and 'START' in column_B, while the previous row has 'STOP'.

After running the code, df is:

   column_A column_B  time
0         0    START   0.0
1         1     STOP   1.0
2         1    START   1.0
3         2     STOP   3.0
4         1    START   4.0
5         2     STOP   5.0
6         2    START   1.0

The value of time at row 2 is now 1 (originally 2, minus the 1 from the previous row), and likewise at row 6 (6 - 5).
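The time column shows up as float in the output above, most likely because shift() puts a NaN in the first row and that forces the shifted values to float. If integer times matter, one possible variation (a sketch, not part of the original answer) is to pass fill_value to shift and subtract a masked copy of the previous values. The name prev_time is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "column_A": [0, 1, 1, 2, 1, 2, 2],
    "column_B": ["START", "STOP", "START", "STOP", "START", "STOP", "START"],
    "time": list(range(7)),
})

mask = (
    (df["column_A"] == df["column_A"].shift())
    & (df["column_B"] == "START")
    & (df["column_B"].shift() == "STOP")
)

# fill_value=0 avoids the NaN in row 0, so the shifted Series stays integer.
# Row 0 can never satisfy the mask, so the filler value is never subtracted.
prev_time = df["time"].shift(fill_value=0)

# Subtract the previous time where the mask is True, and 0 everywhere else.
df["time"] = df["time"] - prev_time.where(mask, 0)
print(df)
```

This writes the whole column in one assignment instead of using loc, which is a design choice made here only to keep the dtype stable.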

EDIT: for a time comparison, let's create a df with 3000 rows:

df = pd.DataFrame( [['A', 'START', 3], ['A', 'STOP', 6], ['B', 'STOP', 2], 
                    ['C', 'STOP', 1], ['C', 'START', 9], ['C', 'STOP', 7]],
                   columns=['column_A', 'column_B', 'time'] )
df = pd.concat([df]*500)
df.shape
Out[16]: (3000, 3)

Now create two functions, one for each method:

# original method: row-by-row loop from the question
def espogian(df):
    N = df.shape[0]
    for i in range(1, N):
        # same group as the previous row?
        if df.column_A.iloc[i] == df.column_A.iloc[i-1]:
            # a START that directly follows a STOP?
            if df.column_B.iloc[i] == 'START' and df.column_B.iloc[i-1] == 'STOP':
                df.time.iloc[i] = df.time.iloc[i] - df.time.iloc[i-1]
    return df

# mine: vectorized with shift and a boolean mask
def ben(df):
    mask = ((df.column_A == df.column_A.shift())
        & (df.column_B == 'START') & (df.column_B.shift() == 'STOP'))
    # subtract the previous row's time on the selected rows only
    df.loc[mask, 'time'] -= df.time.shift().loc[mask]
    return df

and run timeit:

%timeit espogian (df)
1 loop, best of 3: 8.71 s per loop

%timeit ben (df)
100 loops, best of 3: 4.79 ms per loop

# verify they are equal; pass copies so the two results are separate objects
df1 = espogian (df.copy())
df2 = ben (df.copy())
(df1==df2).all()
Out[24]: 
column_A    True
column_B    True
time        True

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 