I'm facing a performance problem with Python/Pandas. I have a for loop comparing consecutive rows in a Pandas DataFrame:
for i in range(1, N):
    if df.column_A.iloc[i] == df.column_A.iloc[i-1]:
        if df.column_B.iloc[i] == 'START' and df.column_B.iloc[i-1] == 'STOP':
            df.time.iloc[i] = df.time.iloc[i] - df.time.iloc[i-1]
This works properly, but it is extremely slow. My dataframe has around 1M rows, and I'm wondering if there is some way to improve performance. I've read about vectorization, but I can't figure out where to start.
I think you can use shift and a mask:
mask = ((df.column_A == df.column_A.shift())
        & (df.column_B == 'START') & (df.column_B.shift() == 'STOP'))
df.loc[mask, 'time'] -= df.time.shift().loc[mask]
The mask selects the rows where the value in 'column_A' equals the value in the previous row (obtained with shift), and where 'column_B' is 'START' while the previous row's 'column_B' is 'STOP'. Using loc lets you update the 'time' column for all rows selected by the mask at once, subtracting the value from the previous row (shift again, filtered with the same mask).
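To make the role of shift more concrete, here is a minimal sketch on a tiny frame. The frame `toy` and its values are made up for illustration only and are not taken from the question.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate shift(); not from the question.
toy = pd.DataFrame({
    'column_A': ['x', 'x', 'y', 'y'],
    'column_B': ['STOP', 'START', 'STOP', 'START'],
    'time': [10, 25, 40, 70],
})

# shift() moves every value down one row, so row i sees the value of row i-1.
# The first row has no predecessor and gets a missing value.
prev_A = toy['column_A'].shift()
prev_B = toy['column_B'].shift()
prev_time = toy['time'].shift()

# Comparisons against a missing value evaluate to False,
# so the first row is never selected.
mask = (toy['column_A'] == prev_A) & (toy['column_B'] == 'START') & (prev_B == 'STOP')

print(mask.tolist())             # [False, True, False, True]
print(prev_time[mask].tolist())  # [10.0, 40.0]
```

Rows 1 and 3 are selected, and `prev_time[mask]` holds the values that would be subtracted from them.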
EDIT: with an example:
df = pd.DataFrame({'column_A': [0,1,1,2,1,2,2], 'column_B': ['START', 'STOP', 'START','STOP', 'START','STOP', 'START'], 'time':range(7)})
   column_A column_B  time
0         0    START     0
1         1     STOP     1
2         1    START     2
3         2     STOP     3
4         1    START     4
5         2     STOP     5
6         2    START     6
So here rows 2 and 6 meet your condition: each has the same value in column_A as the previous row and 'START' in column_B, while the previous row has 'STOP'.
After running the code you get df:
   column_A column_B  time
0         0    START   0.0
1         1     STOP   1.0
2         1    START   1.0
3         2     STOP   3.0
4         1    START   4.0
5         2     STOP   5.0
6         2    START   1.0
where the value in time at row 2 is 1 (originally 2, minus the value 1 at the previous row), and likewise for row 6 (6 - 5).
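Note that the time column in the output above has become float, because shift introduces a missing value in the first row. If keeping the integer dtype matters, one possible variant (a sketch, not the only way) is to pass `fill_value=0` to shift and use `where` instead of an in-place `loc` update. The choice of 0 as a filler is an assumption on my part; it is harmless because the first row can never be selected by the mask.

```python
import pandas as pd

df = pd.DataFrame({'column_A': [0, 1, 1, 2, 1, 2, 2],
                   'column_B': ['START', 'STOP', 'START', 'STOP',
                                'START', 'STOP', 'START'],
                   'time': range(7)})

mask = ((df['column_A'] == df['column_A'].shift())
        & (df['column_B'] == 'START')
        & (df['column_B'].shift() == 'STOP'))

# fill_value=0 keeps the shifted column integer-typed; the filled first row
# is never selected by the mask, so the 0 is never actually used.
prev_time = df['time'].shift(fill_value=0)

# where() keeps the original value where ~mask is True and takes the
# difference everywhere else (i.e. on the masked rows).
df['time'] = df['time'].where(~mask, df['time'] - prev_time)

print(df['time'].tolist())  # [0, 1, 1, 3, 4, 5, 1]
```

The values match the result shown above, only without the conversion to float.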
EDIT: for a time comparison, let's create a df with 3000 rows:
df = pd.DataFrame([['A', 'START', 3], ['A', 'STOP', 6], ['B', 'STOP', 2],
                   ['C', 'STOP', 1], ['C', 'START', 9], ['C', 'STOP', 7]],
                  columns=['column_A', 'column_B', 'time'])
df = pd.concat([df]*500)
df.shape
Out[16]: (3000, 3)
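One thing worth knowing about this setup: `pd.concat([df]*500)` repeats the original index labels 0 to 5, so the resulting index is not unique. Since the vectorized update goes through label-based `loc`, a unique index is the safer choice in general. A small sketch of the difference, using the names `base`, `repeated` and `clean` purely for illustration:

```python
import pandas as pd

base = pd.DataFrame([['A', 'START', 3], ['A', 'STOP', 6], ['B', 'STOP', 2],
                     ['C', 'STOP', 1], ['C', 'START', 9], ['C', 'STOP', 7]],
                    columns=['column_A', 'column_B', 'time'])

# Plain concat keeps each piece's labels, so 0..5 appear 500 times each.
repeated = pd.concat([base] * 500)
print(repeated.index.is_unique)  # False

# ignore_index=True renumbers the rows 0..2999.
clean = pd.concat([base] * 500, ignore_index=True)
print(clean.index.is_unique)     # True
print(clean.shape)               # (3000, 3)
```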
now create two functions with the two methods:
# original method
def espogian (df):
    N = df.shape[0]
    for i in range(1, N):
        if df.column_A.iloc[i] == df.column_A.iloc[i-1]:
            if df.column_B.iloc[i] == 'START' and df.column_B.iloc[i-1] == 'STOP':
                df.time.iloc[i] = df.time.iloc[i] - df.time.iloc[i-1]
    return df
# mine
def ben(df):
    mask = ((df.column_A == df.column_A.shift())
            & (df.column_B == 'START') & (df.column_B.shift() == 'STOP'))
    df.loc[mask, 'time'] -= df.time.shift().loc[mask]
    return df
and run timeit:
%timeit espogian (df)
1 loop, best of 3: 8.71 s per loop
%timeit ben (df)
100 loops, best of 3: 4.79 ms per loop
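# verify they are equal
# (both functions modify their argument in place and return it, so pass
# independent copies; otherwise df1 and df2 would be the same object)
df1 = espogian (df.copy())
df2 = ben (df.copy())
(df1==df2).all()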
Out[24]:
column_A True
column_B True
time True
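Because both functions change the frame they are given and return that same frame, an equality check is only meaningful when each method runs on its own copy. Below is a self-contained sketch of such a check. The names `loop_version` and `vectorized_version` are mine, the loop is rewritten on plain Python lists so it runs quickly, and the time values are floats to avoid any dtype difference between the two results; these are assumptions of the sketch, not part of the original code.

```python
import pandas as pd

def loop_version(df):
    # Same logic as the question's loop, run on plain lists and on a copy
    # so the caller's frame is left untouched.
    out = df.copy()
    a = out['column_A'].tolist()
    b = out['column_B'].tolist()
    t = out['time'].tolist()
    for i in range(1, len(t)):
        if a[i] == a[i - 1] and b[i] == 'START' and b[i - 1] == 'STOP':
            t[i] = t[i] - t[i - 1]
    out['time'] = t
    return out

def vectorized_version(df):
    # shift/mask approach from the answer, also applied to a copy.
    out = df.copy()
    mask = ((out['column_A'] == out['column_A'].shift())
            & (out['column_B'] == 'START')
            & (out['column_B'].shift() == 'STOP'))
    out.loc[mask, 'time'] -= out['time'].shift().loc[mask]
    return out

base = pd.DataFrame([['A', 'START', 3.0], ['A', 'STOP', 6.0], ['B', 'STOP', 2.0],
                     ['C', 'STOP', 1.0], ['C', 'START', 9.0], ['C', 'STOP', 7.0]],
                    columns=['column_A', 'column_B', 'time'])
df = pd.concat([base] * 500, ignore_index=True)
original = df.copy()

df1 = loop_version(df)
df2 = vectorized_version(df)

# assert_frame_equal raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(df1, df2)
# The input itself must be unchanged by either function.
pd.testing.assert_frame_equal(df, original)

changed = int((df1['time'] != df['time']).sum())
print(df1 is df2)  # False
print(changed)     # 500
```

Only the 'C', 'START' row of each repeated block follows a 'C', 'STOP' row, which is why exactly 500 rows change.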