
Fastest way to compare row and previous row in pandas dataframe with millions of rows

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.

As an example, this is a simplified version of my problem:

   User  Time                 Col1  newcol1  newcol2  newcol3  newcol4
0     1     6     [cat, dog, goat]        0        0        0        0
1     1     6         [cat, sheep]        0        0        0        0
2     1    12        [sheep, goat]        0        0        0        0
3     2     3          [cat, lion]        0        0        0        0
4     2     5  [fish, goat, lemur]        0        0        0        0
5     3     9           [cat, dog]        0        0        0        0
6     4     4          [dog, goat]        0        0        0        0
7     4    11                [cat]        0        0        0        0

At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.

Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this, but it's pretty similar to what I'm actually doing in an IPython notebook):

def myJFunc(df):
    # initialize jnum counter
    jnum = 0
    # loop through each row of the dataframe (not including the first/zeroth)
    for i in range(1, len(df)):
        # has the user stayed the same?
        if df.User.loc[i] == df.User.loc[i-1]:
            # has time increased by more than 1 (hour)?
            if abs(df.Time.loc[i] - df.Time.loc[i-1]) > 1:
                # update new columns (df.loc[row, col] avoids chained assignment)
                df.loc[i-1, 'newcol2'] = 1
                df.loc[i, 'newcol1'] = 1
                # increase jnum
                jnum += 1
            # has content changed?
            if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
                # record this change (old first element, new first element)
                df.loc[i-1, 'newcol4'] = [df.Col1.loc[i-1][0], df.Col1.loc[i][0]]
        # different user?
        else:
            # update new columns
            df.loc[i, 'newcol1'] = 1
            df.loc[i-1, 'newcol2'] = 1
            # store jnum elsewhere (code not included here) and reset jnum
            jnum = 1

I now need to apply this function to several million rows and it's impossibly slow, so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions, but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up, or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!

(Regarding .loc: I compared .loc, .iloc and .ix, and .loc was marginally faster, which is the only reason I'm using it currently.)

(Also, my User column is actually unicode, not int, which could be problematic for fast comparisons.)

I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby just has the effect of putting a NaN in the first row of each group whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just a sketch of some basic techniques.)

df['time_diff'] = df.groupby('User')['Time'].diff()
df['Col1_0'] = df['Col1'].apply(lambda x: x[0])
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()

   User  Time                 Col1  time_diff Col1_0 Col1_0_prev
0     1     6     [cat, dog, goat]        NaN    cat         NaN
1     1     6         [cat, sheep]          0    cat         cat
2     1    12        [sheep, goat]          6  sheep         cat
3     2     3          [cat, lion]        NaN    cat         NaN
4     2     5  [fish, goat, lemur]          2   fish         cat
5     3     9           [cat, dog]        NaN    cat         NaN
6     4     4          [dog, goat]        NaN    dog         NaN
7     4    11                [cat]          7    cat         dog

As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this, you only have to do the expensive extraction once, and after that you can stick to standard pandas methods.
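For instance (a hypothetical continuation of this sketch, not part of the original answer), the helper columns could replace the row loop entirely. Assuming the frame is sorted by User and Time, a "boundary" can be flagged wherever the user changed (the grouped diff is NaN) or the within-user gap exceeds 1:

# a row starts a new group when the user changes (groupby diff gives NaN)
# or when the gap within the same user is more than 1
session_start = df['time_diff'].isna() | (df['time_diff'].abs() > 1)

# note: unlike the original loop, this also flags the very first row
df['newcol1'] = session_start.astype(int)

# newcol2 marks the row *before* each boundary; shift(-1) looks one row ahead
df['newcol2'] = session_start.shift(-1, fill_value=False).astype(int)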

Use pandas constructs and vectorize your code, i.e. don't use for loops; instead use pandas/numpy functions.

'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.

Calculate these separately:

df['newcol1'] = df['User'].shift() == df['User']
df.loc[0, 'newcol1'] = True  # possibly tweak the first row??

df['newcol2'] = (df['Time'].shift() - df['Time']).abs() > 1
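These boolean columns can then be combined to match the original logic, e.g. (a hypothetical combination; 'boundary' is an illustrative name, not from the question):

same_user = df['User'].shift() == df['User']
big_gap = (df['Time'].shift() - df['Time']).abs() > 1

# a boundary row: the user changed, or the same user has a gap greater than 1
df['boundary'] = ~same_user | (same_user & big_gap)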

The purpose of Col1 is unclear to me, but general Python objects in columns don't scale well (you can't use the fast paths and the contents are scattered in memory). Most of the time you can get away with using something else...
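For example (a sketch, assuming only the first element of each list matters; Col1_first is an illustrative name), you can extract that element once into a flat column and then stay on the fast paths:

# .str[0] pulls the first element out of each list in the object column
df['Col1_first'] = df['Col1'].str[0]

# plain string columns compare quickly; e.g. flag changes within each user
changed = df['Col1_first'] != df.groupby('User')['Col1_first'].shift()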


Cython is the very last option, and not needed in 99% of use-cases, but see the enhancing performance section of the docs for tips.

In your problem, it seems like you want to iterate through rows pairwise. The first thing you could do is something like this:

from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    # the standard itertools recipe (on Python 3, zip replaces izip)
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    # your stuff here

However, you cannot modify row1 and row2 directly; you will still need to use .loc or .iloc with the indexes.
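For example (a hypothetical fragment applying the question's conditions), write back through the frame using the indexes that iterrows yields:

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    # same user with a time gap greater than 1: flag both rows of the pair
    if row1['User'] == row2['User'] and abs(row2['Time'] - row1['Time']) > 1:
        df.loc[idx1, 'newcol2'] = 1
        df.loc[idx2, 'newcol1'] = 1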

If iterrows is still too slow, I suggest doing something like this:

  • Create a user_id column from your unicode names using pd.unique(df['User']) and a dictionary mapping the names to integer ids.

  • Create a delta dataframe: subtract the original dataframe from a shifted copy of the user_id and Time columns:

     df[['user_id', 'Time']].shift() - df[['user_id', 'Time']]

A nonzero user_id delta means the user changed between two consecutive rows. The Time column can be filtered directly with delta[delta['Time'] > 1]. With this delta dataframe you record the changes row-wise, and you can use it as a mask to update the columns you need in your original dataframe, as in the sketch below.
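Here is a minimal sketch of that approach (column names like user_id are illustrative; pd.factorize builds the name-to-integer mapping for you, and the frame is assumed sorted by User and then Time):

import pandas as pd

# map unicode user names to integer ids
df['user_id'] = pd.factorize(df['User'])[0]

# delta frame: shifted rows minus current rows
delta = df[['user_id', 'Time']].shift() - df[['user_id', 'Time']]

# nonzero (or NaN, in the very first row) delta => the user changed
user_changed = delta['user_id'] != 0
# within-user gap of more than 1
big_gap = delta['Time'].abs() > 1

# use the masks to update the original frame in one vectorized step
df['newcol1'] = (user_changed | big_gap).astype(int)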
