
Fastest way to compare row and previous row in pandas dataframe with millions of rows

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.

As an example, this is a simplified version of my problem:

   User  Time                 Col1  newcol1  newcol2  newcol3  newcol4
0     1     6     [cat, dog, goat]        0        0        0        0
1     1     6         [cat, sheep]        0        0        0        0
2     1    12        [sheep, goat]        0        0        0        0
3     2     3          [cat, lion]        0        0        0        0
4     2     5  [fish, goat, lemur]        0        0        0        0
5     3     9           [cat, dog]        0        0        0        0
6     4     4          [dog, goat]        0        0        0        0
7     4    11                [cat]        0        0        0        0

At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.

Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this, but it's pretty similar to what I'm actually doing in my IPython notebook):

def myJFunc(df):
    # initialize jnum counter
    jnum = 0
    # loop through each row of the dataframe (not including the first/zeroth)
    for i in range(1, len(df)):
        # has the user changed?
        if df.User.loc[i] == df.User.loc[i-1]:
            # has the time increased by more than 1 (hour)?
            if abs(df.Time.loc[i] - df.Time.loc[i-1]) > 1:
                # update new columns
                df['newcol2'].loc[i-1] = 1
                df['newcol1'].loc[i] = 1
                # increase jnum
                jnum += 1
            # has the content changed?
            if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
                # record this change
                df['newcol4'].loc[i-1] = [df.Col1.loc[i-1][0], df.Col2.loc[i][0]]
        # different user?
        elif df.User.loc[i] != df.User.loc[i-1]:
            # update new columns
            df['newcol1'].loc[i] = 1
            df['newcol2'].loc[i-1] = 1
            # store jnum elsewhere (code not included here) and reset jnum
            jnum = 1

I now need to apply this function to several million rows and it's impossibly slow, so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions, but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up, or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!

(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster, so that's the only reason I'm using it currently.)

(Also, my User column in reality is unicode not int, which could be problematic for speedy comparisons)

I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row of each group whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)

df['time_diff'] = df.groupby('User')['Time'].diff()

df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )

df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()

   User  Time                 Col1  time_diff Col1_0 Col1_0_prev
0     1     6     [cat, dog, goat]        NaN    cat         NaN
1     1     6         [cat, sheep]          0    cat         cat
2     1    12        [sheep, goat]          6  sheep         cat
3     2     3          [cat, lion]        NaN    cat         NaN
4     2     5  [fish, goat, lemur]          2   fish         cat
5     3     9           [cat, dog]        NaN    cat         NaN
6     4     4          [dog, goat]        NaN    dog         NaN
7     4    11                [cat]          7    cat         dog

As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this, you only have to do the expensive extraction once, and after that you can stick to standard pandas methods.
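For instance, once Col1_0 and Col1_0_prev exist, flagging a change in the list's first element within a user becomes a plain column comparison (a sketch only; the asker's exact newcol3/newcol4 rules may differ):

# rows where the first list element differs from the previous row of the same user
content_changed = df['Col1_0'].ne(df['Col1_0_prev']) & df['Col1_0_prev'].notna()
df['newcol3'] = content_changed.astype(int)

On the sample frame this flags rows 2, 4 and 7, matching the "has content changed?" branch of the original loop.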

Use pandas (constructs) and vectorize your code, i.e. don't use for loops; instead use pandas/numpy functions.

'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.

Calculate these separately:

df['newcol1'] = df['User'].shift() == df['User']
df.loc[0, 'newcol1'] = True  # possibly tweak the first row??

df['newcol2'] = (df['Time'].shift() - df['Time']).abs() > 1
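To then get something closer to the question's newcol1/newcol2, where newcol2 marks the row just before each flagged row, one hedged way to combine the two conditions is (assuming that reading of the question is right):

user_changed = df['User'] != df['User'].shift()
time_gap = (df['Time'].shift() - df['Time']).abs() > 1

flag = user_changed | time_gap
flag.iloc[0] = False  # the original loop starts at row 1, so the first row is never flagged
df['newcol1'] = flag.astype(int)
# newcol2 marks the row immediately before each flagged row
df['newcol2'] = df['newcol1'].shift(-1, fill_value=0)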

It's unclear to me what the purpose of Col1 is, but general Python objects in columns don't scale well (you can't use the fast paths and the contents are scattered in memory). Most of the time you can get away with using something else...
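If only the first element of each list matters downstream, one option (assuming Col1 really holds Python lists) is to pull it out once into its own column, e.g. with the .str accessor, and keep the object-heavy column out of the hot path:

# .str indexing works element-wise on a column of lists as well as strings
df['Col1_first'] = df['Col1'].str[0]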


Cython is the very last option, and not needed in 99% of use-cases, but see the enhancing performance section of the docs for tips.

In your problem, it seems like you want to iterate through rows pairwise. The first thing you could do is something like this:

from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)  # itertools.izip on Python 2

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    # your stuff

However, you cannot modify row1 and row2 directly; you will still need to use .loc or .iloc with the indexes.
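For example, a hypothetical write-back inside that loop could look like this (idx1 and idx2 are the index labels of the two rows, so .loc can target them):

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    # same user and a gap of more than 1 hour: flag both rows
    if row1['User'] == row2['User'] and abs(row2['Time'] - row1['Time']) > 1:
        df.loc[idx1, 'newcol2'] = 1
        df.loc[idx2, 'newcol1'] = 1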

If iterrows is still too slow, I suggest doing something like this:

  • Create a user_id column from your unicode names using pd.unique(User) and a dictionary mapping each name to an integer id.

  • Create a delta dataframe: subtract the original dataframe from a shifted dataframe with the user_id and Time columns (see the sketch after this list).

     df[[col1, ..]].shift() - df[[col1, ..]]

If the user_id delta is non-zero, it means that the user changed between two consecutive rows. The Time column can be filtered directly with delta[delta['Time'] > 1]. With this delta dataframe you record the changes row-wise, and you can use it as a mask to update the columns you need in your original dataframe.
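A minimal sketch of those steps, with the column names (user_id, Time) and the final update rule treated as assumptions to adapt:

import pandas as pd

# 1. map the unicode user names to integer ids
user_ids = {name: i for i, name in enumerate(pd.unique(df['User']))}
df['user_id'] = df['User'].map(user_ids)

# 2. delta dataframe: shifted frame minus the original
delta = df[['user_id', 'Time']].shift() - df[['user_id', 'Time']]

# 3. a non-zero user_id delta means the user changed between consecutive rows;
#    the Time delta can be filtered for gaps larger than 1
user_changed = delta['user_id'].ne(0) & delta['user_id'].notna()
big_gap = delta['Time'].abs() > 1
mask = user_changed | big_gap

# 4. use the mask to update whatever columns you need in the original frame
df.loc[mask, 'newcol1'] = 1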


 