
Most efficient way to groupby and aggregate based on two conditions in pandas

I have the following dataframe:

user_id  sale     date        refunded
  1      1000    '2016-10-02'   0  
  1      1000    '2016-09-13'   0
  2      1000    '2016-08-11'   0
  2      1000    '2016-10-21'   0
  3      1000    '2016-11-01'   1
  3      1000    '2016-11-01'   1

I need to group by user_id and calculate the sum of sale based on these two conditions:

   date >='2016-10-01'
   refunded==0

I took two different approaches:

    non_refunded = df.refunded == 0
    after_assignment = df.date > '2016-10-01'
    columns = ['user_id', 'sale']
    tt = df.loc[non_refunded & after_assignment][columns].groupby(['user_id']).sum().reset_index()

The other approach is:

columns = ['user_id', 'sale']
tt = df.loc[(df.refunded == 0) & (df.date > '2016-10-01')][columns].groupby(['user_id']).sum().reset_index()

In the first approach I first create two intermediate variables (I am not sure whether they are copies) and then apply the conditions. How do these two approaches compare in terms of speed and memory, and at what scale do they start to show a difference, for example if we have to do this for 30 different dataframes with 100k rows or more?
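One quick way to settle the copy question: the two intermediate variables are boolean Series (row masks), not copies of the dataframe, so the first approach does not duplicate the data. A minimal sketch, rebuilding the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "sale":     [1000] * 6,
    "date":     ["2016-10-02", "2016-09-13", "2016-08-11",
                 "2016-10-21", "2016-11-01", "2016-11-01"],
    "refunded": [0, 0, 0, 0, 1, 1],
})

non_refunded = df.refunded == 0
# lexicographic string comparison works here because the dates are ISO-formatted
after_assignment = df.date > "2016-10-01"

# Each mask is one boolean per row, not a copy of the whole frame
print(type(non_refunded).__name__)   # Series
print(non_refunded.dtype)            # bool
```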

Are you using an IPython interpreter? If so, you can use the %timeit magic to measure how long it takes to execute one line of code. Your two approaches seem to do exactly the same thing, so I wouldn't expect any performance difference.

For readability I would use the second approach:

%timeit df.loc[(df.refunded == 0) & (df.date > '2016-10-01')].groupby('user_id').sum()

Pandas won't struggle with 100k-row dataframes on a reasonably modern laptop.
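Outside of IPython, the same comparison can be made with the standard-library timeit module. A rough benchmark sketch on synthetic data (the 100k-row frame, value ranges, and repeat count are assumptions, not from the question):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "user_id": rng.integers(1, 1_000, n),
    "sale": rng.integers(100, 2_000, n),
    "date": pd.Timestamp("2016-01-01")
            + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "refunded": rng.integers(0, 2, n),
})

def two_step():
    # first approach: name the masks, then filter
    non_refunded = df.refunded == 0
    after = df.date > "2016-10-01"
    return df.loc[non_refunded & after, ["user_id", "sale"]].groupby("user_id").sum()

def one_step():
    # second approach: inline the conditions
    return df.loc[(df.refunded == 0) & (df.date > "2016-10-01"),
                  ["user_id", "sale"]].groupby("user_id").sum()

print("two-step:", timeit.timeit(two_step, number=10))
print("one-step:", timeit.timeit(one_step, number=10))
```

Both functions build the same mask and do the same filtering, so any measured gap should be noise.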

I think you can use query:

df.date = pd.to_datetime(df.date)
columns = ['user_id', 'sale']
filtered = df.query('refunded == 0 and date > "2016-10-01"')
tt = filtered[columns].groupby(['user_id']).sum().reset_index()
print (tt)
   user_id  sale
0        1  1000
1        2  1000
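query can also reference Python variables with the @ prefix, which keeps the date literal out of the expression string. A small variation on the above (the name cutoff is my own):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "sale":     [1000] * 6,
    "date":     ["2016-10-02", "2016-09-13", "2016-08-11",
                 "2016-10-21", "2016-11-01", "2016-11-01"],
    "refunded": [0, 0, 0, 0, 1, 1],
})
df["date"] = pd.to_datetime(df["date"])

cutoff = pd.Timestamp("2016-10-01")
# @cutoff pulls the local variable into the query expression
tt = (df.query("refunded == 0 and date > @cutoff")[["user_id", "sale"]]
        .groupby("user_id").sum().reset_index())
print(tt)
#    user_id  sale
# 0        1  1000
# 1        2  1000
```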

Another solution is to remove the `][` and pass the columns directly to loc:

df.date = pd.to_datetime(df.date)
columns = ['user_id', 'sale']
filtered = df.loc[(df.refunded == 0) & (df.date > '2016-10-01'), columns]
tt = filtered.groupby(['user_id']).sum().reset_index()
print (tt)
   user_id  sale
0        1  1000
1        2  1000
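For completeness, the whole pipeline can be written as one chained expression; passing as_index=False to groupby also removes the need for reset_index. A sketch on the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "sale":     [1000] * 6,
    "date":     ["2016-10-02", "2016-09-13", "2016-08-11",
                 "2016-10-21", "2016-11-01", "2016-11-01"],
    "refunded": [0, 0, 0, 0, 1, 1],
})
df["date"] = pd.to_datetime(df["date"])

# loc selects rows and columns in one step; as_index=False keeps
# user_id as a regular column instead of the group index
tt = (df.loc[(df.refunded == 0) & (df.date > "2016-10-01"), ["user_id", "sale"]]
        .groupby("user_id", as_index=False)["sale"].sum())
print(tt)
#    user_id  sale
# 0        1  1000
# 1        2  1000
```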
