
How to sum in pandas by unique index in several columns?

I have a pandas DataFrame which details online activities in terms of "clicks" during a user session. There are as many as 50,000 unique users, and the DataFrame has around 1.5 million samples. Obviously most users have multiple records.

The four columns are: a unique user ID ("User_ID"), the date the user began the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks ("clicks").

The organization of the DataFrame is as follows:

User_ID    Registration  Session      clicks
2349876    2012-02-22    2014-04-24   2 
1987293    2011-02-01    2013-05-03   1 
2234214    2012-07-22    2014-01-22   7 
9874452    2010-12-22    2014-08-22   2 
...

(There is also an index above beginning with 0, but one could set User_ID as the index.)

I would like to aggregate the total number of clicks per user since the Registration date. The resulting DataFrame (or pandas Series) would list User_ID and the total number of clicks ("Total_Clicks").

User_ID    Total_Clicks
2349876    722 
1987293    341
2234214    220 
9874452    1405 
...

How does one do this in pandas? Is this done with .agg()? The clicks for each User_ID need to be summed individually.

As there are 1.5 million records, does this scale?
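
For reference, a minimal reproducible setup for the sample rows above might look like the sketch below (it assumes the date columns are parsed with pd.to_datetime; the real data has ~1.5 million rows and up to 50,000 unique User_ID values):

import pandas as pd

# Small example frame matching the sample in the question.
df = pd.DataFrame({
    'User_ID': [2349876, 1987293, 2234214, 9874452],
    'Registration': pd.to_datetime(['2012-02-22', '2011-02-01',
                                    '2012-07-22', '2010-12-22']),
    'Session': pd.to_datetime(['2014-04-24', '2013-05-03',
                               '2014-01-22', '2014-08-22']),
    'clicks': [2, 1, 7, 2],
})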

IIUC you can use groupby, sum and reset_index:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df.groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2
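
If you also want the column named Total_Clicks, as in the desired output above, one option (a sketch using rename) is:

# Same groupby/sum as above, then rename the summed column to Total_Clicks.
result = df.groupby('User_ID')['clicks'].sum().reset_index()
print(result.rename(columns={'clicks': 'Total_Clicks'}))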

If the first column, User_ID, is the index:

print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2014-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.groupby(level=0)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

Or:

print(df.groupby(df.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

EDIT:

As Alexander pointed out, you need to filter the data before groupby if some Session dates are earlier than the Registration date for a User_ID:

print(df)
   User_ID Registration    Session  clicks
0  2349876   2012-02-22 2014-04-24       2
1  1987293   2011-02-01 2013-05-03       1
2  2234214   2012-07-22 2014-01-22       7
3  9874452   2010-12-22 2014-08-22       2

print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2234214       7
2  2349876       2
3  9874452       2

I changed the third row of the data for a better sample:

print(df)
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
2234214   2012-07-22 2012-01-22       7
9874452   2010-12-22 2014-08-22       2

print(df.Session >= df.Registration)
User_ID
2349876     True
1987293     True
2234214    False
9874452     True
dtype: bool

print(df[df.Session >= df.Registration])
        Registration    Session  clicks
User_ID                                
2349876   2012-02-22 2014-04-24       2
1987293   2011-02-01 2013-05-03       1
9874452   2010-12-22 2014-08-22       2

df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
   User_ID  clicks
0  1987293       1
1  2349876       2
2  9874452       2

The first thing to do is filter out rows whose Session date precedes the Registration date, then group on User_ID and sum.

gb = (df[df.Session >= df.Registration]   # keep only sessions on/after registration
      .groupby('User_ID')
      .clicks.agg(Total_Clicks='sum'))    # sum clicks per user into a Total_Clicks column

>>> gb
         Total_Clicks
User_ID              
1987293             1
2234214             7
2349876             2
9874452             2
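
If you want User_ID back as an ordinary column, matching the output table in the question, a reset_index() on the result does it:

# Move User_ID from the index back into a regular column.
print(gb.reset_index())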

For the use case you mentioned, I believe this is scalable. It always depends, of course, on your available memory.
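
As a rough sanity check of that claim, here is a sketch on synthetic data of roughly the size described in the question (1.5 million rows, about 50,000 unique users); the IDs and click counts are made up for illustration:

import numpy as np
import pandas as pd

# Synthetic data roughly matching the sizes mentioned in the question.
rng = np.random.default_rng(0)
n_rows, n_users = 1_500_000, 50_000
big = pd.DataFrame({
    'User_ID': rng.integers(0, n_users, size=n_rows),
    'clicks': rng.integers(1, 10, size=n_rows),
})

# Grouping 1.5M rows by ~50k keys is a routine single-machine workload for pandas;
# on typical hardware it completes in well under a second.
totals = big.groupby('User_ID')['clicks'].sum()
print(totals.head())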

Assuming your DataFrame is named df, do the following:

# as_index=False keeps User_ID as a regular column alongside the summed clicks
df.groupby(['User_ID'], as_index=False)[['clicks']].sum()
