pandas sort within group then aggregation
I am doing query analysis for a search engine. Within one session, a user may search for different queries one after another on the Google search engine at different times.
I have data with several fields: session_id, log_time, query, feature_i, etc. I want to group by session_id and then concatenate several rows into one, ordered by log_time, so that the output data represents the user's behavior as a time series.
Code:
import pandas as pd

toy_data = pd.DataFrame({'session_id': [1, 2, 1, 2, 3, 3],
                         'log_time': [4, 5, 6, 1, 2, 3],
                         'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
                         'cate_feat_0': ['apple', 'banana'] * 3,
                         'num_feat_0': [1, 2, 3, 4, 5, 6]})
print(toy_data)
Output:
session_id log_time query cate_feat_0 num_feat_0
0 1 4 hi apple 1
1 2 5 dude banana 2
2 1 6 pandas apple 3
3 2 1 groupby banana 4
4 3 2 sort apple 5
5 3 3 agg banana 6
What I want:
## note that all lists are sorted by log_time within each session_id group
session_id query_list log_time_list cate_feat_0_list num_feat_0_list
1 [hi, pandas] [4,6] [apple, apple] [1,3]
2 [groupby, dude] [1,5] [banana, banana] [4,2]
3 [sort,agg] [2,3] [apple, banana] [5,6]
First we group by session_id and aggregate each column into a list:
toy_data_res = toy_data.groupby('session_id').agg({'query':list, 'log_time':list, 'cate_feat_0':list, 'num_feat_0':list})
toy_data_res
Gives:
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [dude, groupby] [5, 1] [banana, banana] [2, 4]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
Then we sort within each session:
import numpy as np

for i in toy_data_res.index:
    sort_index = np.argsort(toy_data_res.loc[i, 'log_time'])  # time order within the group
    for col in toy_data_res.columns:
        # reorder every list in this row by that time order; .at sets a single cell
        toy_data_res.at[i, col] = [toy_data_res.loc[i, col][j] for j in sort_index]
toy_data_res
Gives:
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [groupby, dude] [1, 5] [banana, banana] [4, 2]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
My approach is quite slow. Is there a better way to do groupby -> sort within group -> aggregation?
Tip: in SQL, STRING_AGG or GROUP_CONCAT can be used to do this kind of within-group sorting.
Use DataFrame.sort_values before groupby. If you need to apply the same function to every column, you can select the columns with a list of column names:
df = (toy_data.sort_values(['session_id', 'log_time'])
              .groupby('session_id')[['query', 'log_time', 'cate_feat_0', 'num_feat_0']]
              .agg(list))
print(df)
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [groupby, dude] [1, 5] [banana, banana] [4, 2]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
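The question asks for columns named query_list, log_time_list and so on, with session_id as a regular column. A minimal sketch of one way to get that shape from the same sort-then-groupby approach, assuming the toy data from the question; the names value_cols and res are only illustrative:

```python
import pandas as pd

toy_data = pd.DataFrame({
    'session_id': [1, 2, 1, 2, 3, 3],
    'log_time': [4, 5, 6, 1, 2, 3],
    'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
    'cate_feat_0': ['apple', 'banana'] * 3,
    'num_feat_0': [1, 2, 3, 4, 5, 6],
})

# Columns to collect into lists (illustrative name, not from the original post)
value_cols = ['query', 'log_time', 'cate_feat_0', 'num_feat_0']

res = (toy_data.sort_values(['session_id', 'log_time'])  # order rows by time within each session
               .groupby('session_id')[value_cols]
               .agg(list)                                 # one list per column per session
               .add_suffix('_list')                       # query -> query_list, etc.
               .reset_index())                            # session_id back as a column
print(res)
```

add_suffix only renames the aggregated columns, so the lists themselves are the same as in the output above.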
Try sorting by session_id and log_time before groupby:
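import pandas as pd

df = pd.DataFrame({'session_id': [1, 2, 1, 2, 3, 3],
                   'log_time': [4, 5, 6, 1, 2, 3],
                   'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
                   'cate_feat_0': ['apple', 'banana'] * 3,
                   'num_feat_0': [1, 2, 3, 4, 5, 6]})
# sort first so that rows are already in time order inside each session
df = df.sort_values(by=['session_id', 'log_time'])
# select the columns with a list (double brackets), then collect each into a list
grouped = df.groupby('session_id')[['log_time', 'query', 'cate_feat_0', 'num_feat_0']].agg(list)
print(grouped)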
Output:
log_time query cate_feat_0 num_feat_0
session_id
1 [4, 6] [hi, pandas] [apple, apple] [1, 3]
2 [1, 5] [groupby, dude] [banana, banana] [4, 2]
3 [2, 3] [sort, agg] [apple, banana] [5, 6]
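This works because groupby keeps rows in their existing order inside each group and, by default, sorts the group keys itself. So sorting by log_time alone should give the same result. The sketch below checks that on the toy data; the names cols, by_both and by_time are only illustrative, and kind='stable' is an added precaution so that rows with equal log_time keep their original relative order:

```python
import pandas as pd

df = pd.DataFrame({
    'session_id': [1, 2, 1, 2, 3, 3],
    'log_time': [4, 5, 6, 1, 2, 3],
    'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
    'cate_feat_0': ['apple', 'banana'] * 3,
    'num_feat_0': [1, 2, 3, 4, 5, 6],
})
cols = ['log_time', 'query', 'cate_feat_0', 'num_feat_0']

# Sort by both keys, as in the answer above
by_both = (df.sort_values(['session_id', 'log_time'])
             .groupby('session_id')[cols]
             .agg(list))

# Sort by time only; groupby orders the session_id keys on its own
by_time = (df.sort_values('log_time', kind='stable')
             .groupby('session_id')[cols]
             .agg(list))

# Compare the two results cell by cell via plain dicts
same = by_time.to_dict() == by_both.to_dict()
print(same)  # True
```

Sorting on a single key is a little less work on large data, but whether the difference is noticeable would need to be measured on the real logs.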