My data is like and I want to calculate the NDCG
metric
pid query title label predict prob
0 1 A a 0 2 0.9
1 1 A b 2 0 0.8
2 1 A c 1 2 0.7
3 2 D d 3 1 0.8
4 2 D e 4 2 0.9
After groupby
opration
each_pid = data.groupby(data['pid']).apply(lambda g: g.sort_values(['predict', 'prob'], ascending=False))
I got a dataframe like this.
pid query title label predict prob score
pid
1 0 1 A a 0 2 0.9 6
2 1 A c 1 2 0.7 18
1 1 A b 2 0 0.8 42
2 4 2 D e 4 2 0.9 186
3 2 D d 3 1 0.8 90
Now I want another column names weight
which's value is according to the order of each group as following.
pid query title label predict prob score weight
pid
1 0 1 A a 0 2 0.9 6 1
2 1 A c 1 2 0.7 18 2
1 1 A b 2 0 0.8 42 3
2 4 2 D e 4 2 0.9 186 1
3 2 D d 3 1 0.8 90 2
And how to get top2 rows of each group like
pid query title label predict prob score weight
pid
1 0 1 A a 0 2 0.9 6 1
2 1 A c 1 2 0.7 18 2
2 4 2 D e 4 2 0.9 186 1
3 2 D d 3 1 0.8 90 2
Can anyone help?
EDIT. Thanks for @Akshay Kandul and @Allen help. The code works as
data = data.groupby(level=0).head(2)
data['weight'] = data.groupby(level=0).cumcount()+1
You can use cumcount function on group by of index level zero
each_pid['weight'] = each_pid.groupby(level=[0]).cumcount()
output
label pid predict prob query title weight
pid
1 0 0 1 2 0.9 A a 0
2 1 1 2 0.7 A c 1
1 2 1 0 0.8 A b 2
2 4 4 2 2 0.9 D e 0
3 3 2 1 0.8 D d 1
If your really want it to start with 1 then just do this instead:
each_pid['weight'] = each_pid.groupby(level=[0]).cumcount()+1
output
label pid predict prob query title weight
pid
1 0 0 1 2 0.9 A a 1
2 1 1 2 0.7 A c 2
1 2 1 0 0.8 A b 3
2 4 4 2 2 0.9 D e 1
3 3 2 1 0.8 D d 2
If you need top 2 records of each group you can use this code:
print(each_pid.groupby(level=0).head(2))
Based on each_pid you've already got, you can use apply to calculate the order:
each_pid['weight'] = \
(
each_pid.groupby(level=0)
.apply(lambda x: pd.Series(range(1,len(x)+1)))
.values
)
each_pid
Out[263]:
pid query title label predict prob weight
pid
1 0 1 A a 0 2 0.9 1
2 1 A c 1 2 0.7 2
1 1 A b 2 0 0.8 3
2 4 2 D e 4 2 0.9 1
3 2 D d 3 1 0.8 2
If you need top 2 of each group, you can do:
each_pid.groupby(level=0).head(2)
Out[273]:
pid query title label predict prob weight
pid
1 0 1 A a 0 2 0.9 1
2 1 A c 1 2 0.7 2
2 4 2 D e 4 2 0.9 1
3 2 D d 3 1 0.8 2
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.