Finding top 5 values based on another column pandas
How do I find the top 5 values of the category column while grouping by the customer_id column in a pandas DataFrame?
    customer_id  email                              address_id  name            category
0   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi
1   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action
2   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Sports
3   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi
4   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Family
5   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action
6   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary
7   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Action
8   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports
9   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Scifi
10  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary
11  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports
I want another column named preferred_film_category for each unique customer_id (the top 5 values are based on how many times a particular category occurs for each unique customer_id).
Expected DataFrame:
    customer_id  email                              address_id  name            category     preferred_film_category
0   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi        Scifi, Action, Sports, Animation, Drama
1   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action       Scifi, Action, Sports, Animation, Drama
2   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Sports       Scifi, Action, Sports, Animation, Drama
3   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi        Scifi, Action, Sports, Animation, Drama
4   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Family       Scifi, Action, Sports, Animation, Drama
5   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action       Scifi, Action, Sports, Animation, Drama
6   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary  Documentary, Sports, Scifi, Action
7   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Action       Documentary, Sports, Scifi, Action
8   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports       Documentary, Sports, Scifi, Action
9   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Scifi        Documentary, Sports, Scifi, Action
10  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary  Documentary, Sports, Scifi, Action
11  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports       Documentary, Sports, Scifi, Action
Try value_counts + groupby nlargest to get the highest-frequency categories, then groupby aggregate to convert them to a string, then join to merge the result back into the original DataFrame:
n = 2
df = df.join(
    df.value_counts(['customer_id', 'category'])
      .groupby(level=0).nlargest(n)
      .reset_index('category')
      .groupby(level=0)['category'].agg(', '.join)
      .rename('preferred_film_category'),
    on='customer_id'
)
df:
    customer_id  email                              address_id  name            category     preferred_film_category
0   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi        Action, Scifi
1   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action       Action, Scifi
2   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Sports       Action, Scifi
3   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi        Action, Scifi
4   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Family       Action, Scifi
5   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action       Action, Scifi
6   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary  Documentary, Sports
7   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Action       Documentary, Sports
8   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports       Documentary, Sports
9   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Scifi        Documentary, Sports
10  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary  Documentary, Sports
11  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports       Documentary, Sports
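The chained expression above can also be read one step at a time. The following is a minimal sketch of the same calls with the intermediate results named; the variable names counts, top, labels, and result are illustrative and not part of the original answer, and only the two columns the computation needs are reproduced from the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [411] * 6 + [527] * 6,
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports'],
})
n = 2

# Step 1: count how often each (customer_id, category) pair occurs.
counts = df.value_counts(['customer_id', 'category'])

# Step 2: within each customer (index level 0), keep the n largest counts.
top = counts.groupby(level=0).nlargest(n)

# Step 3: move category out of the index, then join each customer's
# categories into a single comma-separated string.
labels = (
    top.reset_index('category')
       .groupby(level=0)['category'].agg(', '.join)
       .rename('preferred_film_category')
)

# Step 4: attach the per-customer string to every row of that customer.
result = df.join(labels, on='customer_id')
print(result)
```

The sketch makes no assumption about the order in which tied categories appear inside the joined string (Action and Scifi both occur twice for customer 411).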
Note: n is set to 2 because each customer has only 4 unique values in category, so 5 would not demonstrate how the code works. Change n to the desired number of values to keep (5).
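To see what happens when n is larger than the number of distinct categories a customer has, here is a small alternative sketch. It is not the method from the answer: it uses a hypothetical helper, top_categories, that calls value_counts on each customer's categories and takes the first n labels. It assumes the same sample data, reduced to the two relevant columns.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [411] * 6 + [527] * 6,
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports'],
})


def top_categories(categories: pd.Series, n: int = 5) -> str:
    """Hypothetical helper: join the n most frequent values into one string."""
    # value_counts sorts by frequency, most common first; head(n) returns
    # fewer than n entries when fewer distinct values exist.
    return ', '.join(categories.value_counts().head(n).index)


n = 5
preferred = (
    df.groupby('customer_id')['category']
      .apply(top_categories, n=n)
      .rename('preferred_film_category')
)
result = df.join(preferred, on='customer_id')
print(result)
```

With n = 5 each customer in the sample ends up with all 4 of their categories, most frequent first, which is why the answer uses n = 2 for the demonstration.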
DataFrame used:
import pandas as pd

df = pd.DataFrame({
    'customer_id': [411, 411, 411, 411, 411, 411, 527, 527, 527, 527, 527, 527],
    'email': ['NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org'],
    'address_id': [416, 416, 416, 416, 416, 416, 533, 533, 533, 533, 533, 533],
    'name': ['NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN',
             'CORY MEEHAN', 'CORY MEEHAN'],
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports']
})
df:
    customer_id  email                              address_id  name            category
0   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi
1   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action
2   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Sports
3   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Scifi
4   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Family
5   411          NORMAN.CURRIER@sakilacustomer.org  416         NORMAN CURRIER  Action
6   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary
7   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Action
8   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports
9   527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Scifi
10  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Documentary
11  527          CORY.MEEHAN@sakilacustomer.org     533         CORY MEEHAN     Sports