
Finding the top 5 values based on another column in pandas

How can I find the top 5 values of the category column for each customer_id group in a pandas DataFrame?

       customer_id     email                        address_id  name              category

0         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Scifi
1         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Action
2         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Sports
3         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Scifi
4         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Family
5         411   NORMAN.CURRIER@sakilacustomer.org       416     NORMAN CURRIER      Action
6         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Documentary
7         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Action
8         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Sports
9         527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Scifi
10        527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Documentary
11        527   CORY.MEEHAN@sakilacustomer.org          533     CORY MEEHAN         Sports

I want another column named preferred_film_category for each unique customer_id (the top 5 values are based on how many times each category occurs for that customer_id).

Expected DataFrame:

       customer_id     email                     address_id    name       category      preferred_film_category   

0       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Scifi        Scifi, Action, Sports, Animation, Drama    
1       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Action       Scifi, Action, Sports, Animation, Drama 
2       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Sports       Scifi, Action, Sports, Animation, Drama 
3       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Scifi        Scifi, Action, Sports, Animation, Drama 
4       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Family       Scifi, Action, Sports, Animation, Drama 
5       411    NORMAN.CURRIER@sakilacustomer.org   416   NORMAN CURRIER    Action       Scifi, Action, Sports, Animation, Drama
6       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Documentary  Documentary, Sports, Scifi, Action 
7       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Action       Documentary, Sports, Scifi, Action 
8       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Sports       Documentary, Sports, Scifi, Action 
9       527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Scifi        Documentary, Sports, Scifi, Action 
10      527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Documentary  Documentary, Sports, Scifi, Action 
11      527     CORY.MEEHAN@sakilacustomer.org     533   CORY MEEHAN       Sports       Documentary, Sports, Scifi, Action 

Try value_counts + groupby nlargest to get the highest-frequency categories, then groupby aggregate to convert them to a string, then join to merge back with the original DataFrame:

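n = 2  # number of top categories to keep per customer
df = df.join(
    # count each (customer_id, category) pair
    df.value_counts(['customer_id', 'category'])
        # keep the n most frequent categories per customer
        .groupby(level=0).nlargest(n)
        .reset_index('category')
        # combine each customer's categories into one string
        .groupby(level=0)['category'].agg(', '.join)
        .rename('preferred_film_category'),
    on='customer_id'
)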

df:

    customer_id                              email  address_id            name     category preferred_film_category
0           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi           Action, Scifi
1           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action           Action, Scifi
2           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Sports           Action, Scifi
3           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi           Action, Scifi
4           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Family           Action, Scifi
5           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action           Action, Scifi
6           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary     Documentary, Sports
7           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Action     Documentary, Sports
8           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports     Documentary, Sports
9           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN        Scifi     Documentary, Sports
10          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary     Documentary, Sports
11          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports     Documentary, Sports
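
The chain above may be easier to follow when split into named intermediate steps. The following is a minimal sketch, assuming a cut-down hypothetical DataFrame with only the customer_id and category columns and no ties in the counts; the names counts, top, preferred, and result are illustrative and not part of the original answer.

```python
import pandas as pd

# Hypothetical cut-down frame: only the two columns the chain uses.
df = pd.DataFrame({
    'customer_id': [411, 411, 411, 411, 527, 527, 527],
    'category': ['Scifi', 'Scifi', 'Scifi', 'Action',
                 'Sports', 'Sports', 'Action'],
})
n = 1  # keep only the single most frequent category per customer

# Step 1: count each (customer_id, category) pair.
counts = df.value_counts(['customer_id', 'category'])

# Step 2: keep the n largest counts within each customer_id group.
top = counts.groupby(level=0).nlargest(n)

# Step 3: move category out of the index, then join the names per customer.
preferred = (
    top.reset_index('category')
       .groupby(level=0)['category'].agg(', '.join)
       .rename('preferred_film_category')
)

# Step 4: attach the per-customer string to every row of the original frame.
result = df.join(preferred, on='customer_id')
print(result)
```

Printing counts, top, and preferred one at a time shows how the index changes shape at each step, which is usually where this kind of chain gets confusing.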

Note: n is set to 2 because each customer has only 4 unique values in category, so 5 would not demonstrate how the code works. Change n to the number of values you want to keep (5).
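
To illustrate the point about n: nlargest(n) returns at most n rows per group, so with n = 5 a customer who has fewer than five distinct categories simply gets all of them. The sketch below uses hypothetical data (customer ids 1 and 2, single-letter categories) chosen only for this illustration.

```python
import pandas as pd

# Hypothetical data: customer 1 has three distinct categories, customer 2 has two.
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'category': ['A', 'A', 'A', 'B', 'B', 'C', 'X', 'X', 'Y'],
})

n = 5  # larger than any customer's number of distinct categories
preferred = (
    df.value_counts(['customer_id', 'category'])
      .groupby(level=0).nlargest(n)        # returns fewer than n rows if fewer exist
      .reset_index('category')
      .groupby(level=0)['category'].agg(', '.join)
      .rename('preferred_film_category')
)
print(preferred)
```

No error is raised when a group is smaller than n; each customer's string just lists every category that customer has.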


DataFrame used:

import pandas as pd

df = pd.DataFrame({
    'customer_id': [411, 411, 411, 411, 411, 411, 527, 527, 527, 527, 527, 527],
    'email': ['NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'NORMAN.CURRIER@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org',
              'CORY.MEEHAN@sakilacustomer.org'],
    'address_id': [416, 416, 416, 416, 416, 416, 533, 533, 533, 533, 533, 533],
    'name': ['NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'NORMAN CURRIER', 'NORMAN CURRIER', 'NORMAN CURRIER',
             'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN', 'CORY MEEHAN',
             'CORY MEEHAN', 'CORY MEEHAN'],
    'category': ['Scifi', 'Action', 'Sports', 'Scifi', 'Family', 'Action',
                 'Documentary', 'Action', 'Sports', 'Scifi', 'Documentary',
                 'Sports']
})

df:

    customer_id                              email  address_id            name     category
0           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi
1           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action
2           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Sports
3           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER        Scifi
4           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Family
5           411  NORMAN.CURRIER@sakilacustomer.org         416  NORMAN CURRIER       Action
6           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary
7           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Action
8           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports
9           527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN        Scifi
10          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN  Documentary
11          527     CORY.MEEHAN@sakilacustomer.org         533     CORY MEEHAN       Sports
