Create a column by concatenating name and ranking in pandas

I have this dataset, which has names and counts:

df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana','Diana','Mia','Eve','Eve'], "Count":[10,3,14,8,5,2]})
df

    Id  Name    Count
0   1   Eve     10
1   2   Diana   3
2   3   Diana   14
3   4   Mia     8
4   5   Eve     5
5   6   Eve     2

I want to create a new column that concatenates the name and its ranking. So first I have to select the non-unique names and order them:

df_nounique = df[df.duplicated(subset=['Name'], keep=False)]
df_nounique = df_nounique.sort_values(by=['Name','Count'], ascending=False)
df_nounique
    Id  Name    Count
0   1   Eve    10
4   5   Eve    5
5   6   Eve    2
2   3   Diana  14
1   2   Diana  3

OK, now I have to assign the ranking based on the name and count:

df_nounique['rank'] = df_nounique.groupby('Name')['Count'].rank()
df_nounique
    Id  Name    Count   rank
0   1   Eve     10      3.0
4   5   Eve     5       2.0
5   6   Eve     2       1.0
2   3   Diana   14      2.0
1   2   Diana   3       1.0

But this is where I am stuck. For the first row the rank should be 1, but I get 3! If I get this right, I can merge and concatenate to obtain this:

    Id  Name    Count   New_col
0   1   Eve     10      Eve_1
1   2   Diana   3       Diana_2
2   3   Diana   14      Diana_1
3   4   Mia     8       Mia
4   5   Eve     5       Eve_2
5   6   Eve     2       Eve_3

It seems that I am taking too many steps. Could you help me at least with my rank problem, and suggest a better approach for my ultimate goal?

rank() sorts in ascending order by default, so the smallest Count gets rank 1. Pass ascending=False to rank():

df_nounique['rank'] = df_nounique.groupby('Name')['Count'] \
                                 .rank(ascending=False).astype(int)
>>> df_nounique
   Id   Name  Count  rank
0   1    Eve     10     1
4   5    Eve      5     2
5   6    Eve      2     3
2   3  Diana     14     1
1   2  Diana      3     2

Then:

df['New_col'] = (df_nounique['Name'] + '_' + df_nounique['rank'].astype(str)) \
                    .combine_first(df['Name'])
>>> df
   Id   Name  Count  New_col
0   1    Eve     10    Eve_1
1   2  Diana      3  Diana_2
2   3  Diana     14  Diana_1
3   4    Mia      8      Mia
4   5    Eve      5    Eve_2
5   6    Eve      2    Eve_3
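
For reference, the steps above can be put together as one self-contained sketch. The only thing not in the original snippets is the .copy() call, added here on the assumption that you want to avoid a SettingWithCopyWarning when assigning the rank column to a filtered frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['Eve', 'Diana', 'Diana', 'Mia', 'Eve', 'Eve'],
    'Count': [10, 3, 14, 8, 5, 2],
})

# Keep only names that occur more than once.
# .copy() is an addition (assumption): it avoids SettingWithCopyWarning below.
df_nounique = df[df.duplicated(subset=['Name'], keep=False)].copy()

# rank() defaults to ascending=True, so the largest Count would get the
# largest rank; ascending=False makes the largest Count rank 1.
df_nounique['rank'] = (
    df_nounique.groupby('Name')['Count'].rank(ascending=False).astype(int)
)

# combine_first aligns on the index: rows that are missing from df_nounique
# (the unique names) fall back to the plain name from df.
df['New_col'] = (
    df_nounique['Name'] + '_' + df_nounique['rank'].astype(str)
).combine_first(df['Name'])

print(df)
```

Note that the default rank method is 'average', so tied counts within a name would produce fractional ranks that astype(int) truncates.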

We can also create the Series directly from df, without needing df_nounique, by:

  1. Generating the Series from groupby rank (with ascending=False and method='dense' to ensure whole-number steps)
  2. Using fillna to fill the missing values with Name
  3. Joining back to the DataFrame (Series.rename is needed to assign the new column name, as join only works with a named Series):
df = df.join(
    (df['Name'] + '_' + df[df.duplicated(subset=['Name'], keep=False)]
     .groupby('Name')['Count']
     .rank(ascending=False, method='dense')
     .map('{:.0f}'.format)).fillna(df['Name']).rename('New_col')
)

df:

   Id   Name  Count  New_col
0   1    Eve     10    Eve_1
1   2  Diana      3  Diana_2
2   3  Diana     14  Diana_1
3   4    Mia      8      Mia
4   5    Eve      5    Eve_2
5   6    Eve      2    Eve_3
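
To illustrate why method='dense' matters here, this is a small sketch of how the rank methods differ when counts are tied. The Series s is hypothetical data, not part of the question:

```python
import pandas as pd

# Hypothetical counts containing a tie (not from the question's dataset)
s = pd.Series([10, 10, 5])

# Default method='average': tied values share the mean of their positions
avg = s.rank(ascending=False).tolist()

# method='dense': tied values share a rank and the next rank is not skipped
dense = s.rank(ascending=False, method='dense').tolist()

# method='first': ties are broken by order of appearance
first = s.rank(ascending=False, method='first').tolist()

print(avg)    # [1.5, 1.5, 3.0]
print(dense)  # [1.0, 1.0, 2.0]
print(first)  # [1.0, 2.0, 3.0]
```

With 'dense', tied rows receive the same suffix (for example two Eve_1 labels); if every label must be distinct, method='first' may be the better fit.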

Although an answer has already been chosen, I think this code is not bad either. Take a look:

# module

import pandas as pd
import numpy as np

# make a dataset

df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana','Diana','Mia','Eve','Eve'], "Count":[10,3,14,8,5,2]})
print(df)


# rank and make new column

# rank within each name, highest count first; cast to int before str to avoid labels like 'Eve_1.0'
df['rank'] = df.groupby('Name')['Count'].rank(ascending=False).astype(int).astype(str)
# blank out the rank when the name is unique
df.loc[~df.duplicated(subset=['Name'], keep=False), 'rank'] = np.nan
# name + '_' + rank; NaN ranks give NaN, which falls back to the bare name
df['New_col'] = (df['Name'] + '_' + df['rank']).fillna(df['Name'])
print(df)
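
The same mask idea can be written more compactly with Series.where, which avoids the helper rank column on df. This is a sketch of a variant, not the code posted above; the names rank and is_dup are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['Eve', 'Diana', 'Diana', 'Mia', 'Eve', 'Eve'],
    'Count': [10, 3, 14, 8, 5, 2],
})

# Descending dense rank of Count within each Name, as whole numbers
rank = (
    df.groupby('Name')['Count']
      .rank(ascending=False, method='dense')
      .astype(int)
)

# True for rows whose Name occurs more than once
is_dup = df.duplicated(subset=['Name'], keep=False)

# Keep "Name_rank" where the name is repeated, otherwise the bare name
df['New_col'] = (df['Name'] + '_' + rank.astype(str)).where(is_dup, df['Name'])

print(df)
```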
