Create a column by concatenating name and ranking in pandas
I have this dataset, which has names and counts:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana','Diana','Mia','Eve','Eve'], "Count":[10,3,14,8,5,2]})
df
Id Name Count
0 1 Eve 10
1 2 Diana 3
2 3 Diana 14
3 4 Mia 8
4 5 Eve 5
5 6 Eve 2
I want to create a new column that concatenates the name with its ranking. So first I have to select the non-unique values and sort them:
df_nounique = df[df.duplicated(subset=['Name'], keep=False)]
df_nounique = df_nounique.sort_values(by=['Name','Count'], ascending=False)
df_nounique
Id Name Count
0 1 Eve 10
4 5 Eve 5
5 6 Eve 2
2 3 Diana 14
1 2 Diana 3
OK, now I have to assign the ranking based on the name and count:
df_nounique['rank'] = df_nounique.groupby('Name')['Count'].rank()
df_nounique
Id Name Count rank
0 1 Eve 10 3.0
4 5 Eve 5 2.0
5 6 Eve 2 1.0
2 3 Diana 14 2.0
1 2 Diana 3 1.0
But this is where I am stuck. For the first row the rank should be 1, but I get 3! If I get this right, I can merge and concatenate to obtain this:
Id Name Count New_col
0 1 Eve 10 Eve_1
1 2 Diana 3 Diana_2
2 3 Diana 14 Diana_1
3 4 Mia 8 Mia
4 5 Eve 5 Eve_2
5 6 Eve 2 Eve_3
It seems that I am taking too many steps, so could you please help me at least with my rank problem, and suggest a better approach for my ultimate goal?
Use ascending=False as an argument of rank():
df_nounique['rank'] = df_nounique.groupby('Name')['Count'] \
.rank(ascending=False).astype(int)
>>> df_nounique
Id Name Count rank
0 1 Eve 10 1
4 5 Eve 5 2
5 6 Eve 2 3
2 3 Diana 14 1
1 2 Diana 3 2
Then:
df['New_col'] = (df_nounique['Name'] + '_' + df_nounique['rank'].astype(str)) \
.combine_first(df['Name'])
>>> df
Id Name Count New_col
0 1 Eve 10 Eve_1
1 2 Diana 3 Diana_2
2 3 Diana 14 Diana_1
3 4 Mia 8 Mia
4 5 Eve 5 Eve_2
5 6 Eve 2 Eve_3
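The combine_first step works because pandas aligns on the index: the concatenated labels exist only for the rows in df_nounique, and any index missing from that Series (here row 3, Mia) is taken from df['Name']. A minimal sketch with hand-written Series; the values are copied from the example above, and the variable names labels, names and new_col are illustrative only:

```python
import pandas as pd

# Labels exist only for the duplicated names (index 3, Mia, is absent).
labels = pd.Series({0: 'Eve_1', 4: 'Eve_2', 5: 'Eve_3', 2: 'Diana_1', 1: 'Diana_2'})
names = pd.Series(['Eve', 'Diana', 'Diana', 'Mia', 'Eve', 'Eve'])

# combine_first keeps the values from `labels` and fills the gaps from `names`,
# matching rows by index label rather than by position.
new_col = labels.combine_first(names).sort_index()
print(new_col.tolist())
# ['Eve_1', 'Diana_2', 'Diana_1', 'Mia', 'Eve_2', 'Eve_3']
```

Because the match is by index label, the sort order of df_nounique does not affect which row receives which label.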
We can also create the series directly from df, without needing df_nounique, by:
1. Generating the series from groupby rank (with ascending=False and method='dense' to ensure whole-number steps)
2. Using fillna to fill the missing values with Name
3. Using join to attach it back to the DataFrame (Series.rename is needed to assign the new column name, as join only works with a named Series):
df = df.join(
(df['Name'] + '_' + df[df.duplicated(subset=['Name'], keep=False)]
.groupby('Name')['Count']
.rank(ascending=False, method='dense')
.map('{:.0f}'.format)).fillna(df['Name']).rename('New_col')
)
df:
Id Name Count New_col
0 1 Eve 10 Eve_1
1 2 Diana 3 Diana_2
2 3 Diana 14 Diana_1
3 4 Mia 8 Mia
4 5 Eve 5 Eve_2
5 6 Eve 2 Eve_3
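The choice of method only matters when counts tie within a name, which never happens in the sample data. A small sketch with made-up tied data shows the difference: the default method='average' yields fractional ranks, 'dense' yields whole numbers but gives tied rows the same rank (and therefore the same label), and 'first' breaks ties by order of appearance. The DataFrame tied is hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical data with a tie: two Eve rows share Count 10.
tied = pd.DataFrame({'Name': ['Eve', 'Eve', 'Eve'], 'Count': [10, 10, 2]})
grouped = tied.groupby('Name')['Count']

# Compare tie-breaking methods for a descending rank within each name.
results = {
    method: grouped.rank(ascending=False, method=method).tolist()
    for method in ('average', 'dense', 'first')
}
for method, ranks in results.items():
    print(method, ranks)
# average [1.5, 1.5, 3.0]
# dense [1.0, 1.0, 2.0]
# first [1.0, 2.0, 3.0]
```

If every label has to be unique, 'first' is probably the safer choice; if tied rows should share a label, 'dense' does that.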
Although an answer has already been chosen, I think this code is not bad either. Take a look:
# imports
import pandas as pd
import numpy as np
# build the dataset
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana','Diana','Mia','Eve','Eve'], "Count":[10,3,14,8,5,2]})
print(df)
# rank within each name (1 = highest count); cast to int first so the string is '1', not '1.0'
df['rank'] = df.groupby('Name')['Count'].rank(ascending=False).astype(int).astype(str)
# names that occur only once get no rank
df.loc[~df.duplicated(subset=['Name'], keep=False), 'rank'] = np.nan
# ranked rows get Name_rank; rows without a rank keep the plain name
df['New_col'] = (df['Name'] + '_' + df['rank']).fillna(df['Name'])
print(df)
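The same idea can be packaged as a reusable function. The following is a sketch rather than part of the answer above: the name add_ranked_label and its parameters are hypothetical, and method='first' is an assumption made so that ranks stay integral if counts tie. It ranks every row, then keeps the suffix only for names that occur more than once, so no helper column or filtered copy is needed.

```python
import pandas as pd

def add_ranked_label(df, name_col='Name', count_col='Count', new_col='New_col'):
    """Return a copy of df with name_col + '_' + rank for repeated names."""
    out = df.copy()
    groups = out.groupby(name_col)[count_col]
    # Rank within each name, highest count first; 'first' avoids fractional ranks.
    rank = groups.rank(ascending=False, method='first').astype(int).astype(str)
    # Number of rows sharing each name, aligned to the original rows.
    size = groups.transform('size')
    labelled = out[name_col] + '_' + rank
    # Keep the suffixed label only for names that occur more than once.
    out[new_col] = labelled.where(size > 1, out[name_col])
    return out

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                   'Name': ['Eve', 'Diana', 'Diana', 'Mia', 'Eve', 'Eve'],
                   'Count': [10, 3, 14, 8, 5, 2]})
result = add_ranked_label(df)
print(result['New_col'].tolist())
# ['Eve_1', 'Diana_2', 'Diana_1', 'Mia', 'Eve_2', 'Eve_3']
```

Note that this sketch assumes the count column has no missing values, since astype(int) would fail on NaN ranks.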