
Python: How to assign ranks to categorical variables within a group in Python

Given a dataset containing only the first two columns, how do I create another column in Python that holds the rank of each range, computed separately within each group (id)? My desired output looks like this:

id  range  rank
1   10-20  2
1   20-30  3
1   5-10   1
2   20-30  2
2   10-20  1
2
3   10-20  2
3   5-10   1
3   20-30  3
3   30+    4

NOTE: These four ranges [5-10, 10-20, 20-30, 30+] are the only ones that can occur, so an id can have at most four of them. A range can also be blank. For example, in the reproducible example above, id 2 has the two ranges 10-20 and 20-30, so 10-20 gets rank 1 and 20-30 gets rank 2. I have seen that df.groupby can be used, but I cannot figure out how to apply it in this case.

Convert your range column to an ordered category dtype before applying rank:

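import pandas as pd

# Order the categories from the smallest range to the largest
df['range'] = df['range'].astype(pd.CategoricalDtype(
                  ['5-10', '10-20', '20-30', '30+'], ordered=True))

# Rank within each id; transform keeps the result aligned with df's index
df['rank'] = df.groupby('id')['range'].transform(lambda x: x.rank())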

>>> df
   id  range  rank
0   1  10-20   2.0
1   1  20-30   3.0
2   1   5-10   1.0
3   2  20-30   2.0
4   2  10-20   1.0
5   2    NaN   NaN
6   3  10-20   2.0
7   3   5-10   1.0
8   3  20-30   3.0
9   3    30+   4.0
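
For readers who want something to run end to end, here is a minimal sketch of the same idea using a plain lookup table instead of a categorical dtype. The DataFrame is rebuilt from the table in the question, and the names `order` and `position` are illustrative assumptions rather than part of the original answer. The final cast to pandas' nullable `Int64` dtype is optional; it is only there so the ranks appear as integers, as in the desired output.

```python
import pandas as pd

# Sample data copied from the question; None marks the blank range for id 2
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "range": ["10-20", "20-30", "5-10", "20-30", "10-20", None,
              "10-20", "5-10", "20-30", "30+"],
})

# Assumed ordering of the four possible ranges, smallest first
order = {"5-10": 0, "10-20": 1, "20-30": 2, "30+": 3}

# Map each label to its position; blanks are not in the dict and become NaN
position = df["range"].map(order)

# Rank the positions within each id; NaN rows keep a missing rank
df["rank"] = position.groupby(df["id"]).rank().astype("Int64")

print(df)
```

If the same range can occur more than once within an id, the default ranking averages tied ranks, which would break the integer cast; passing `method="dense"` to `rank` is one way to give ties the same whole-number rank.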
