pandas: Add new "rank" columns for every column
I have a df like so (actual df has 4.5 mil rows, 23 cols):
group feature col1 col2 col3
g1 f1 1 10 100
g1 f1 11 9 1000
g1 f2 0 8 200
g2 f1 2 7 330
g2 f2 3 7 331
g2 f3 1 7 100
g3 f1 1 6 101
g3 f1 5 9 100
g3 f1 1 8 100
I want to add two new "rank" cols for each col in my df. Different cols will be evaluated differently (sum, mean, max, etc.). For ease of explanation I've broken the problem out into two separate problems below.
I have been advised to use .loc and not use groupby, but any solution that works is fine. I've tried both and had little success.
The first rank col will rank each feature on the values in col1, col2, and col3 within each group.
At an intermediate stage it would look something like this:
group  feature  col1  col1_sum  col1_rank  col2  col2_avg  col2_rank  col3  col3_max  col3_rank
g1     f1       1     12        1          10    9.5       1          100   1000      1
g1     f1       11                         9                          1000
g1     f2       0     0         2          8     8         2          200   200       2
g2     f1       2     2         2          7     7         1          330   330       2
g2     f2       3     3         1          7     7         1          331   331       1
g2     f3       1     1         3          7     7         1          100   100       3
g3     f1       1     7         1          6     7.67      1          101   101       1
g3     f1       5                          9                          100
g3     f1       1                          8                          100
It will output this:
group feature col1_rank col2_rank col3_rank
g1 f1 1 1 1
g1 f2 2 2 2
g2 f1 2 1 2
g2 f2 1 1 1
g2 f3 3 1 3
g3 f1 1 1 1
The second rank col will rank each group, feature by feature, on the values in col1, col2, and col3 against all other groups.
At an intermediate stage it would look something like this:
group  feature  col1  col1_sum  col1_rank  col2  col2_avg  col2_rank  col3  col3_max  col3_rank
g1     f1       1     12        1          10    9.5       1          100   1000      1
g1     f1       11                         9                          1000
g2     f1       2     2         3          7     7         3          330   330       2
g3     f1       1     7         2          6     7.67      2          101   101       3
g3     f1       5                          9                          100
g3     f1       1                          8                          100
g1     f2       0     0         2          8     8         1          200   200       2
g2     f2       3     3         1          7     7         2          331   331       1
g2     f3       1     1         1          7     7         1          100   100       1
It will output this:
group feature col1_rank col2_rank col3_rank
g1 f1 1 1 1
g2 f1 3 3 2
g3 f1 2 2 3
g1 f2 2 1 2
g2 f2 1 2 1
g2 f3 1 1 1
I would use groupby on ['group', 'feature'] to produce an intermediary dataframe containing the sum, avg and max columns (not the ranks), and then groupby again on group only to produce the ranks.
Intermediary dataframe:
df2 = pd.concat([
df.iloc[:,[0,1,2]].groupby(['group', 'feature']).sum(),
df.iloc[:,[0,1,3]].groupby(['group', 'feature']).mean(),
df.iloc[:,[0,1,4]].groupby(['group', 'feature']).max()
], axis=1)
The intermediary dataframe is:
col1 col2 col3
group feature
g1 f1 12 9.500000 1000
f2 0 8.000000 200
g2 f1 2 7.000000 330
f2 3 7.000000 331
f3 1 7.000000 100
g3 f1 7 7.666667 101
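With 23 columns, spelling out one positional iloc slice per column gets unwieldy. As a sketch of an alternative (not the approach used above), the same intermediary frame can be built with a single agg call driven by a column-to-function mapping. The name agg_spec is only illustrative, and the choice of function per column is assumed from the question's example:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "feature": ["f1", "f1", "f2", "f1", "f2", "f3", "f1", "f1", "f1"],
    "col1":    [1, 11, 0, 2, 3, 1, 1, 5, 1],
    "col2":    [10, 9, 8, 7, 7, 7, 6, 9, 8],
    "col3":    [100, 1000, 200, 330, 331, 100, 101, 100, 100],
})

# One aggregation per column (assumed mapping; extend it for the remaining columns).
agg_spec = {"col1": "sum", "col2": "mean", "col3": "max"}

# Rows are indexed by (group, feature); columns keep their original names.
df2 = df.groupby(["group", "feature"]).agg(agg_spec)
print(df2)
```

Selecting columns by name rather than by position also keeps the code valid if the column order of df changes.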
Now for the final dataframe:
df3 = df2.groupby('group').rank(method='min', ascending=False).reset_index()
which finally gives:
group feature col1 col2 col3
0 g1 f1 1.0 1.0 1.0
1 g1 f2 2.0 2.0 2.0
2 g2 f1 2.0 1.0 2.0
3 g2 f2 1.0 1.0 1.0
4 g2 f3 3.0 1.0 3.0
5 g3 f1 1.0 1.0 1.0
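The result above keeps the original column names and float ranks, whereas the question asks for integer col1_rank, col2_rank and col3_rank columns. A minimal end-to-end sketch of that last step follows. It assumes the aggregated values contain no NaN (otherwise astype(int) would fail), and the variable name ranks is only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "feature": ["f1", "f1", "f2", "f1", "f2", "f3", "f1", "f1", "f1"],
    "col1":    [1, 11, 0, 2, 3, 1, 1, 5, 1],
    "col2":    [10, 9, 8, 7, 7, 7, 6, 9, 8],
    "col3":    [100, 1000, 200, 330, 331, 100, 101, 100, 100],
})

# Intermediary frame: one row per (group, feature).
df2 = df.groupby(["group", "feature"]).agg(
    {"col1": "sum", "col2": "mean", "col3": "max"}
)

# Rank the features inside each group; the largest value gets rank 1.
ranks = (
    df2.groupby(level="group")
       .rank(method="min", ascending=False)
       .astype(int)            # assumes no NaN in the aggregated values
       .add_suffix("_rank")    # col1 -> col1_rank, and so on
       .reset_index()
)
print(ranks)
```

With method='min', tied values share the lowest rank of the tie, which is why all three features of g2 get rank 1 on col2.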
For the second part of the question, I would just change the indexing of the intermediary dataframe, and compute ranks after grouping on 'feature':
df4 = df2.reset_index().set_index(['feature', 'group']
    ).sort_index().groupby('feature').rank(
    method='min', ascending=False
    ).reset_index()
which gives:
feature group col1 col2 col3
0 f1 g1 1.0 1.0 1.0
1 f1 g2 3.0 3.0 2.0
2 f1 g3 2.0 2.0 3.0
3 f2 g1 2.0 1.0 2.0
4 f2 g2 1.0 2.0 1.0
5 f3 g2 1.0 1.0 1.0
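Re-indexing before the second ranking is not strictly required: groupby accepts an index level name, so the intermediary frame can be grouped on 'feature' directly and sorted afterwards. The sketch below is one possible variant under the same assumptions as before (no NaN in the aggregated values; by_feature is an illustrative name):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "feature": ["f1", "f1", "f2", "f1", "f2", "f3", "f1", "f1", "f1"],
    "col1":    [1, 11, 0, 2, 3, 1, 1, 5, 1],
    "col2":    [10, 9, 8, 7, 7, 7, 6, 9, 8],
    "col3":    [100, 1000, 200, 330, 331, 100, 101, 100, 100],
})

# Intermediary frame: one row per (group, feature).
df2 = df.groupby(["group", "feature"]).agg(
    {"col1": "sum", "col2": "mean", "col3": "max"}
)

# Rank each group against the other groups that share the same feature.
by_feature = (
    df2.groupby(level="feature")
       .rank(method="min", ascending=False)
       .astype(int)            # assumes no NaN in the aggregated values
       .add_suffix("_rank")
       .reset_index()
       .sort_values(["feature", "group"], ignore_index=True)
)
print(by_feature)
```

The final sort only affects presentation; the ranks themselves are computed per feature regardless of row order.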