pandas: Add new "rank" columns for every column

I have a df like so (the actual df has 4.5 million rows and 23 cols):

group  feature  col1  col2  col3
g1     f1       1     10    100
g1     f1       11    9     1000
g1     f2       0     8     200
g2     f1       2     7     330
g2     f2       3     7     331
g2     f3       1     7     100
g3     f1       1     6     101
g3     f1       5     9     100
g3     f1       1     8     100
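For anyone who wants to reproduce this, the sample above can be built roughly as follows. This is only a sketch of the nine sample rows; the real frame would be loaded from wherever the data lives.

```python
import pandas as pd

# Sample data copied from the table above (9 rows, 5 columns)
df = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "feature": ["f1", "f1", "f2", "f1", "f2", "f3", "f1", "f1", "f1"],
    "col1":    [1, 11, 0, 2, 3, 1, 1, 5, 1],
    "col2":    [10, 9, 8, 7, 7, 7, 6, 9, 8],
    "col3":    [100, 1000, 200, 330, 331, 100, 101, 100, 100],
})
```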

I want to add two new "rank" cols for each col in my df. Different cols will be evaluated differently (sum, mean, max, etc.). For ease of explanation, I've broken the problem out into two separate problems below.

I have been advised here to use .loc and not use groupby, but any solution that works is fine. I've tried both and had little success (see here).

The first rank col will rank each feature on the values in col1, col2, and col3 within each group.

At an intermediate stage it would look something like this:

group  feature  col1  col1_sum  col1_rank  col2  col2_avg  col2_rank  col3 col3_max  col3_rank
g1     f1       1     12        1          10    9.5       1          100  1000      1
g1     f1       11                         9                          1000           
g1     f2       0     0         2          8     8         2          200  200       2
g2     f1       2     2         2          7     7         1          330  330       2
g2     f2       3     3         1          7     7         1          331  331       1
g2     f3       1     1         3          7     7         1          100  100       3
g3     f1       1     7         1          6     7.67      1          101  101       1
g3     f1       5                          9                          100            
g3     f1       1                          8                          100            

It will output this:

group  feature  col1_rank  col2_rank  col3_rank
g1     f1       1          1          1
g1     f2       2          2          2
g2     f1       2          1          2
g2     f2       1          1          1
g2     f3       3          1          3
g3     f1       1          1          1

The second rank col will, for each feature, rank each group against all other groups on the values in col1, col2, and col3.

At an intermediate stage it would look something like this:

group  feature  col1  col1_sum  col1_rank  col2  col2_avg  col2_rank  col3 col3_max  col3_rank
g1     f1       1     12        1          10    9.5       1          100  1000      1
g1     f1       11                         9                          1000           
g2     f1       2     2         3          7     7         3          330  330       2
g3     f1       1     7         2          6     7.67      2          101  101       3
g3     f1       5                          9                          100            
g3     f1       1                          8                          100            

g1     f2       0     0         2          8     8         1          200  200       2
g2     f2       3     3         1          7     7         2          331  331       1

g2     f3       1     1         1          7     7         1          100  100       1

It will output this:

group  feature  col1_rank  col2_rank  col3_rank
g1     f1       1          1          1
g2     f1       3          3          2
g3     f1       2          2          3
g1     f2       2          1          2
g2     f2       1          2          1
g2     f3       1          1          1

I would use groupby on ['group', 'feature'] to produce an intermediary dataframe containing the sum, avg and max columns (not the ranks), and then groupby again on group only to produce the ranks.

Intermediary dataframe:

import pandas as pd

# Aggregate per (group, feature): sum of col1, mean of col2, max of col3
df2 = pd.concat([
    df.iloc[:,[0,1,2]].groupby(['group', 'feature']).sum(),
    df.iloc[:,[0,1,3]].groupby(['group', 'feature']).mean(),
    df.iloc[:,[0,1,4]].groupby(['group', 'feature']).max()
    ], axis=1)
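With 23 columns, selecting by position gets tedious. A possible alternative (not what the code above does, just a sketch that should give the same df2 on the sample data) is to pass a column-to-function mapping to agg. The mapping here is an assumption based on the question's sum/mean/max example.

```python
import pandas as pd

df = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "feature": ["f1", "f1", "f2", "f1", "f2", "f3", "f1", "f1", "f1"],
    "col1":    [1, 11, 0, 2, 3, 1, 1, 5, 1],
    "col2":    [10, 9, 8, 7, 7, 7, 6, 9, 8],
    "col3":    [100, 1000, 200, 330, 331, 100, 101, 100, 100],
})

# One aggregation per column, chosen by name instead of by position
how = {"col1": "sum", "col2": "mean", "col3": "max"}
df2 = df.groupby(["group", "feature"]).agg(how)
```

Extending this to more columns only means adding entries to the how dict.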

The intermediary dataframe is:

               col1      col2  col3
group feature                      
g1    f1         12  9.500000  1000
      f2          0  8.000000   200
g2    f1          2  7.000000   330
      f2          3  7.000000   331
      f3          1  7.000000   100
g3    f1          7  7.666667   101

Now for the final dataframe:

df3 = df2.groupby('group').rank(method='min', ascending=False).reset_index()

which finally gives:

  group feature  col1  col2  col3
0    g1      f1   1.0   1.0   1.0
1    g1      f2   2.0   2.0   2.0
2    g2      f1   2.0   1.0   2.0
3    g2      f2   1.0   1.0   1.0
4    g2      f3   3.0   1.0   3.0
5    g3      f1   1.0   1.0   1.0
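Since the question asks for the ranks as new columns, one possible way to attach them to the original rows is to suffix the rank columns and merge on the two keys. This is a sketch on the sample data only; the names agg, ranks and out are illustrative, and it builds the aggregate with a dict-based agg rather than the concat shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "feature": ["f1", "f1", "f2", "f1", "f2", "f3", "f1", "f1", "f1"],
    "col1":    [1, 11, 0, 2, 3, 1, 1, 5, 1],
    "col2":    [10, 9, 8, 7, 7, 7, 6, 9, 8],
    "col3":    [100, 1000, 200, 330, 331, 100, 101, 100, 100],
})

# Aggregate per (group, feature), as in the intermediary dataframe
agg = df.groupby(["group", "feature"]).agg(
    {"col1": "sum", "col2": "mean", "col3": "max"}
)

# Rank features within each group; ties share the lowest rank (method='min')
ranks = (
    agg.groupby(level="group")
       .rank(method="min", ascending=False)
       .astype(int)              # safe here because there are no NaN ranks
       .add_suffix("_rank")
)

# Every original row picks up the ranks of its (group, feature) pair
out = df.merge(ranks.reset_index(), on=["group", "feature"], how="left")
```

Rows that share a (group, feature) pair get the same rank values, so the 4.5 million rows stay as they are and only gain columns.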

For the second part of the question, I would just change the indexing of the intermediary dataframe, and compute ranks after grouping on 'feature':

# Re-index by (feature, group), then rank the groups within each feature
df4 = (df2.reset_index().set_index(['feature', 'group']).sort_index()
          .groupby('feature').rank(method='min', ascending=False)
          .reset_index())

which gives:

  feature group  col1  col2  col3
0      f1    g1   1.0   1.0   1.0
1      f1    g2   3.0   3.0   2.0
2      f1    g3   2.0   2.0   3.0
3      f2    g1   2.0   1.0   2.0
4      f2    g2   1.0   2.0   1.0
5      f3    g2   1.0   1.0   1.0
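If the row order of the result does not matter, the re-indexing step can probably be skipped by grouping directly on the 'feature' index level. A minimal sketch on the sample data, where agg stands in for the intermediary dataframe and within_feature is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    "group":   ["g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3"],
    "feature": ["f1", "f1", "f2", "f1", "f2", "f3", "f1", "f1", "f1"],
    "col1":    [1, 11, 0, 2, 3, 1, 1, 5, 1],
    "col2":    [10, 9, 8, 7, 7, 7, 6, 9, 8],
    "col3":    [100, 1000, 200, 330, 331, 100, 101, 100, 100],
})

# Intermediary dataframe indexed by (group, feature)
agg = df.groupby(["group", "feature"]).agg(
    {"col1": "sum", "col2": "mean", "col3": "max"}
)

# Rank groups against each other within every feature;
# the (group, feature) index of agg is kept unchanged
within_feature = agg.groupby(level="feature").rank(method="min", ascending=False)
```

The values are the same as in the table above, only ordered by group first and feature second.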
