
Python pandas: keep row with highest column value

Say I have a dataframe of test scores of students, where each student studies different subjects. Each student can take the test for each subject multiple times, and only the highest score (out of 100) will be retained. For instance, say I have a dataframe of all test records:

| student_name | subject | test_number | score | 
|--------------|---------|-------------|-------|
| sarah        | maths   | test1       | 78    |
| sarah        | maths   | test2       | 71    |
| sarah        | maths   | test3       | 83    |
| sarah        | physics | test1       | 91    |
| sarah        | physics | test2       | 97    |
| sarah        | history | test1       | 83    |
| sarah        | history | test2       | 87    |
| joan         | maths   | test1       | 83    |
| joan         | maths   | test2       | 88    |

(1) How do I keep only the test records (rows) with the maximum score for each student and subject? That is,

| student_name | subject | test_number | score |
|--------------|---------|-------------|-------|
| sarah        | maths   | test3       | 83    |
| sarah        | physics | test2       | 97    |
| sarah        | history | test2       | 87    |
| joan         | maths   | test2       | 88    |

(2) How would I keep the average of all tests taken for the same subject, for the same student? That is:

| student_name | subject | test_number | ave_score |
|--------------|---------|-------------|-----------|
| sarah        | maths   | na          | 77.333    |
| sarah        | physics | na          | 94        |
| sarah        | history | na          | 85        |
| joan         | maths   | na          | 85.5      |

I've tried various combinations of df.sort_values() and df.drop_duplicates(subset=..., keep=...), to no avail.

Actual Data

| query | target   | pct-similarity | p-val | aln_length | bit-score |
|-------|----------|----------------|-------|------------|-----------|
| EV239 | B/Fw6/623 | 99.23         | 0.966 |  832       | 356       |
| EV239 | B/Fw6/623 | 97.34         | 0.982 |  1022      | 739       |
| EV239 | MMS-alpha | 92.23         | 0.997 |  838       | 384       |
| EV239 | MMS-alpha | 93.49         | 0.993 |  1402      | 829       |
| EV380 | B/Fw6/623 | 94.32         | 0.951 |  324       | 423       |
| EV380 | B/Fw6/623 | 95.27         | 0.932 |  1245      | 938       |
| EV380 | MMS-alpha | 99.23         | 0.927 |  723       | 522       |
| EV380 | MMS-alpha | 99.15         | 0.903 |  948       | 1092      |

After the aggregation function is applied, only the column pct-similarity will be of interest.

(1) Drop duplicate query+target rows by keeping the row with the maximum aln_length. Retain the pct-similarity value that belongs to that row.

(2) Aggregate duplicate query+target rows by choosing the row with maximum aln_length, and computing the average pct-similarity for that set of duplicate rows. The other numerical columns aren't necessary and will be dropped eventually, so I really don't care what aggregation function (max or mean) is applied to them.

Just apply max() to each student/subject group:

df.groupby(["student_name","subject"], as_index=False).max()


    student_name    subject         test_number     score
0   joan            maths           test2           88
1   sarah           history         test2           87
2   sarah           maths           test3           83
3   sarah           physics         test2           97
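
One thing worth keeping in mind: max() is computed for each column independently, so the test_number shown above is the largest test_number in the group, which only happens to coincide with the highest-scoring test in this data. A minimal sketch of selecting whole rows instead, using idxmax on score (the names best_idx and best_rows are made up for illustration):

```python
import pandas as pd

# Test records copied from the question.
df = pd.DataFrame({
    "student_name": ["sarah"] * 7 + ["joan"] * 2,
    "subject": ["maths"] * 3 + ["physics"] * 2 + ["history"] * 2 + ["maths"] * 2,
    "test_number": ["test1", "test2", "test3", "test1", "test2",
                    "test1", "test2", "test1", "test2"],
    "score": [78, 71, 83, 91, 97, 83, 87, 83, 88],
})

# Index label of the highest-scoring row in each student/subject group.
best_idx = df.groupby(["student_name", "subject"])["score"].idxmax()

# Select those rows whole, so test_number and score come from the same record.
best_rows = df.loc[best_idx].reset_index(drop=True)
print(best_rows)
```

If two tests in a group tie for the top score, idxmax keeps the first one it encounters.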

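For the average, use mean() instead. Passing numeric_only=True skips the non-numeric test_number column, which pandas 2.0 and later would otherwise refuse to average:

df.groupby(["student_name","subject"], as_index=False).mean(numeric_only=True)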

    student_name    subject     score
0   joan            maths       85.500000
1   sarah           history     85.000000
2   sarah           maths       77.333333
3   sarah           physics     94.000000
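
If the averaged column should be called ave_score, as in the question, named aggregation is one way to get there. This is only a sketch under the assumption that the dataframe looks like the one in the question; the variable name ave is made up for illustration.

```python
import pandas as pd

# Test records copied from the question.
df = pd.DataFrame({
    "student_name": ["sarah"] * 7 + ["joan"] * 2,
    "subject": ["maths"] * 3 + ["physics"] * 2 + ["history"] * 2 + ["maths"] * 2,
    "test_number": ["test1", "test2", "test3", "test1", "test2",
                    "test1", "test2", "test1", "test2"],
    "score": [78, 71, 83, 91, 97, 83, 87, 83, 88],
})

# Named aggregation: output column name on the left, (input column, function) on the right.
ave = (
    df.groupby(["student_name", "subject"], as_index=False)
      .agg(ave_score=("score", "mean"))
)
print(ave)
```

Because only score is aggregated, the non-numeric test_number column never enters the computation.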

Most likely describe can help here, since it reports both the mean and the max per group:

df.groupby(["student_name","subject"]).score.describe()
Out[15]: 
                          count       mean       std   min    25%   50%  \
student_name   subject                                                    
 joan           maths       2.0  85.500000  3.535534  83.0  84.25  85.5   
 sarah          history     2.0  85.000000  2.828427  83.0  84.00  85.0   
                maths       3.0  77.333333  6.027714  71.0  74.50  78.0   
                physics     2.0  94.000000  4.242641  91.0  92.50  94.0   
                            75%   max  
student_name   subject                 
 joan           maths     86.75  88.0  
 sarah          history   86.00  87.0  
                maths     80.50  83.0  
                physics   95.50  97.0  

And with drop_duplicates:

df.sort_values('score').drop_duplicates(["student_name","subject"],keep='last')
Out[22]: 
     student_name    subject    test_number  score
2   sarah           maths      test3            83
6   sarah           history    test2            87
8   joan            maths      test2            88
4   sarah           physics    test2            97

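For the mean value, with reindex to restore the original columns (numeric_only=True keeps the string column test_number out of the average):

df.groupby(["student_name","subject"], as_index=False).mean(numeric_only=True).reindex(columns=df.columns)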
Out[24]: 
     student_name    subject  test_number      score
0   joan            maths             NaN  85.500000
1   sarah           history           NaN  85.000000
2   sarah           maths             NaN  77.333333
3   sarah           physics           NaN  94.000000

We can use agg on a groupby to get 'idxmax' and 'mean'.
With that we can perform an inner join to get both the correct rows and means.

df.join(
    df.groupby(['student_name', 'subject'])
      .score.agg(['idxmax', 'mean']).set_index('idxmax'),
    how='inner'
)

  student_name  subject test_number  score       mean
2        sarah    maths       test3     83  77.333333
4        sarah  physics       test2     97  94.000000
6        sarah  history       test2     87  85.000000
8         joan    maths       test2     88  85.500000
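
The same idea could be carried over to the "Actual Data" table from the question. The following is a rough sketch, not a tested solution for the real dataset: it assumes the column names shown in the question, leaves out p-val and bit-score since the question says they do not matter, and the names hits, longest, mean_sim and mean_pct_similarity are made up for illustration.

```python
import pandas as pd

# Rows copied from the "Actual Data" table (p-val and bit-score omitted).
hits = pd.DataFrame({
    "query": ["EV239"] * 4 + ["EV380"] * 4,
    "target": ["B/Fw6/623", "B/Fw6/623", "MMS-alpha", "MMS-alpha"] * 2,
    "pct-similarity": [99.23, 97.34, 92.23, 93.49, 94.32, 95.27, 99.23, 99.15],
    "aln_length": [832, 1022, 838, 1402, 324, 1245, 723, 948],
})

keys = ["query", "target"]

# (1) Keep the row with the largest aln_length for each query/target pair.
longest = hits.loc[hits.groupby(keys)["aln_length"].idxmax()]

# (2) Mean pct-similarity over each set of duplicate rows.
mean_sim = (
    hits.groupby(keys)["pct-similarity"].mean()
        .rename("mean_pct_similarity")
        .reset_index()
)

# Attach the group mean to the row kept in step (1).
result = longest.merge(mean_sim, on=keys)
print(result)
```

Here pct-similarity holds the value from the longest alignment, as asked in (1), while mean_pct_similarity holds the group average asked for in (2).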
