Extract row with maximum value in a group pandas dataframe

A similar question is asked here: Python: Getting the Row which has the max value in groups using groupby

However, I need only one record per group, even if more than one record in that group has the maximum value.

In the example below, I need one record for "s2". It doesn't matter to me which one.

>>> import pandas as pd
>>> df = pd.DataFrame({'Mt':['s1', 's1', 's2','s2','s2','s3'], 'Sp':['a','b','c','d','e','f'], 'Value':[1,2,3,4,5,6], 'count':[3,2,5,10,10,6]})
>>> df
   Mt Sp  Value  count
0  s1  a      1      3
1  s1  b      2      2
2  s2  c      3      5
3  s2  d      4     10
4  s2  e      5     10
5  s3  f      6      6
>>> idx = df.groupby(['Mt'])['count'].transform('max') == df['count']
>>> df[idx]
   Mt Sp  Value  count
0  s1  a      1      3
3  s2  d      4     10
4  s2  e      5     10
5  s3  f      6      6

You can use first():

In [14]: df.groupby('Mt').first()
Out[14]: 
   Sp  Value  count
Mt                 
s1  a      1      3
s2  c      3      5
s3  f      6      6

Update

Set as_index=False to achieve your goal:

In [28]: df.groupby('Mt', as_index=False).first()
Out[28]: 
   Mt Sp  Value  count
0  s1  a      1      3
1  s2  c      3      5
2  s3  f      6      6 

Update Again

Sorry for misunderstanding what you mean. You can sort first if you want the row with the max count in each group. When several rows tie for the maximum, which of them comes first depends on the sort, so the row shown for s2 may be d or e.

In [196]: df.sort_values('count', ascending=False).groupby('Mt', as_index=False).first()
Out[196]: 
   Mt Sp  Value  count
0  s1  a      1      3
1  s2  e      5     10
2  s3  f      6      6
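As a minimal sketch of the sort-then-take-first idea, the snippet below uses a stable sort so that tied rows keep their original relative order, and head(1) instead of first(), since first() returns the first non-null value of each column rather than a whole row. The variable names ordered and top are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Value': [1, 2, 3, 4, 5, 6],
    'count': [3, 2, 5, 10, 10, 6],
})

# A stable sort keeps tied rows in their original relative order,
# so row 3 ('d') stays ahead of row 4 ('e') within group s2.
ordered = df.sort_values('count', ascending=False, kind='mergesort')

# head(1) keeps the first complete row of each group and
# preserves the original index labels.
top = ordered.groupby('Mt').head(1).sort_values('Mt')
print(top)
```

With this data, the tie in s2 resolves to the earlier row, so the result is deterministic.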

To get the first occurrence of the maximum count in each group, you can use the pandas.Series.idxmax() function:

>>> df.loc[df.groupby('Mt')['count'].idxmax()]
   Mt Sp  Value  count
0  s1  a      1      3
3  s2  d      4     10
5  s3  f      6      6
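Note that idxmax() returns index labels, not integer positions. The two coincide for the default 0..n-1 index used above, but not in general. The sketch below assumes a hypothetical non-default index (10 to 15) to show label-based selection with .loc:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
        'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
        'Value': [1, 2, 3, 4, 5, 6],
        'count': [3, 2, 5, 10, 10, 6],
    },
    index=[10, 11, 12, 13, 14, 15],  # hypothetical non-default labels
)

# One label per group: the first row where 'count' reaches its maximum.
best_labels = df.groupby('Mt')['count'].idxmax()

# Labels must be looked up with .loc; .iloc would treat them as positions.
result = df.loc[best_labels]
print(result)
```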

Playing off of Roman Pekar's answer, I found that the following code works:

from math import isnan
# idxmax() returns index labels; skip groups whose result is NaN
df.loc[[int(i) for i in df.groupby(by=df.Mt).apply(lambda g: g['count'].idxmax()).values if not isnan(i)]]

Note the isnan condition: my application had some NaN entries in the column being maximized.
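An alternative sketch for the NaN case, assuming it is acceptable to leave out groups that have no valid value at all: drop the NaN rows before grouping, so idxmax() only ever sees real numbers. The sample data here is made up for illustration.

```python
import pandas as pd

nan = float('nan')
df = pd.DataFrame({
    'Mt': ['s1', 's1', 's2', 's2', 's3'],
    'Sp': ['a', 'b', 'c', 'd', 'e'],
    'count': [3.0, nan, 10.0, 10.0, nan],
})

# Remove rows whose 'count' is NaN; group s3 has no valid rows left.
valid = df.dropna(subset=['count'])

# Index labels are preserved by dropna, so they still address rows of df.
result = df.loc[valid.groupby('Mt')['count'].idxmax()]
print(result)
```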

The answers already given don't clearly show what is by far the fastest option.
Sort by the column whose maximum you want, then drop duplicates (passing the names of the columns that define a duplicate, here the grouping column):

df.sort_values('count', ascending=False).drop_duplicates(['Mt'])

NB: Yes, that answer is already given in a comment on the question, but it is very easy to miss. It can be up to 10 times faster than groupby.
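To check that the sort-and-drop approach picks the same rows as a groupby, here is a small sketch comparing it against groupby plus idxmax() on the example data. A stable sort (kind='mergesort') is assumed so that ties resolve to the earliest row in both cases. Relative speed depends on the data and is best measured on your own frame.

```python
import pandas as pd

df = pd.DataFrame({
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Value': [1, 2, 3, 4, 5, 6],
    'count': [3, 2, 5, 10, 10, 6],
})

# Largest counts first; drop_duplicates keeps the first row seen per 'Mt'.
by_sort = (
    df.sort_values('count', ascending=False, kind='mergesort')
      .drop_duplicates('Mt')
      .sort_index()
)

# Reference result: first row holding the maximum in each group.
by_idxmax = df.loc[df.groupby('Mt')['count'].idxmax()]

print(by_sort.equals(by_idxmax))
```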
