Extract row with maximum value in a group pandas dataframe
A similar question is asked here: Python: Getting the Row which has the max value in groups using groupby
However, I need just one record per group, even if more than one record in that group has the maximum value.
In the example below, I need one record for "s2". It doesn't matter to me which one.
>>> from pandas import DataFrame
>>> df = DataFrame({'Sp':['a','b','c','d','e','f'], 'Mt':['s1', 's1', 's2','s2','s2','s3'], 'Value':[1,2,3,4,5,6], 'count':[3,2,5,10,10,6]})
>>> df
Mt Sp Value count
0 s1 a 1 3
1 s1 b 2 2
2 s2 c 3 5
3 s2 d 4 10
4 s2 e 5 10
5 s3 f 6 6
>>> idx = df.groupby(['Mt'])['count'].transform(max) == df['count']
>>> df[idx]
Mt Sp Value count
0 s1 a 1 3
3 s2 d 4 10
4 s2 e 5 10
5 s3 f 6 6
>>>
You can use first:
In [14]: df.groupby('Mt').first()
Out[14]:
Sp Value count
Mt
s1 a 1 3
s2 c 3 5
s3 f 6 6
Set as_index=False to achieve your goal:
In [28]: df.groupby('Mt', as_index=False).first()
Out[28]:
Mt Sp Value count
0 s1 a 1 3
1 s2 c 3 5
2 s3 f 6 6
Sorry for misunderstanding what you mean. If you want the row with the max count in each group, you can sort first. When several rows tie for the maximum, which of them is kept depends on how the sort orders the ties:
In [196]: df.sort_values('count', ascending=False).groupby('Mt', as_index=False).first()
Out[196]:
Mt Sp Value count
0 s1 a 1 3
1 s2 e 5 10
2 s3 f 6 6
To get the first occurrence of the maximum count in each group, you can use the pandas.Series.idxmax() function:
>>> df.loc[df.groupby(['Mt'])['count'].idxmax()]
Mt Sp Value count
0 s1 a 1 3
3 s2 d 4 10
5 s3 f 6 6
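Because idxmax() returns index labels rather than integer positions, label-based selection with .loc is the safer choice once the index is no longer the default 0..n-1 range. Here is a minimal sketch of that case, assuming the question's data with 'Sp' promoted to the index purely for illustration (the names indexed, labels and result are not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Value': [1, 2, 3, 4, 5, 6],
    'count': [3, 2, 5, 10, 10, 6],
})

# Illustrative assumption: use 'Sp' as a non-default, string-valued index.
indexed = df.set_index('Sp')

# For each 'Mt' group, the index label of the first row holding the max 'count'.
labels = indexed.groupby('Mt')['count'].idxmax()

# The labels are strings here, so they must be looked up with .loc, not .iloc.
result = indexed.loc[labels]
print(result)
```

For the tied "s2" group this keeps row "d", since idxmax() reports the first occurrence of the maximum.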
Playing off of Roman Pekar's answer, I found that the following code would work:
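from math import isnan
# Skip groups whose idxmax is NaN, then select the remaining rows.
# Note: iloc matches the labels from idxmax only with the default 0..n-1 index.
df.iloc[[int(x) for x in df.groupby(by=df.Mt).apply(lambda g: g['count'].idxmax()).values if not isnan(x)]]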
Note the isnan condition: my application had some NaN entries in the column being maximized over.
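An alternative way to handle missing values, shown here only as a sketch and not as the answer's own method, is to drop the NaN rows before grouping so that idxmax() never sees them. The small frame below is made-up illustration data; groups whose values are all NaN simply do not appear in the result.

```python
import numpy as np
import pandas as pd

# Made-up data: 's1' has one NaN, 's2' has a tie, 's3' is entirely NaN.
df = pd.DataFrame({
    'Mt': ['s1', 's1', 's2', 's2', 's3'],
    'Sp': ['a', 'b', 'c', 'd', 'e'],
    'count': [3.0, np.nan, 10.0, 10.0, np.nan],
})

# Remove rows with a missing 'count' so every remaining group has a valid maximum.
valid = df.dropna(subset=['count'])

# Label of the first maximum per group, then label-based selection on the original frame.
result = df.loc[valid.groupby('Mt')['count'].idxmax()]
print(result)
```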
The answers already given don't show clearly what's by far the fastest option.
Sort by the column whose maximum you want, then drop duplicates (drop_duplicates takes as its parameter the names of the columns used to identify duplicates):
df.sort_values('count', ascending=False).drop_duplicates(['Mt'])
NB: Yes, that answer is already given in a comment on the question, but it's very easy to miss. And it will be up to 10 times faster than groupby.
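As a quick sanity check of the sort-then-deduplicate approach, the sketch below runs it on the question's data and compares the kept counts with the per-group maximum from groupby. The names result, group_max and kept are illustrative only, and no timing is measured here.

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Value': [1, 2, 3, 4, 5, 6],
    'count': [3, 2, 5, 10, 10, 6],
})

# Largest 'count' first; drop_duplicates then keeps the first row seen for each 'Mt'.
result = df.sort_values('count', ascending=False).drop_duplicates(['Mt'])

# Reference values: the maximum 'count' of each group, computed with groupby.
group_max = df.groupby('Mt')['count'].max().to_dict()

# Counts of the rows that were kept, keyed by group.
kept = result.set_index('Mt')['count'].to_dict()

print(result)
print(kept == group_max)
```

Which of the two tied "s2" rows survives depends on how the sort orders equal values, which matches the question's requirement that either one is acceptable.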