[英]How to get rows in pandas data frame, with maximal values in a column and keep the original index?
I have a pandas data frame. 我有一个熊猫数据框。 In the first column it can have the same value several times (in other words, the values in the first column are not unique). 在第一列中,它可以多次具有相同的值(换句话说,第一列中的值不是唯一的)。
Whenever I have several rows that contain the same value in the first column, I would like to leave only those that have maximal value in the third column. 每当我在第一列中有多个包含相同值的行时,我只想留下第三列中具有最大值的行。 I almost found a solution: 我几乎找到了解决方案:
import pandas
ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})
df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print df
print '--------------------'
print df.groupby('c1').apply(lambda df:df.irow(df['c3'].argmax()))
As a result I get: 结果我得到:
c1 c2 c3
0 a a 1
1 a c 3
2 a b 2
3 b b 10
4 b c 12
5 b a 7
--------------------
c1 c2 c3
c1
a a c 3
b b c 12
My problem is that, I do not want to have c1
as index. 我的问题是,我不想让c1
作为索引。 What I want to have is following: 我想要的是:
c1 c2 c3
1 a c 3
4 b c 12
When calling df.groupby(...).apply(foo)
, the type of object returned by foo
affects the way the results are melded together. 当调用df.groupby(...).apply(foo)
, foo
返回的对象类型会影响结果融合在一起的方式。
If you return a Series, the index of the Series become columns of the final result, and the groupby key becomes the index (a bit of a mind-twister). 如果返回一个Series,则Series的索引将成为最终结果的列,groupby键将成为索引(有点令人费解)。
If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible). 如果您返回一个DataFrame,最终结果使用DataFrame的索引作为索引值,并将DataFrame的列作为列(非常明智)。
So, you can arrange for the type of output you desire by converting your Series into a DataFrame. 因此,您可以通过将Series转换为DataFrame来安排所需的输出类型。
With Pandas 0.13 you can use the to_frame().T
method: 使用Pandas 0.13,您可以使用to_frame().T
方法:
def maxrow(x, col):
return x.loc[x[col].argmax()].to_frame().T
result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)
yields 产量
c1 c2 c3
1 a c 3
4 b c 12
In Pandas 0.12 or older, the equivalent would be: 在Pandas 0.12或更早版本中,相当于:
def maxrow(x, col):
ser = x.loc[x[col].idxmax()]
df = pd.DataFrame({ser.name: ser}).T
return df
By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. 顺便说一句, behzad.nouri的聪明而优雅的解决方案对于小型DataFrame来说比我的快。 The sort
lifts the time complexity from O(n)
to O(n log n)
however, so it becomes slower than the to_frame
solution shown above when applied to larger DataFrames. 然而,该sort
将时间复杂度从O(n)
提升到O(n log n)
,因此当应用于更大的DataFrame时,它变得比上面显示的to_frame
解决方案慢。
Here is how I benchmarked it: 以下是我对它进行基准测试的方法:
import pandas as pd
import numpy as np
import timeit
def reset_df_first(df):
df2 = df.reset_index()
result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
result.set_index(['index'], inplace=True)
return result
def maxrow(x, col):
result = x.loc[x[col].argmax()].to_frame().T
return result
def using_to_frame(df):
result = df.groupby('c1').apply(maxrow, 'c3')
result.reset_index(level=0, drop=True, inplace=True)
return result
def using_sort(df):
return df.sort('c3').groupby('c1', as_index=False).tail(1)
for N in (100, 1000, 2000):
df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})
df = pd.concat([df]*N)
df.reset_index(inplace=True, drop=True)
timing = dict()
for func in (reset_df_first, using_to_frame, using_sort):
timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
'import __main__ as m ',
number=10)
print('For N = {}'.format(N))
for func in sorted(timing, key=timing.get):
print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
print
yields 产量
For N = 100
using_sort : 0.018
using_to_frame : 0.0265
reset_df_first : 0.0303
For N = 1000
using_to_frame : 0.0358 \
using_sort : 0.036 / this is roughly where the two methods cross over in terms of performance
reset_df_first : 0.0432
For N = 2000
using_to_frame : 0.0457
reset_df_first : 0.0523
using_sort : 0.0569
( reset_df_first
was another possibility I tried.) ( reset_df_first
是我尝试的另一种可能性。)
试试这个:
df.sort('c3').groupby('c1', as_index=False).tail(1)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.