How do I get the rows of a pandas DataFrame with the maximum value in a column while keeping the original index?

Question

I have a pandas DataFrame. The values in its first column are not unique: the same value can appear several times.

Whenever several rows share the same value in the first column, I would like to keep only the row with the maximal value in the third column. I almost found a solution:

import pandas

ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})

df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print(df)
print('--------------------')
# irow() was removed from pandas; select the row by the label that idxmax() returns
print(df.groupby('c1').apply(lambda g: g.loc[g['c3'].idxmax()]))

As a result I get:

  c1 c2  c3
0  a  a   1
1  a  c   3
2  a  b   2
3  b  b  10
4  b  c  12
5  b  a   7
--------------------
   c1 c2  c3
c1          
a   a  c   3
b   b  c  12

My problem is that I do not want c1 as the index. What I want is the following:

  c1 c2  c3
1  a  c   3
4  b  c  12

Answer 1

When calling df.groupby(...).apply(foo), the type of object returned by foo affects how the results are combined.

If foo returns a Series, the index of the Series becomes the columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).

If instead foo returns a DataFrame, the final result uses the index of that DataFrame as index values (with the groupby key added as an outer index level) and its columns as columns (very sensible).

So you can get the output you want by converting the Series into a DataFrame.
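To make the difference concrete, here is a small sketch that applies both kinds of return value to the data from the question. Selecting the columns c2 and c3 explicitly is an assumption of this sketch, not part of the original answer; it keeps the result independent of whether the installed pandas version passes the grouping column to the function.

```python
import pandas as pd

df = pd.DataFrame({'c1': list('aaabbb'),
                   'c2': list('acbbca'),
                   'c3': [1, 3, 2, 10, 12, 7]})

# Assumption of this sketch: work on the non-key columns only.
grouped = df.groupby('c1')[['c2', 'c3']]

# Returning a Series: its index (c2, c3) becomes the columns,
# and the groupby key c1 becomes the index of the result.
as_series = grouped.apply(lambda g: g.loc[g['c3'].idxmax()])

# Returning a one-row DataFrame: the original row label survives
# as the inner level of the result's index.
as_frame = grouped.apply(lambda g: g.loc[[g['c3'].idxmax()]])

print(as_series.index.tolist())
print(as_frame.index.tolist())

# Dropping the outer (group key) level leaves the original labels.
restored = as_frame.reset_index(level=0, drop=True)
print(restored.index.tolist())
```

The second form is the one that preserves the original row labels, which is why the solution below converts the Series to a DataFrame before returning it.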

With Pandas 0.13 or newer you can use to_frame().T:

def maxrow(x, col):
    # idxmax() returns the index label of the maximum; argmax() returns a position in current pandas
    return x.loc[x[col].idxmax()].to_frame().T

result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)

yields

  c1 c2  c3
1  a  c   3
4  b  c  12

In Pandas 0.12 or older, the equivalent would be:

def maxrow(x, col):
    ser = x.loc[x[col].idxmax()]
    df = pd.DataFrame({ser.name: ser}).T
    return df
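For comparison, the same rows can also be selected without apply at all. This is a sketch of a different technique, not the method described above: it assumes the DataFrame's index labels are unique, and when several rows tie for the maximum, idxmax picks the first one.

```python
import pandas as pd

df = pd.DataFrame({'c1': list('aaabbb'),
                   'c2': list('acbbca'),
                   'c3': [1, 3, 2, 10, 12, 7]})

# For each c1 group, find the index label of the row with the largest c3.
labels = df.groupby('c1')['c3'].idxmax()

# Look those labels up in the original frame; the original index is kept.
result = df.loc[labels]
print(result)
```

Because the lookup goes through df.loc, c1 stays an ordinary column and no reset_index step is needed.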

By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. However, the sort raises the time complexity from O(n) to O(n log n), so it becomes slower than the to_frame solution shown above when applied to larger DataFrames.

Here is how I benchmarked it:

import pandas as pd
import numpy as np
import timeit


def reset_df_first(df):
    df2 = df.reset_index()
    result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
    result.set_index(['index'], inplace=True)
    return result

def maxrow(x, col):
    result = x.loc[x[col].idxmax()].to_frame().T
    return result

def using_to_frame(df):
    result = df.groupby('c1').apply(maxrow, 'c3')
    result.reset_index(level=0, drop=True, inplace=True)
    return result

def using_sort(df):
    # DataFrame.sort() was removed from pandas; sort_values() is its replacement
    return df.sort_values('c3').groupby('c1', as_index=False).tail(1)


for N in (100, 1000, 2000):
    df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
                       'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
                       'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})

    df = pd.concat([df]*N)
    df.reset_index(inplace=True, drop=True)

    timing = dict()
    for func in (reset_df_first, using_to_frame, using_sort):
        timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
                              'import __main__ as m ',
                              number=10)

    print('For N = {}'.format(N))
    for func in sorted(timing, key=timing.get):
        print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
    print()

yields

For N = 100
using_sort          : 0.018
using_to_frame      : 0.0265
reset_df_first      : 0.0303

For N = 1000
using_to_frame      : 0.0358    \
using_sort          : 0.036     / this is roughly where the two methods cross over in terms of performance
reset_df_first      : 0.0432

For N = 2000
using_to_frame      : 0.0457
reset_df_first      : 0.0523
using_sort          : 0.0569

(reset_df_first was another possibility I tried.)

Answer 2

Try this:

df.sort_values('c3').groupby('c1', as_index=False).tail(1)
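Broken into steps, the one-liner can be sketched as follows. The sketch uses sort_values, the current name for the sorting method, and the intermediate name by_c3 is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'c1': list('aaabbb'),
                   'c2': list('acbbca'),
                   'c3': [1, 3, 2, 10, 12, 7]})

# An ascending sort puts each group's largest c3 last within that group.
by_c3 = df.sort_values('c3')

# tail(1) keeps the last row of every group and leaves the row labels untouched.
result = by_c3.groupby('c1').tail(1)
print(result)
```

Since tail only filters rows, c1 remains a regular column and the original index labels are carried through.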
