How to get rows in pandas data frame, with maximal values in a column and keep the original index?

I have a pandas data frame. The first column can contain the same value several times (in other words, the values in the first column are not unique).

Whenever several rows contain the same value in the first column, I would like to keep only the one that has the maximal value in the third column. I almost found a solution:

import pandas

ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})

df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print(df)
print('--------------------')
# iloc replaces the removed irow; argmax gives the position of the largest c3 within each group
print(df.groupby('c1').apply(lambda g: g.iloc[g['c3'].argmax()]))

As a result I get:

  c1 c2  c3
0  a  a   1
1  a  c   3
2  a  b   2
3  b  b  10
4  b  c  12
5  b  a   7
--------------------
   c1 c2  c3
c1          
a   a  c   3
b   b  c  12

My problem is that I do not want to have c1 as the index. What I want is the following:

  c1 c2  c3
1  a  c   3
4  b  c  12

When calling df.groupby(...).apply(foo), the type of object returned by foo affects the way the results are melded together.

If you return a Series, the index of the Series becomes the columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).

If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible).

So, you can arrange for the type of output you desire by converting your Series into a DataFrame.
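The difference between the two return types can be seen side by side in the sketch below. It rebuilds the question's data and, as an assumption made only for this illustration, selects the non-key columns c2 and c3 before calling apply so that the result does not depend on how a given pandas version treats the grouping column inside apply. The names grouped, as_series and as_frame are made up for the example.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
    'c3': [1, 3, 2, 10, 12, 7],
})

# Assumption for this sketch: work on the non-key columns only
grouped = df.groupby('c1')[['c2', 'c3']]

# Returning a Series (one row): the group key becomes the index,
# and the original row label is lost
as_series = grouped.apply(lambda g: g.loc[g['c3'].idxmax()])

# Returning a one-row DataFrame: the original row label survives
# as the inner level of the resulting index
as_frame = grouped.apply(lambda g: g.loc[[g['c3'].idxmax()]])

print(as_series)
print(as_frame)
```

In the second result the outer index level holds the c1 key and the inner level holds the original labels, which is why dropping the outer level recovers the original index.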

With Pandas 0.13 or later you can use to_frame().T:

def maxrow(x, col):
    # idxmax returns the index label of the maximum, which is what .loc expects
    return x.loc[x[col].idxmax()].to_frame().T

result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)

yields

  c1 c2  c3
1  a  c   3
4  b  c  12

In Pandas 0.12 or older, the equivalent would be:

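import pandas as pd

def maxrow(x, col):
    ser = x.loc[x[col].idxmax()]
    # Build a one-column frame keyed by the row's label, then transpose it into a one-row frame
    df = pd.DataFrame({ser.name: ser}).T
    return df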
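As a side note that is not part of the original answer, the same rows can also be selected without apply: compute the winning index label per group with idxmax and pass those labels to .loc. The sketch below assumes the question's data; best_labels is a name made up for the example.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
    'c3': [1, 3, 2, 10, 12, 7],
})

# For each c1 group, the index label of the row with the largest c3
best_labels = df.groupby('c1')['c3'].idxmax()

# Selecting by label keeps the original index and all three columns
result = df.loc[best_labels]
print(result)
```

Because the rows are taken directly from df, the column dtypes are preserved, whereas transposing a mixed-type row with to_frame().T produces object columns.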

By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. However, the sort raises the time complexity from O(n) to O(n log n), so it becomes slower than the to_frame solution shown above when applied to larger DataFrames.

Here is how I benchmarked it:

import pandas as pd
import numpy as np
import timeit


def reset_df_first(df):
    df2 = df.reset_index()
    result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
    result.set_index(['index'], inplace=True)
    return result

def maxrow(x, col):
    # idxmax returns the index label of the maximum, which is what .loc expects
    result = x.loc[x[col].idxmax()].to_frame().T
    return result

def using_to_frame(df):
    result = df.groupby('c1').apply(maxrow, 'c3')
    result.reset_index(level=0, drop=True, inplace=True)
    return result

def using_sort(df):
    # DataFrame.sort was removed in later pandas releases; sort_values replaces it
    return df.sort_values('c3').groupby('c1', as_index=False).tail(1)


for N in (100, 1000, 2000):
    df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
                       'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
                       'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})

    df = pd.concat([df]*N)
    df.reset_index(inplace=True, drop=True)

    timing = dict()
    for func in (reset_df_first, using_to_frame, using_sort):
        timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
                              'import __main__ as m ',
                              number=10)

    print('For N = {}'.format(N))
    for func in sorted(timing, key=timing.get):
        print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
    print()

yields

For N = 100
using_sort          : 0.018
using_to_frame      : 0.0265
reset_df_first      : 0.0303

For N = 1000
using_to_frame      : 0.0358    \
using_sort          : 0.036     / this is roughly where the two methods cross over in terms of performance
reset_df_first      : 0.0432

For N = 2000
using_to_frame      : 0.0457
reset_df_first      : 0.0523
using_sort          : 0.0569

(reset_df_first was another possibility I tried.)

Try this:

df.sort_values('c3').groupby('c1', as_index=False).tail(1)
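A runnable sketch of this approach on the question's data follows. It uses sort_values, the replacement for the older DataFrame.sort, and passes kind='mergesort' as an optional addition (not in the original answer) so that the sort is stable and the outcome is predictable when several rows in a group share the maximum.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
    'c3': [1, 3, 2, 10, 12, 7],
})

# Sort ascending by c3, then take the last row of each c1 group.
# tail() returns rows of the sorted frame, so the original index labels are kept.
result = (
    df.sort_values('c3', kind='mergesort')
      .groupby('c1', as_index=False)
      .tail(1)
)
print(result)
```

With a stable sort, ties are resolved in favor of the row that appears last in the original frame, since equal values keep their relative order and tail(1) takes the last one.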

