
pandas: select rows with the max value of some columns for each different value of another column

I have a dataframe in pandas like this:

    id  some_type   some_date   some_data
0   1   A           19/12/1995  X
1   2   A           10/04/1997  Y
2   2   B           05/03/2013  Z
3   2   B           09/05/2017  W
4   2   B           09/05/2017  R
5   3   A           01/07/1998  M
6   3   B           09/08/2009  N

For each value of id, I need the rows that have the max value of some_type and some_date, without dropping any value of some_data.

In other words, what I need is the following:

    id  some_type   some_date   some_data
0   1   A           19/12/1995  X
3   2   B           09/05/2017  W
4   2   B           09/05/2017  R
6   3   B           09/08/2009  N

You can do it with sort_values, groupby and apply, keeping the rows whose some_type and some_date match the last row of each group after sorting:

# parse the dates first (they are dd/mm/yyyy), otherwise they sort as text
df['some_date'] = pd.to_datetime(df['some_date'], dayfirst=True)
df_output = (df.sort_values(by=['some_type', 'some_date']).groupby('id')
               .apply(lambda df_g: df_g[(df_g['some_type'] == df_g['some_type'].iloc[-1]) &
                                        (df_g['some_date'] == df_g['some_date'].iloc[-1])])
               .reset_index(level=0, drop=True))

and the output is:

   id some_type  some_date some_data
0   1         A 1995-12-19         X
3   2         B 2017-05-09         W
4   2         B 2017-05-09         R
6   3         B 2009-08-09         N
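
The same "keep the rows equal to the last row per group" idea can also be written without apply. The following is only a sketch, assuming the sample data from the question and day-first dates; the names keys and last_per_id are illustrative and do not come from the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'id':        [1, 2, 2, 2, 2, 3, 3],
    'some_type': ['A', 'A', 'B', 'B', 'B', 'A', 'B'],
    'some_date': ['19/12/1995', '10/04/1997', '05/03/2013', '09/05/2017',
                  '09/05/2017', '01/07/1998', '09/08/2009'],
    'some_data': ['X', 'Y', 'Z', 'W', 'R', 'M', 'N'],
})
# Assumption: the dates are day-first (dd/mm/yyyy).
df['some_date'] = pd.to_datetime(df['some_date'], format='%d/%m/%Y')

keys = ['some_type', 'some_date']
df_sorted = df.sort_values(by=keys)
# Broadcast each group's last (some_type, some_date) pair onto every row of the group.
last_per_id = df_sorted.groupby('id')[keys].transform('last')
# Keep the rows whose pair equals that last pair, then restore the original row order.
df_output = df_sorted[(df_sorted[keys] == last_per_id).all(axis=1)].sort_index()
print(df_output)
```

Because transform returns a frame aligned with the sorted rows, the original index labels (0, 3, 4, 6) are preserved without any reset_index call.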

EDIT: if you don't care about the index, you can also use merge:

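# first get the last (some_type, some_date) pair per id after sorting
# (some_date is assumed to be datetime already);
# reset_index() turns id back into a column so the merge also matches on id
df_last = (df.sort_values(['some_type', 'some_date'])
             .groupby('id')[['some_type', 'some_date']].last().reset_index())
# now merge with how='inner' to keep only the rows you want
df_output = df.merge(df_last, how='inner')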

You will get the same result, apart from the index.
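
For readers who want to run the merge variant on its own, here is a minimal sketch. It assumes the sample data from the question and day-first dates, and it passes the join keys explicitly through on= (my choice, not something the answer states) so that rows are matched on id as well as on the type/date pair:

```python
import pandas as pd

df = pd.DataFrame({
    'id':        [1, 2, 2, 2, 2, 3, 3],
    'some_type': ['A', 'A', 'B', 'B', 'B', 'A', 'B'],
    'some_date': ['19/12/1995', '10/04/1997', '05/03/2013', '09/05/2017',
                  '09/05/2017', '01/07/1998', '09/08/2009'],
    'some_data': ['X', 'Y', 'Z', 'W', 'R', 'M', 'N'],
})
# Assumption: the dates are day-first (dd/mm/yyyy).
df['some_date'] = pd.to_datetime(df['some_date'], format='%d/%m/%Y')

keys = ['some_type', 'some_date']
# Last (some_type, some_date) pair per id after sorting, with id as a regular column.
df_last = df.sort_values(keys).groupby('id')[keys].last().reset_index()
# Inner merge on id plus the pair keeps only the matching rows of df.
df_merged = df.merge(df_last, on=['id'] + keys, how='inner')
print(df_merged)
```

The merged frame gets a fresh 0..n-1 index, which is why this variant only fits when the original index labels do not matter.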

Create a mask with groupby and max(), then apply it. But first convert some_date to datetime:

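df['some_date'] = pd.to_datetime(df['some_date'], dayfirst=True)  # dates are dd/mm/yyyy
# True where a row holds its group's max in both columns
m = df.groupby('id')[['some_type', 'some_date']].transform(lambda x: x == x.max()).all(axis=1)
df = df[m]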
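
To see why the conversion matters, here is a small sketch using three of the date strings from the question (the variable names raw and parsed are illustrative). Taken as text, the strings are compared character by character, so the "largest" one is not the latest date:

```python
import pandas as pd

raw = pd.Series(['10/04/1997', '05/03/2013', '09/05/2017'])

# Text comparison: '1' sorts after '0', so the 1997 string wins.
print(raw.max())

# Parsed as day-first dates, the maximum is the latest date.
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
print(parsed.max())
```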

Full example:

import io

import pandas as pd

text = '''\
id  some_type   some_date   some_data
1   A           19/12/1995  X
2   A           10/04/1997  Y
2   B           05/03/2013  Z
2   B           09/05/2017  W
2   B           09/05/2017  R
3   A           01/07/1998  M
3   B           09/08/2009  N'''

# pd.compat.StringIO is gone from current pandas; io.StringIO does the same job
fileobj = io.StringIO(text)
df = pd.read_csv(fileobj, sep=r'\s+')

df['some_date'] = pd.to_datetime(df['some_date'], dayfirst=True)  # dates are dd/mm/yyyy

# True where a row holds its group's max in both columns
m = df.groupby('id')[['some_type', 'some_date']].transform(lambda x: x == x.max()).all(axis=1)

df = df[m]

print(df)

Returns:

   id some_type  some_date some_data
0   1         A 1995-12-19         X
3   2         B 2017-05-09         W
4   2         B 2017-05-09         R
6   3         B 2009-08-09         N

