
Dropping rows in pandas based on a more complex condition

I have the following data frame:

time        id  type
2012-12-19  1   abcF1
2013-11-02  1   xF1yz
2012-12-19  1   abcF1
2012-12-18  1   abcF1
2013-11-02  1   xF1yz
2006-07-07  5   F5spo
2006-07-06  5   F5spo
2005-07-07  5   F5abc

For a given id, I need to find the max date.

For that max date I need to check the type.

I have to drop every row for the given id if the type differs from the type of the max date.

Example for target data frame:

time        id  type
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02  1   xF1yz
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02  1   xF1yz
2006-07-07  5   F5spo
2006-07-06  5   F5spo //kept because although the date is not max, it has the same type as the row with the max date for id 5
<deleted because for id 5 the date is not the max value and the type differs from the type of the max date for id 5>

How can I achieve this? I am new to pandas and trying to learn the proper way to use the library.

Use DataFrameGroupBy.idxmax to get the index of the row with the max time for each id, select only the id and type columns from those rows, and join the result back to the original frame with DataFrame.merge:

df = df.merge(df.loc[df.groupby('id')['time'].idxmax(), ['id','type']])
print (df)
        time  id   type
0 2013-11-02   1  xF1yz
1 2013-11-02   1  xF1yz
2 2006-07-07   5  F5spo
3 2006-07-06   5  F5spo

Or use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.merge(df.sort_values('time').drop_duplicates('id', keep='last')[["id", "type"]])
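
To make the two one-liners easier to try out, here is a minimal self-contained sketch. It rebuilds the sample frame from the question (the variable names latest_idx, latest_rows, by_idxmax and by_sort are only illustrative) and runs both variants side by side. Note that merge returns a frame with a fresh 0..n-1 index, so the original row labels are not preserved.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# Variant 1: idxmax returns the index label of the latest row per id
latest_idx = df.groupby("id")["time"].idxmax()
by_idxmax = df.merge(df.loc[latest_idx, ["id", "type"]])

# Variant 2: sort by time and keep only the last row per id
latest_rows = df.sort_values("time").drop_duplicates("id", keep="last")
by_sort = df.merge(latest_rows[["id", "type"]])

print(by_idxmax)
print(by_sort)
```

With no on argument, merge joins on all columns the two frames share, here id and type, which is what restricts the result to rows whose type matches the latest type of their id.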

You can sort the dataframe by time, then group by id and choose the last row in each group. That is the row with the largest date.

last_rows = df.sort_values('time').groupby('id').last()

Then merge the original dataframe with the new one:

result = df.merge(last_rows, on=["id", "type"])
#       time_x  id   type      time_y
#0  2013-11-02   1  xF1yz  2013-11-02
#1  2013-11-02   1  xF1yz  2013-11-02
#2  2006-07-07   5  F5spo  2006-07-07
#3  2006-07-06   5  F5spo  2006-07-07

If needed, drop the duplicated time column (time_y) that the merge added:

result.drop('time_y', axis=1, inplace=True)
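
The extra time_y column appears because last_rows still carries its own time column. A possible variation, sketched here under the assumption that only the type of the latest row is needed, is to select just the type column before merging so there is nothing to drop afterwards. The name latest_type is illustrative. Keep in mind that GroupBy.last takes the last non-null value of each column, so with missing types it is not strictly "the last row".

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# Keep only the type of the latest row per id; reset_index turns
# the group key "id" back into an ordinary column.
latest_type = (
    df.sort_values("time")
      .groupby("id")[["type"]]
      .last()
      .reset_index()
)

# Both frames now share only id and type, so no _x/_y suffixes appear
result = df.merge(latest_type, on=["id", "type"])
print(result)
```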

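Create a helper Series using set_index, groupby and transform with idxmax. Because type is set as the index, idxmax returns the type of the row with the max time for each id. Then use boolean indexing:

# If necessary cast to datetime dtype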
# df['time'] = pd.to_datetime(df['time'])

s = df.set_index('type').groupby('id')['time'].transform('idxmax')
df[df.type == s.values]

[out]

        time  id   type
1 2013-11-02   1  xF1yz
4 2013-11-02   1  xF1yz
5 2006-07-07   5  F5spo
6 2006-07-06   5  F5spo
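
The comparison above relies on s.values lining up with df by position. If you prefer to keep the alignment explicit, one possible alternative (a sketch, not part of the original answer; the names latest_idx, type_of_latest and mask are illustrative) is to build an id-to-type lookup and map it onto the id column. Like boolean indexing above, this preserves the original index labels.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# Index label of the latest row for each id
latest_idx = df.groupby("id")["time"].idxmax()

# Lookup Series: id -> type of the latest row
type_of_latest = df.loc[latest_idx].set_index("id")["type"]

# Compare each row's type with the latest type of its id
mask = df["type"] == df["id"].map(type_of_latest)
filtered = df[mask]
print(filtered)
```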

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'time': ['2012-12-19', '2013-11-02', '2012-12-19', '2012-12-18', '2013-11-02', '2006-07-07', '2006-07-06', '2005-07-07'],
    'id': [1,1,1,1,1,5,5,5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz', 'F5spo', 'F5spo', 'F5abc']
})

df['time'] = pd.to_datetime(df['time'])

def remove_non_max_date_ids(df):
    # type of the row with the latest time in this group
    max_type = df.loc[df['time'].idxmax(), 'type']
    # keep only the rows that share that type
    return df[
        df['type'] == max_type
    ]

df.groupby('id').apply(remove_non_max_date_ids)

Create a helper function that keeps only the rows whose type matches the type of the row with the max date, then apply it to each group of rows sharing the same id.
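
One caveat for any idxmax-based approach: when several rows of an id share the latest date, idxmax returns only the first of them. In the question's data the tied rows have the same type, so it does not matter. If tied rows could carry different types, a possible sketch, under the assumption that every type seen on the latest date should be kept, is the following (is_latest and latest_pairs are illustrative names):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# True for every row whose time equals the max time of its id (ties included)
is_latest = df["time"] == df.groupby("id")["time"].transform("max")

# All distinct (id, type) pairs that occur on the latest date
latest_pairs = df.loc[is_latest, ["id", "type"]].drop_duplicates()

# Keep the rows whose (id, type) pair is among them
result = df.merge(latest_pairs, on=["id", "type"])
print(result)
```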

Another way, using duplicated:

import pandas as pd

# if needed
df['time'] = pd.to_datetime(df['time'])

# sort by id and time in ascending order, then flag every row
# that is not the last (latest) one for its id
df = df.sort_values(by=['id','time'], ascending=[True,True])
df['time_max'] = df.duplicated(subset=['id'], keep='last')
# keep only the latest row per id and remember its type
df2 = df.loc[~df['time_max'],['id','type']].rename(columns={'type':'type_max'}).copy()

# merge with the original df
df = pd.merge(df, df2, on=['id'], how='left')
# despite its name, for_drop is True for the rows to keep
df['for_drop'] = df['type']==df['type_max']
df = df.loc[df['for_drop'],:]

[out]:

df
    time        id  type    time_max    type_max    for_drop
3   2013-11-02  1   xF1yz   True          xF1yz       True
4   2013-11-02  1   xF1yz   False         xF1yz       True
6   2006-07-06  5   F5spo   True          F5spo       True
7   2006-07-07  5   F5spo   False         F5spo       True
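
The result above still contains the helper columns time_max, type_max and for_drop. As a rough sketch of the same idea without storing helpers on the frame (variable names such as ordered, is_latest and latest are illustrative), the flags can be kept in separate objects and only the original columns selected at the end:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# Sort so that the latest row of each id comes last within its id
ordered = df.sort_values(["id", "time"])

# duplicated(keep="last") is False only for the last row of each id
is_latest = ~ordered.duplicated(subset=["id"], keep="last")

# One row per id holding the type of its latest row
latest = ordered.loc[is_latest, ["id", "type"]].rename(columns={"type": "type_max"})

# Attach type_max to every row, then keep matching rows and original columns
merged = ordered.merge(latest, on="id", how="left")
result = merged.loc[merged["type"] == merged["type_max"], ["time", "id", "type"]]
print(result)
```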
