You get duplicates when performing the join. For instance:
import pandas as pd
left_data = {'name':['John','Mark'],'value':[1,5]}
right_data = {'name':['John','Mark','John','Mark'],'children':['Celius','Stingher','Celius','Stingher'],'process_date':['2019-02-05','2019-02-05','2019-03-05','2019-03-05']}
left_df = pd.DataFrame(left_data)
right_df = pd.DataFrame(right_data)
right_df['process_date'] = pd.to_datetime(right_df['process_date'])
This is what they look like:
print(left_df)
   name  value
0  John      1
1  Mark      5
print(right_df)
   name  children process_date
0  John    Celius   2019-02-05
1  Mark  Stingher   2019-02-05
2  John    Celius   2019-03-05
3  Mark  Stingher   2019-03-05
Even though the merge is a left join, right_df contains multiple process_date values for each name, so the rows of the left dataframe are duplicated in order to fit all the values passed by the right dataframe.
df = left_df.merge(right_df,how='left',left_on='name',right_on='name')
print(df)
   name  value  children process_date
0  John      1    Celius   2019-02-05
1  John      1    Celius   2019-03-05
2  Mark      5  Stingher   2019-02-05
3  Mark      5  Stingher   2019-03-05
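If you want to see the duplication coming before you merge, one option is to count how often each key appears in right_df: every left row is repeated once per matching right row. The following is only a sketch that rebuilds the frames from above; the names matches_per_name, expected_rows and keys_unique are illustrative and not part of the original answer. It also uses the validate argument of merge, which raises an error when the right-hand keys are not unique.

```python
import pandas as pd

left_df = pd.DataFrame({'name': ['John', 'Mark'], 'value': [1, 5]})
right_df = pd.DataFrame({
    'name': ['John', 'Mark', 'John', 'Mark'],
    'children': ['Celius', 'Stingher', 'Celius', 'Stingher'],
    'process_date': pd.to_datetime(
        ['2019-02-05', '2019-02-05', '2019-03-05', '2019-03-05']),
})

# How many right_df rows exist for each key (illustrative name).
matches_per_name = right_df['name'].value_counts()

# In a left join each left row appears once per match,
# or exactly once when there is no match at all.
expected_rows = int(left_df['name'].map(matches_per_name).fillna(1).sum())

merged = left_df.merge(right_df, how='left', on='name')
print(expected_rows, len(merged))

# validate='many_to_one' asks pandas to check that the keys in the
# right dataframe are unique, and to raise MergeError if they are not.
try:
    left_df.merge(right_df, how='left', on='name', validate='many_to_one')
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print(keys_unique)
```

When keys_unique comes back False, the merged result can be longer than left_df, which is the situation described here.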
One approach to filter this is to call .sort_values() on the column that defines the order and then .drop_duplicates(subset=list(left_df), keep='last') (or keep='first', depending on the sort direction). This way, we eliminate the duplicate rows and keep the most recent information available:
df = df.sort_values('process_date',ascending=True).drop_duplicates(list(left_df),keep='last')
print(df)
   name  value  children process_date
1  John      1    Celius   2019-03-05
3  Mark      5  Stingher   2019-03-05
The length of the merged dataframe now matches the length of left_df.
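A possible variation, not part of the approach above, is to remove the duplicates from right_df before merging, so the join never produces the extra rows in the first place. This is a minimal sketch that assumes you only need the most recent process_date per name; latest_right is a hypothetical name used for illustration.

```python
import pandas as pd

left_df = pd.DataFrame({'name': ['John', 'Mark'], 'value': [1, 5]})
right_df = pd.DataFrame({
    'name': ['John', 'Mark', 'John', 'Mark'],
    'children': ['Celius', 'Stingher', 'Celius', 'Stingher'],
    'process_date': pd.to_datetime(
        ['2019-02-05', '2019-02-05', '2019-03-05', '2019-03-05']),
})

# Keep only the most recent row per name in the right dataframe.
latest_right = (
    right_df.sort_values('process_date', ascending=True)
            .drop_duplicates('name', keep='last')
)

# With unique keys on the right, the left join keeps one row per left row.
df = left_df.merge(latest_right, how='left', on='name')

# Sanity check: no rows were added by the merge.
assert len(df) == len(left_df)
print(df)
```

Deduplicating first keeps the merged frame small, which may matter when right_df holds many dates per key. Sorting after the merge, as shown earlier, has the advantage that you can still inspect the full history before discarding it.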