I'm trying to merge two DataFrames on multiple columns: ['Unit','Geo','Region']. The condition is: when a value from right_df meets an empty string in left_df, it should be treated as a match.
For example, when the first row of right_df is joined with the first row of left_df, left_df has an empty string in the 'Region' column. That empty string should be treated as a match for 'AU', giving the final result 'DE'.
import pandas as pd

left_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','ACCTEST1','ACCTEST1'],
                        'Geo':['AP','JAPAN','NA','Europe','Europe','','','AP','Europe','NA'],
                        'Region':['','','','France','BENELUX','','','','',''],
                        'Resp':['DE','FG','BO','MD','KR','PM','NJ','JI','HN','FG']})
right_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','DEV','ACCTEST1','TEST1','TEST2','DEV','TEST1','TEST2'],
                         'Geo':['AP','JAPAN','AP','NA','AP','Europe','Europe','Europe','AP','JAPAN','AP','Europe','Europe','Europe'],
                         'Region':['AU','JAPAN','ISA','USA','AU/NZ','France','CEE','France','ISA','JAPAN','ISA','BENELUX','CEE','CEE']})
I tried the code below, but it only works when the empty strings are replaced with actual values. I'm struggling to add a condition such as 'treat an empty string as a match' or 'ignore the column when right_df encounters an empty string and continue with the columns that do match'. I would appreciate any help. Thanks!
result_df = pd.merge(left_df, right_df, how='inner', on=['Unit','Geo','Region'])
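To illustrate the problem, here is a minimal sketch on a trimmed-down, two-row version of the frames above (the shortened data and the name strict are only for illustration). A strict inner merge on all three columns keeps only the row whose keys match exactly and drops the row where left_df has an empty 'Region':

```python
import pandas as pd

# Trimmed-down versions of the frames from the question (illustrative only).
left_df = pd.DataFrame({
    'Unit': ['DEV', 'DEV'],
    'Geo': ['AP', 'Europe'],
    'Region': ['', 'France'],
    'Resp': ['DE', 'MD'],
})
right_df = pd.DataFrame({
    'Unit': ['DEV', 'DEV'],
    'Geo': ['AP', 'Europe'],
    'Region': ['AU', 'France'],
})

# '' in left_df is compared literally with 'AU', so the first row does not match.
strict = pd.merge(left_df, right_df, how='inner', on=['Unit', 'Geo', 'Region'])
print(strict)
```

Only the DEV / Europe / France row survives, which is why the empty strings need special handling.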
It looks like there is some mismatch in your mapping. However, you can use the update method to handle the empty strings:
import numpy as np
import pandas as pd

# replace empty strings with NaN
left_df = left_df.replace('', np.nan)
# fill the NaN cells with values from the other DataFrame
# (update aligns on index and column labels, not on key columns)
left_df.update(right_df, overwrite=False)
# merge
df = pd.merge(left_df, right_df, how='right', on=['Unit','Geo','Region'])
Hope this gives you some idea.
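As a small sketch of what update does here, assuming two tiny hypothetical frames named left and right (not the data from the question): with overwrite=False only the NaN cells are filled, and the values are taken from the row with the same index label in the other frame, not from a row with matching keys.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frames, used only to show how update aligns.
left = pd.DataFrame({'Unit': ['DEV', 'TEST1'], 'Region': ['', 'France']})
right = pd.DataFrame({'Unit': ['DEV', 'TEST1'], 'Region': ['AU', 'CEE']})

# Turn empty strings into NaN so that update can fill them.
left = left.replace('', np.nan)

# overwrite=False: existing values are kept, only NaN cells are filled,
# using the value at the same index label and column in `right`.
left.update(right, overwrite=False)
print(left)
```

Because the alignment is by index, this only gives sensible results when the rows of both frames are in corresponding order.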
Use DataFrame.merge inside a list comprehension and perform the left merge operations in the following order:
1. Merge right_df with left_df on columns Unit, Geo and Region, and select column Resp.
2. Merge right_df with left_df (dropping duplicate values in Unit and Geo) on columns Unit and Geo, and select column Resp.
3. Merge right_df with left_df (dropping duplicate values in Unit) on column Unit, and select column Resp.
Then use functools.reduce with the reducing function Series.combine_first to combine all the series in the list s, and assign the result to the Resp column in right_df.
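from functools import reduce
import pandas as pd

c = ['Unit', 'Geo', 'Region']
# one left merge per key subset: all three columns, then Unit and Geo, then Unit only
s = [right_df.merge(left_df.drop_duplicates(c[:len(c) - i]),
                    on=c[:len(c) - i], how='left')['Resp'] for i in range(len(c))]
# take the most specific match first and fall back to the broader ones
right_df['Resp'] = reduce(pd.Series.combine_first, s)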
Result:
print(right_df)
        Unit     Geo   Region  Resp
0        DEV      AP       AU    DE
1        DEV   JAPAN    JAPAN    FG
2        DEV      AP      ISA    DE
3        DEV      NA      USA    BO
4      TEST1      AP    AU/NZ    PM
5      TEST2  Europe   France    NJ
6   ACCTEST1  Europe      CEE    HN
7        DEV  Europe   France    MD
8   ACCTEST1      AP      ISA    JI
9      TEST1   JAPAN    JAPAN    PM
10     TEST2      AP      ISA    NJ
11       DEV  Europe  BENELUX    KR
12     TEST1  Europe      CEE    PM
13     TEST2  Europe      CEE    NJ
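To see how the three passes fill in for each other, here is a step-by-step sketch on a trimmed-down, three-row version of the data. The helper lookup and the names exact, by_unit_geo and by_unit are hypothetical and only unroll the list comprehension for readability:

```python
import pandas as pd

# Trimmed-down versions of the frames from the question (illustrative only).
left_df = pd.DataFrame({
    'Unit':   ['DEV', 'DEV', 'TEST1'],
    'Geo':    ['AP', 'Europe', ''],
    'Region': ['', 'France', ''],
    'Resp':   ['DE', 'MD', 'PM'],
})
right_df = pd.DataFrame({
    'Unit':   ['DEV', 'DEV', 'TEST1'],
    'Geo':    ['AP', 'Europe', 'AP'],
    'Region': ['AU', 'France', 'AU/NZ'],
})

def lookup(keys):
    """Left-merge right_df with left_df on `keys` only and return Resp."""
    # Dropping duplicates on the keys guarantees one output row per right_df row.
    unique_left = left_df.drop_duplicates(keys)
    return right_df.merge(unique_left, on=keys, how='left')['Resp']

exact = lookup(['Unit', 'Geo', 'Region'])  # most specific pass
by_unit_geo = lookup(['Unit', 'Geo'])      # ignores Region
by_unit = lookup(['Unit'])                 # ignores Geo and Region

# combine_first keeps the value from the more specific pass and fills its
# NaN entries from the broader one.
resp = exact.combine_first(by_unit_geo).combine_first(by_unit)
print(pd.DataFrame({'exact': exact, 'by_unit_geo': by_unit_geo,
                    'by_unit': by_unit, 'Resp': resp}))
```

Note that the fallback passes do not check that the skipped column is actually empty in left_df: a row falls back to the first left_df row with the same Unit (or Unit and Geo), which fits the data in the question but may need an extra filter on other data.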