
Python / Pandas - Consider 'empty string' as a match during merge using multiple columns

I'm trying to merge two dataframes on multiple columns: ['Unit','Geo','Region']. The condition is: when a value from right_df meets an empty string in the same column of left_df, it should be treated as a match.

For example, when the first row of right_df is joined with the first row of left_df, left_df has an empty string in the 'Region' column. That empty string should be treated as a match for 'AU', giving the final result 'DE'.

import pandas as pd

left_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','ACCTEST1','ACCTEST1'],
                    'Geo':['AP','JAPAN','NA','Europe','Europe','','','AP','Europe','NA'],
                    'Region':['','','','France','BENELUX','','','','',''],
                    'Resp':['DE','FG','BO','MD','KR','PM','NJ','JI','HN','FG']})


right_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','DEV','ACCTEST1','TEST1','TEST2','DEV','TEST1','TEST2'],
                    'Geo':['AP','JAPAN','AP','NA','AP','Europe','Europe','Europe','AP','JAPAN','AP','Europe','Europe','Europe'],
                    'Region':['AU','JAPAN','ISA','USA','AU/NZ','France','CEE','France','ISA','JAPAN','ISA','BENELUX','CEE','CEE']})    


I tried the code below, but it only works when the cells that are currently empty strings contain actual values. I'm struggling to express a condition such as 'treat an empty string as a match' or 'if right_df meets an empty string, ignore that column and continue with the columns that do match'. I would appreciate any help. Thanks!

result_df = pd.merge(left_df, right_df, how='inner', on=['Unit','Geo','Region'])
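
To illustrate why the exact-key merge misses these rows, here is a minimal sketch on a two-row subset of the data above (the names left, right and exact are only for illustration):

```python
import pandas as pd

left = pd.DataFrame({
    'Unit': ['DEV', 'DEV'],
    'Geo': ['AP', 'Europe'],
    'Region': ['', 'France'],
    'Resp': ['DE', 'MD'],
})
right = pd.DataFrame({
    'Unit': ['DEV', 'DEV'],
    'Geo': ['AP', 'Europe'],
    'Region': ['AU', 'France'],
})

# An inner merge keeps only rows whose key values are equal in both frames.
# '' is an ordinary string, so '' == 'AU' is False and the DEV/AP row is dropped.
exact = pd.merge(left, right, how='inner', on=['Unit', 'Geo', 'Region'])
print(exact)
```

Only the DEV/Europe/France row survives, which is the behaviour described above.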

It looks like there is some mismatch in your mapping; however, you can use the DataFrame.update method to handle the empty strings:

import numpy as np

# replace empty strings with NaN
left_df = left_df.replace('', np.nan)

# fill the NaN cells of left_df with values from right_df
# (update aligns rows by index label, not by the key columns)
left_df.update(right_df, overwrite=False)

# merge
df = pd.merge(left_df, right_df, how='right', on=['Unit','Geo','Region'])

Hope this gives you some idea.
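
One thing worth checking before relying on this approach: DataFrame.update lines rows up by index label rather than by the values in Unit or Geo. The sketch below uses two made-up rows (the frames left and right are illustrative, not the question's data) to show what that means in practice:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'Unit': ['DEV', 'DEV'],
                     'Geo': ['NA', 'Europe'],
                     'Region': [np.nan, 'France']})
right = pd.DataFrame({'Unit': ['DEV', 'DEV'],
                      'Geo': ['AP', 'Europe'],
                      'Region': ['AU', 'France']})

# With overwrite=False only the NaN cells of `left` are filled.
# Row 0 of `left` takes Region from row 0 of `right`,
# even though the Geo values differ ('NA' vs 'AP').
left.update(right, overwrite=False)
print(left)
```

So the fill is only meaningful if the two dataframes are already ordered so that matching rows share the same index.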

Use DataFrame.merge inside a list comprehension and perform the left merge operations in the following order:

  1. Merge right_df with left_df on columns Unit, Geo and Region, and select column Resp.

  2. Merge right_df with left_df (with duplicate Unit/Geo combinations dropped) on columns Unit and Geo, and select column Resp.

  3. Merge right_df with left_df (with duplicate Unit values dropped) on column Unit, and select column Resp.

Then use functools.reduce with Series.combine_first as the reducing function to combine all the series in the list s, so that each row takes the value from the most specific merge that found a match, and assign the result to the Resp column of right_df.
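
To see the reduce step in isolation, here is a small sketch with three made-up series standing in for the merge results (the names by_all_three, by_unit_geo and by_unit are hypothetical):

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical lookup results, most specific match first
by_all_three = pd.Series([np.nan, 'MD', np.nan])
by_unit_geo = pd.Series(['DE', 'MD', np.nan])
by_unit = pd.Series(['DE', 'DE', 'PM'])

# combine_first keeps the caller's value where it is not null
# and falls back to the other series otherwise.
resp = reduce(pd.Series.combine_first, [by_all_three, by_unit_geo, by_unit])
print(resp.tolist())  # ['DE', 'MD', 'PM']
```

Because the series are ordered from most to least specific, each position ends up with the first non-null value in that order.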


from functools import reduce

c = ['Unit', 'Geo', 'Region']
s = [right_df.merge(left_df.drop_duplicates(c[:len(c) - i]), 
              on=c[:len(c) - i], how='left')['Resp'] for i in range(len(c))]
right_df['Resp'] = reduce(pd.Series.combine_first, s)

Result:

print(right_df)

        Unit     Geo   Region Resp
0        DEV      AP       AU   DE
1        DEV   JAPAN    JAPAN   FG
2        DEV      AP      ISA   DE
3        DEV      NA      USA   BO
4      TEST1      AP    AU/NZ   PM
5      TEST2  Europe   France   NJ
6   ACCTEST1  Europe      CEE   HN
7        DEV  Europe   France   MD
8   ACCTEST1      AP      ISA   JI
9      TEST1   JAPAN    JAPAN   PM
10     TEST2      AP      ISA   NJ
11       DEV  Europe  BENELUX   KR
12     TEST1  Europe      CEE   PM
13     TEST2  Europe      CEE   NJ
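
Note that the fallback merges above use the first left_df row for a given Unit (or Unit and Geo) whether or not its remaining columns are empty. If only empty strings should act as wildcards, a stricter variant might look like the sketch below. The helper name lookup_with_wildcards and its parameters are hypothetical, and it assumes wildcards only appear in trailing key columns:

```python
from functools import reduce

import pandas as pd


def lookup_with_wildcards(right, left, keys, value):
    """Look up `value` for each row of `right`, treating '' in the
    trailing key columns of `left` as a wildcard (illustrative helper)."""
    candidates = []
    for n in range(len(keys), 0, -1):
        used, dropped = keys[:n], keys[n:]
        rows = left
        if dropped:
            # Keep only left rows whose dropped columns are all empty strings
            rows = left[left[dropped].eq('').all(axis=1)]
        rows = rows.drop_duplicates(used)[used + [value]]
        hit = right[keys].merge(rows, on=used, how='left')[value]
        hit.index = right.index  # merge resets the index, so restore it
        candidates.append(hit)
    # Most specific match wins; less specific ones only fill the gaps
    return reduce(pd.Series.combine_first, candidates)


left = pd.DataFrame({'Unit': ['DEV', 'DEV', 'DEV', 'TEST1'],
                     'Geo': ['AP', 'Europe', 'Europe', ''],
                     'Region': ['', 'France', 'BENELUX', ''],
                     'Resp': ['DE', 'MD', 'KR', 'PM']})
right = pd.DataFrame({'Unit': ['DEV', 'DEV', 'DEV', 'TEST1'],
                      'Geo': ['AP', 'Europe', 'Europe', 'JAPAN'],
                      'Region': ['AU', 'France', 'CEE', 'JAPAN']})

result = lookup_with_wildcards(right, left, ['Unit', 'Geo', 'Region'], 'Resp')
print(result.tolist())
```

Here the DEV/Europe/CEE row stays unmatched (NaN), because no left row for DEV/Europe has an empty Region. Whether that or the looser fallback is preferable depends on what the mapping is meant to express.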
