
Python / Pandas - Consider 'empty string' as a match during merge using multiple columns

I'm trying to merge two dataframes on multiple columns: ['Unit','Geo','Region']. The condition is: when a value from right_df encounters an empty string in left_df, it should be treated as a match.

For example, when the first row of right_df joins with the first row of left_df, left_df has an empty string in the 'Region' column. The empty string should therefore be treated as a match for 'AU', giving the final result 'DE'.

import pandas as pd

left_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','ACCTEST1','ACCTEST1'],
                    'Geo':['AP','JAPAN','NA','Europe','Europe','','','AP','Europe','NA'],
                    'Region':['','','','France','BENELUX','','','','',''],
                    'Resp':['DE','FG','BO','MD','KR','PM','NJ','JI','HN','FG']})

right_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','DEV','ACCTEST1','TEST1','TEST2','DEV','TEST1','TEST2'],
                    'Geo':['AP','JAPAN','AP','NA','AP','Europe','Europe','Europe','AP','JAPAN','AP','Europe','Europe','Europe'],
                    'Region':['AU','JAPAN','ISA','USA','AU/NZ','France','CEE','France','ISA','JAPAN','ISA','BENELUX','CEE','CEE']})


I tried the code below, but it only works when the fields that are empty strings in left_df contain actual values. I'm struggling to add a condition such as 'consider an empty string as a match' or 'if right_df encounters an empty string, ignore it and continue with the remaining keys'. I would appreciate any help. Thanks!

result_df = pd.merge(left_df, right_df, how='inner', on=['Unit','Geo','Region'])
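To see why the plain inner merge falls short, the sketch below rebuilds the two frames from the question and runs the same merge. pandas compares key values literally, so an empty string in left_df only matches an empty string in right_df. With this data, only the two rows whose Region is filled in on both sides should survive.

```python
import pandas as pd

left_df = pd.DataFrame({
    'Unit': ['DEV', 'DEV', 'DEV', 'DEV', 'DEV', 'TEST1', 'TEST2',
             'ACCTEST1', 'ACCTEST1', 'ACCTEST1'],
    'Geo': ['AP', 'JAPAN', 'NA', 'Europe', 'Europe', '', '', 'AP', 'Europe', 'NA'],
    'Region': ['', '', '', 'France', 'BENELUX', '', '', '', '', ''],
    'Resp': ['DE', 'FG', 'BO', 'MD', 'KR', 'PM', 'NJ', 'JI', 'HN', 'FG'],
})
right_df = pd.DataFrame({
    'Unit': ['DEV', 'DEV', 'DEV', 'DEV', 'TEST1', 'TEST2', 'ACCTEST1',
             'DEV', 'ACCTEST1', 'TEST1', 'TEST2', 'DEV', 'TEST1', 'TEST2'],
    'Geo': ['AP', 'JAPAN', 'AP', 'NA', 'AP', 'Europe', 'Europe',
            'Europe', 'AP', 'JAPAN', 'AP', 'Europe', 'Europe', 'Europe'],
    'Region': ['AU', 'JAPAN', 'ISA', 'USA', 'AU/NZ', 'France', 'CEE',
               'France', 'ISA', 'JAPAN', 'ISA', 'BENELUX', 'CEE', 'CEE'],
})

# Same merge as in the question: every key column must be literally equal.
result_df = pd.merge(left_df, right_df, how='inner', on=['Unit', 'Geo', 'Region'])

# Only DEV/Europe/France and DEV/Europe/BENELUX match exactly.
print(result_df)
```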

It looks like there's some mismatch in your mapping; however, you can use the update method to handle the empty strings:

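import numpy as np
import pandas as pd

# replace empty strings with NaN so that update() can fill them
left_df = left_df.replace('', np.nan)

# fill only the NaN cells with values from right_df
# (update aligns rows by index label, not by key columns)
left_df.update(right_df, overwrite=False)

# merge
df = pd.merge(left_df, right_df, how='right', on=['Unit','Geo','Region'])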

Hope this gives you some idea.
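One thing to keep in mind with this approach is that DataFrame.update pairs rows by index label, not by key columns such as Unit or Geo. The minimal sketch below uses two hypothetical toy frames, a and b (not from the question), whose keys appear in a different row order, to illustrate the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: same keys, but listed in a different row order.
a = pd.DataFrame({'k': ['x', 'y'], 'v': [np.nan, np.nan]})
b = pd.DataFrame({'k': ['y', 'x'], 'v': [1.0, 2.0]})

# overwrite=False fills only the cells that are NaN in `a`.
# Row 0 of `a` is filled from row 0 of `b`, regardless of the value in 'k'.
a.update(b, overwrite=False)
print(a)
```

Here the row with k == 'x' receives 1.0, the value that b stores for 'y'. The approach is therefore only safe when both frames are already aligned row by row, which may be the "mismatch" the answer refers to.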

Use DataFrame.merge inside a list comprehension and perform the left merge operations in the following order:

  1. Merge right_df with left_df on columns Unit, Geo and Region, and select column Resp.

  2. Merge right_df with left_df (with duplicate Unit and Geo combinations dropped) on columns Unit and Geo, and select column Resp.

  3. Merge right_df with left_df (with duplicate Unit values dropped) on column Unit, and select column Resp.

Then use functools.reduce with Series.combine_first as the reducing function to combine all the series in the list s, and assign the result to the Resp column of right_df.
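As a minimal sketch of the reduce step in isolation, the three hypothetical Series below (exact, by_unit_geo and by_unit are illustrative names, not part of the answer) stand in for the Resp columns of the three merges, ordered from most to least specific. combine_first keeps a value from the earlier Series and only falls back to the later one where the earlier value is missing.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical lookup results, ordered from most specific to least specific.
exact = pd.Series(['MD', np.nan, np.nan])
by_unit_geo = pd.Series(['MD', 'DE', np.nan])
by_unit = pd.Series(['DE', 'DE', 'PM'])

# Equivalent to exact.combine_first(by_unit_geo).combine_first(by_unit).
combined = reduce(pd.Series.combine_first, [exact, by_unit_geo, by_unit])
print(combined.tolist())  # ['MD', 'DE', 'PM']
```

The full solution applies the same idea to the Resp columns produced by the three merges: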


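from functools import reduce

# key columns, ordered from broadest to most specific
c = ['Unit', 'Geo', 'Region']

# i = 0 merges on all three keys, i = 1 on Unit and Geo, i = 2 on Unit only
s = [right_df.merge(left_df.drop_duplicates(c[:len(c) - i]),
              on=c[:len(c) - i], how='left')['Resp'] for i in range(len(c))]

# take the most specific match first and fall back to the broader ones
right_df['Resp'] = reduce(pd.Series.combine_first, s)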

Result:

print(right_df)

        Unit     Geo   Region Resp
0        DEV      AP       AU   DE
1        DEV   JAPAN    JAPAN   FG
2        DEV      AP      ISA   DE
3        DEV      NA      USA   BO
4      TEST1      AP    AU/NZ   PM
5      TEST2  Europe   France   NJ
6   ACCTEST1  Europe      CEE   HN
7        DEV  Europe   France   MD
8   ACCTEST1      AP      ISA   JI
9      TEST1   JAPAN    JAPAN   PM
10     TEST2      AP      ISA   NJ
11       DEV  Europe  BENELUX   KR
12     TEST1  Europe      CEE   PM
13     TEST2  Europe      CEE   NJ
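For readers who find the list comprehension dense, the same fallback can be unrolled into three explicit merges. This is a sketch under the assumption that the frames are exactly those from the question. The names exact, by_unit_geo, by_unit and resp are illustrative and do not appear in the answer.

```python
import pandas as pd

left_df = pd.DataFrame({
    'Unit': ['DEV', 'DEV', 'DEV', 'DEV', 'DEV', 'TEST1', 'TEST2',
             'ACCTEST1', 'ACCTEST1', 'ACCTEST1'],
    'Geo': ['AP', 'JAPAN', 'NA', 'Europe', 'Europe', '', '', 'AP', 'Europe', 'NA'],
    'Region': ['', '', '', 'France', 'BENELUX', '', '', '', '', ''],
    'Resp': ['DE', 'FG', 'BO', 'MD', 'KR', 'PM', 'NJ', 'JI', 'HN', 'FG'],
})
right_df = pd.DataFrame({
    'Unit': ['DEV', 'DEV', 'DEV', 'DEV', 'TEST1', 'TEST2', 'ACCTEST1',
             'DEV', 'ACCTEST1', 'TEST1', 'TEST2', 'DEV', 'TEST1', 'TEST2'],
    'Geo': ['AP', 'JAPAN', 'AP', 'NA', 'AP', 'Europe', 'Europe',
            'Europe', 'AP', 'JAPAN', 'AP', 'Europe', 'Europe', 'Europe'],
    'Region': ['AU', 'JAPAN', 'ISA', 'USA', 'AU/NZ', 'France', 'CEE',
               'France', 'ISA', 'JAPAN', 'ISA', 'BENELUX', 'CEE', 'CEE'],
})

# 1. Exact match on all three keys.
keys = ['Unit', 'Geo', 'Region']
exact = right_df.merge(left_df.drop_duplicates(keys), on=keys, how='left')['Resp']

# 2. Fallback on Unit and Geo, keeping one left row per (Unit, Geo) pair
#    so the merge cannot add rows.
keys = ['Unit', 'Geo']
by_unit_geo = right_df.merge(left_df.drop_duplicates(keys), on=keys, how='left')['Resp']

# 3. Fallback on Unit alone, keeping one left row per Unit.
keys = ['Unit']
by_unit = right_df.merge(left_df.drop_duplicates(keys), on=keys, how='left')['Resp']

# Prefer the most specific match and fill the gaps from the broader ones.
resp = exact.combine_first(by_unit_geo).combine_first(by_unit)
print(resp.tolist())
```

Note that the fallback steps keep the first left_df row for each Unit/Geo pair or Unit without checking that the dropped key is actually empty there. With this data, a hypothetical right_df row DEV / Europe / CEE would fall back to MD, the first DEV / Europe row, even though that row's Region is 'France'. If that matters, filtering left_df to rows with an empty Region (or Geo) before each fallback merge is one possible refinement.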
