
How do I merge two DataFrames in pandas on a common column whose values are similar but not identical?

I am trying to merge two DataFrames in pandas on a common column that holds geographical area names. The values in the two columns are similar but not identical. For example, one DataFrame has London where the other has London / Greater London. pandas treats these as different values, but they should be treated as the same value when merging.

In[1]: 
import pandas as pd
df1 = pd.DataFrame([['London', 2], ['Bristol', 3], ['Liverpool', 6]], columns=['Area', 'B'])
df2 = pd.DataFrame([['London / Greater London', 7], ['Bristol_', 9], ['Liverpool / Liverpool', 1]], columns=['Area', 'B'])
df_merged = pd.merge(df1, df2, on="Area", indicator=True, how='outer')
df_merged

Out[1]: 
                      Area  B_x  B_y      _merge
0                   London  2.0  NaN   left_only
1                  Bristol  3.0  NaN   left_only
2                Liverpool  6.0  NaN   left_only
3  London / Greater London  NaN  7.0  right_only
4                 Bristol_  NaN  9.0  right_only
5    Liverpool / Liverpool  NaN  1.0  right_only

The ideal output would be something like the below:

Out[1]: 
                      Area  B_x  B_y      _merge
0                   London  2.0  7.0   both
1                  Bristol  3.0  9.0   both
2                Liverpool  6.0  1.0   both

Is there a way to merge these two dataframes based on a certain level of similarities in values so that London and London / Greater London values are treated as the same value? Thank you!

You can first use np.where() to build two index arrays, one per DataFrame, marking every pair in which a name from df1 (the city) appears inside a name from df2 (the area). A nested list comprehension checks each city against each area, and np.where() returns the row positions of the matches.

Note: This only works if the area string contains the city string (i.e. London is matched with London / Greater London only because that area contains the word London).
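To make that limitation concrete, here is a minimal sketch of the substring test on its own. The helper name contains and the York and lowercase inputs are hypothetical, added only for illustration; they are not part of the original data.

```python
def contains(city: str, area: str) -> bool:
    """Hypothetical helper: the same substring test the answer relies on."""
    return city in area

# Substring present: this pair is matched
matches = contains("London", "London / Greater London")

# Same place, different case: the test is case-sensitive, so no match
case_miss = contains("London", "london / greater london")

# Unrelated place whose name happens to contain the city string: still a match
false_hit = contains("York", "New York")

print(matches, case_miss, false_hit)
```

So the approach can both miss real matches (different case or spelling) and produce false ones (short names contained in longer, unrelated names). Normalising the strings first, for example with str.lower() and str.strip(), would address the first problem but not the second.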

The code:

import numpy as np
import pandas as pd

# Rename the shared column B to B_x in df1 and B_y in df2
df1 = pd.DataFrame([['London', 2], ['Bristol', 3], ['Liverpool', 6]], columns=['Area', 'B_x'])
df2 = pd.DataFrame([['London / Greater London', 7], ['Bristol_', 9], ['Liverpool / Liverpool', 1]], columns=['Area', 'B_y'])

# Row positions (i in df1, j in df2) of every pair where the df1 name is a substring of the df2 name
i, j = np.where([[city in area for area in df2['Area'].values] for city in df1['Area'].values])

# Build a new DataFrame with the matched rows placed side by side
pd.DataFrame(np.column_stack([df1.iloc[i], df2.iloc[j]]), columns=df1.columns.append(df2.columns))

Result

        Area B_x                     Area B_y
0     London   2  London / Greater London   7
1    Bristol   3                 Bristol_   9
2  Liverpool   6    Liverpool / Liverpool   1
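If you want the layout from the question (a single Area column with B_x, B_y and _merge), one possible follow-up is to use the matched indices to rewrite the df2 names to their df1 spelling and then call pd.merge as usual. This is only a sketch under the assumption that each df2 area matches at most one df1 city; the df2_keyed name is hypothetical.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['London', 2], ['Bristol', 3], ['Liverpool', 6]], columns=['Area', 'B'])
df2 = pd.DataFrame([['London / Greater London', 7], ['Bristol_', 9], ['Liverpool / Liverpool', 1]], columns=['Area', 'B'])

# Same substring test as in the answer: i indexes df1 rows, j indexes df2 rows
i, j = np.where([[city in area for area in df2['Area']] for city in df1['Area']])

# Assumption: each df2 row matches at most one df1 row.
# Overwrite the matched df2 names with the df1 spelling so the keys are equal.
df2_keyed = df2.copy()
df2_keyed.loc[df2_keyed.index[j], 'Area'] = df1['Area'].to_numpy()[i]

# An ordinary merge now finds the rows in both frames
df_merged = pd.merge(df1, df2_keyed, on='Area', how='outer', indicator=True)
print(df_merged)
```

With the sample data every row should come out marked as both. The row order of an outer merge can differ between pandas versions, so sort on Area afterwards if the order matters.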
