I am trying to merge two DataFrames in pandas on a common column that holds the names of geographical areas. The values in that column are similar but not identical. For example, the value in one DataFrame is London, whereas in the other it is London / Greater London. They are treated as different values, but they should be treated as the same value when merging.
In[1]:
import pandas as pd
df1 = pd.DataFrame([['London', 2], ['Bristol', 3], ['Liverpool', 6]], columns=['Area', 'B'])
df2 = pd.DataFrame([['London / Greater London', 7], ['Bristol_', 9], ['Liverpool / Liverpool', 1]], columns=['Area', 'B'])
df_merged = pd.merge(df1, df2, on="Area", indicator=True, how='outer')
df_merged
Out[1]:
Area B_x B_y _merge
0 London 2.0 NaN left_only
1 Bristol 3.0 NaN left_only
2 Liverpool 6.0 NaN left_only
3 London / Greater London NaN 7.0 right_only
4 Bristol_ NaN 9.0 right_only
5 Liverpool / Liverpool NaN 1.0 right_only
The ideal output would be something like the below:
Out[1]:
Area B_x B_y _merge
0 London 2.0 7.0 both
1 Bristol 3.0 9.0 both
2 Liverpool 6.0 1.0 both
Is there a way to merge these two DataFrames based on a certain level of similarity between values, so that London and London / Greater London are treated as the same value? Thank you!
You can first create two arrays containing the row indices of the matching Area values using np.where(). I used a nested list comprehension to check whether each city name from df1 is contained in each area name from df2; np.where() then returns the indices of the pairs that match.
Note: this only works if the Area string in df2 contains the city string from df1 (i.e. London is only matched with London / Greater London because that area contains the word London).
The code:
import numpy as np
import pandas as pd
# Rename column B (present in both DataFrames) to B_x and B_y
df1 = pd.DataFrame([['London', 2], ['Bristol', 3], ['Liverpool', 6]], columns=['Area', 'B_x'])
df2 = pd.DataFrame([['London / Greater London', 7], ['Bristol_', 9], ['Liverpool / Liverpool', 1]], columns=['Area', 'B_y'])
# Create indices of matching string patterns
i, j = np.where([[city in area for area in df2['Area'].values] for city in df1['Area'].values])
# Create new dataframe with found indices
pd.DataFrame(np.column_stack([df1.iloc[i], df2.iloc[j]]), columns=df1.columns.append(df2.columns))
Result
Area B_x Area B_y
0 London 2 London / Greater London 7
1 Bristol 3 Bristol_ 9
2 Liverpool 6 Liverpool / Liverpool 1
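If you want the layout from the question (a single Area column plus the _merge indicator), one option is to map each area in df2 onto a city name from df1 first and then use a normal pd.merge. The following is only a sketch under some assumptions: closest_city is a hypothetical helper I made up for illustration, the 0.8 cutoff is an arbitrary choice, and the difflib fallback from the standard library is my addition for names that are close but not substrings (it is not part of the approach above).

```python
import difflib

import pandas as pd

df1 = pd.DataFrame([['London', 2], ['Bristol', 3], ['Liverpool', 6]], columns=['Area', 'B'])
df2 = pd.DataFrame([['London / Greater London', 7], ['Bristol_', 9], ['Liverpool / Liverpool', 1]], columns=['Area', 'B'])


def closest_city(area, cities, cutoff=0.8):
    """Map an area label onto one of `cities` (hypothetical helper, not a pandas API)."""
    # 1) Substring check, same idea as the np.where() approach above.
    for city in cities:
        if city in area:
            return city
    # 2) Assumed fallback: compare the text before the first '/' by similarity.
    head = area.split('/')[0].strip()
    close = difflib.get_close_matches(head, cities, n=1, cutoff=cutoff)
    # Keep the original label when nothing is similar enough.
    return close[0] if close else area


cities = df1['Area'].tolist()
# Replace df2's Area with the matched city name so both frames share a key.
df2_keyed = df2.assign(Area=df2['Area'].map(lambda a: closest_city(a, cities)))
df_merged = pd.merge(df1, df2_keyed, on='Area', how='outer', indicator=True)
print(df_merged)
```

With the sample data every row should come out as both. Note that an outer merge sorts the keys, so the row order can differ from df1. If several df2 areas map to the same city you will get one output row per pair, so check for duplicates before relying on this.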