Merge dataframes based on partial string-match between columns
I have two dataframes df1 and df2:
df1 = pd.DataFrame({'a':['123456','123457', '23456', '23457', '345678','345679'],
'b':['e','f','g','h','i','j']})
df2 = pd.DataFrame({'id':['2', '123', '3456'],
'b1':['c1','c2','c3']})
df2:
id    b1
2     c1
123   c2
3456  c3
df1:
a       b
123456  e
123457  f
23456   g
23457   h
345678  i
345679  j
What I want to create:
df3 = pd.DataFrame({'a':['123456','123457', '23456', '23457', '345678','345679'],
'b':['e','f','g','h','i','j'],
'id':['123','123','2','2','3456','3456'],
'b1':['c2','c2','c1','c1','c3','c3']})
a       b  id    b1
123456  e  123   c2
123457  f  123   c2
23456   g  2     c1
23457   h  2     c1
345678  i  3456  c3
345679  j  3456  c3
How can I merge the data from df2 into df1 by matching 'id' against the start of 'a', i.e. against the first N characters of 'a', where N is the length of the string in 'id'?
You could try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':['123456','123457', '23456', '23457', '345678','345679'],
'b':['e','f','g','h','i','j']})
df2 = pd.DataFrame({'id':['2', '123', '3456'],
'b1':['c1','c2','c3']})
df3_test = pd.DataFrame({'a':['123456','123457', '23456', '23457', '345678','345679'],
'b':['e','f','g','h','i','j'],
'id':['123','123','2','2','3456','3456'],
'b1':['c2','c2','c1','c1','c3','c3']})
starts_with_map = map(df1['a'].str.startswith, df2['id'])
conditions = list(starts_with_map)
choices = range(len(conditions))
select_arr = np.select(conditions,
choices,np.nan)
# array([1., 1., 0., 0., 2., 2.]), we'll use this to access df2.index below in pd.concat
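if np.isnan(select_arr).any():
    # at least one row of df1 has no match: fill those rows with NaN
    vals = [df2.iloc[int(x),:].values if not np.isnan(x) else [np.nan]*df2.shape[1] for x in select_arr]
    df3 = pd.concat([df1,pd.DataFrame(vals, columns=df2.columns)], axis=1)
else:
    # every row matched: cast the float positions to int before using iloc
    df3 = pd.concat([df1,df2.iloc[select_arr.astype(int)].reset_index(drop=True)],axis=1)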
df3.equals(df3_test)
# True: i.e. result equals your df3_test
Explanation of the code:
map(df1['a'].str.startswith, df2['id']) applies df1['a'].str.startswith to each element of the iterable df2['id'], i.e. to '2', '123', and '3456'.
We want to feed this map to np.select, so we need conditions, choices, and a default (used if nothing matches).
With conditions = list(starts_with_map) we turn the map into a list of 3 boolean Series, one for each element in df2['id']. For the first, it will be:
print(conditions[0])
0 False
1 False
2 True # match '2' on '23456' (df1.loc[2,'a'])
3 True # match '2' on '23457' (df1.loc[3,'a'])
4 False
5 False
Name: a, dtype: bool
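To see all three conditions at once, the boolean Series can be placed side by side. This is only an illustrative sketch: the name matrix is introduced here and is not part of the answer's code.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['123456', '123457', '23456', '23457', '345678', '345679'],
                    'b': ['e', 'f', 'g', 'h', 'i', 'j']})
df2 = pd.DataFrame({'id': ['2', '123', '3456'],
                    'b1': ['c1', 'c2', 'c3']})

# One boolean Series per prefix in df2['id'] (same content as list(starts_with_map))
conditions = [df1['a'].str.startswith(prefix) for prefix in df2['id']]

# Hypothetical helper view: one column per prefix, one row per row of df1
matrix = pd.concat(conditions, axis=1, keys=df2['id'].tolist())
print(matrix)
```

Each column is True exactly where df1['a'] starts with that prefix.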
We also define the choices: we want the matching row position in df2, so just 0, 1, 2, hence: choices = range(len(conditions)).
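A small sketch of how np.select combines conditions, choices and the default. The sample values and the names a, prefixes and picked are made up for illustration. Note that np.select returns the choice of the first condition that is True, so when two prefixes both match, the one listed first wins.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration only
a = pd.Series(['123456', '23456', '999'])
prefixes = ['2', '23', '123']

# One boolean array per prefix
conditions = [a.str.startswith(p).to_numpy() for p in prefixes]
# The choice for condition k is simply k, the position of the prefix
choices = list(range(len(conditions)))

picked = np.select(conditions, choices, default=np.nan)
print(picked)
# '123456' matches only '123'           -> 2.0
# '23456' matches both '2' and '23'     -> 0.0 (first match wins)
# '999' matches nothing                 -> nan
```

If the longest prefix should win instead, one option would be to sort the prefixes by length, longest first, before building the conditions.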
Finally, we add the if/else construction to make sure you don't run into errors if no match is found. E.g. suppose that df2 looked like this:
df2 = pd.DataFrame({'id':['2', '321', '3456'],
'b1':['c1','c2','c3']})
In that case select_arr would become array([nan, nan, 0., 0., 2., 2.]) (i.e. no matches for '123456' and '123457' in df1), and we would run into an error trying to access df2 at index nan, which does not exist.
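As a possible alternative to the if/else branch, the no-match case can be handled with a sentinel position and reindex. This is a sketch, not the answer's method: the function name merge_on_prefix and its parameters are hypothetical, and it assumes the prefix column holds strings.

```python
import numpy as np
import pandas as pd

def merge_on_prefix(left, right, left_col, prefix_col):
    """Hypothetical helper: attach to each row of `left` the first row of
    `right` whose `prefix_col` value is a prefix of `left_col`."""
    right = right.reset_index(drop=True)  # make labels equal to positions 0..n-1
    conditions = [left[left_col].str.startswith(p).to_numpy()
                  for p in right[prefix_col]]
    # -1 marks "no match"; it is not a label in right's index
    pos = np.select(conditions, list(range(len(conditions))), default=-1)
    # reindex returns an all-NaN row for the missing label -1
    matched = right.reindex(pos).reset_index(drop=True)
    return pd.concat([left.reset_index(drop=True), matched], axis=1)

df1 = pd.DataFrame({'a': ['123456', '123457', '23456', '23457', '345678', '345679'],
                    'b': ['e', 'f', 'g', 'h', 'i', 'j']})
df2 = pd.DataFrame({'id': ['2', '321', '3456'],
                    'b1': ['c1', 'c2', 'c3']})

out = merge_on_prefix(df1, df2, 'a', 'id')
print(out)
```

With this df2, the first two rows of the result have NaN in id and b1, and the remaining rows are filled from df2 as before.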