Merge dataframes based on partial string match between columns

Question

I have two dataframes, df1 and df2:

df1 = pd.DataFrame({'a':['123456','123457',  '23456', '23457', '345678','345679'],
                               'b':['e','f','g','h','i','j']})
df2 = pd.DataFrame({'id':['2', '123', '3456'],
                              'b1':['c1','c2','c3']})
df2:

 id       b1
2         c1
123       c2
3456      c3

df1:

 a       b
123456   e
123457   f
23456    g
23457    h
345678   i
345679   j

What I want to create:

df3 = pd.DataFrame({'a':['123456','123457',  '23456', '23457', '345678','345679'],
                               'b':['e','f','g','h','i','j'],
                               'id':['123','123','2','2','3456','3456'],
                               'b1':['c2','c2','c1','c1','c3','c3']})

 a       b     id     b1 
123456   e     123    c2
123457   f     123    c2
23456    g     2      c1
23457    h     2      c1
345678   i     3456   c3
345679   j     3456   c3

How can I merge the data from df2 into df1 where 'id' matches the start of the string in 'a', i.e. the first N characters of 'a', with N being the length of the string in 'id'?

Answer 1

You could try this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a':['123456','123457',  '23456', '23457', '345678','345679'],
                               'b':['e','f','g','h','i','j']})
df2 = pd.DataFrame({'id':['2', '123', '3456'],
                              'b1':['c1','c2','c3']})

df3_test = pd.DataFrame({'a':['123456','123457',  '23456', '23457', '345678','345679'],
                               'b':['e','f','g','h','i','j'],
                               'id':['123','123','2','2','3456','3456'],
                               'b1':['c2','c2','c1','c1','c3','c3']})

starts_with_map = map(df1['a'].str.startswith, df2['id'])

conditions = list(starts_with_map)
choices = range(len(conditions))

select_arr = np.select(conditions,
                choices,np.nan)
# array([1., 1., 0., 0., 2., 2.]), we'll use this to access df2.index below in pd.concat

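if np.isnan(select_arr).any():
    # at least one row has no match: build the rows by hand and pad unmatched rows with NaN
    vals = [df2.iloc[int(x),:].values if not np.isnan(x) else [np.nan]*df2.shape[1] for x in select_arr]
    df3 = pd.concat([df1,pd.DataFrame(vals, columns=df2.columns)], axis=1)
else:
    # every row matched: select_arr is a float array, so cast it to int before positional indexing
    df3 = pd.concat([df1,df2.iloc[select_arr.astype(int)].reset_index(drop=True)],axis=1)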

df3.equals(df3_test)
# True: i.e. result equals your df3_test

Explanation of the code:

map(df1['a'].str.startswith, df2['id']) applies startswith to each element of the iterable df2['id'], i.e. to '2', '123', and '3456'.
We want to feed this map to np.select. So, we need conditions, choices, and a default (used if nothing matches).
With conditions = list(starts_with_map) we turn the map into a list of 3 boolean Series (one for each element in df2['id']). For the first, it will be:

print(conditions[0])
0    False
1    False
2     True   # match '2' on '23456' (df1.loc[2,'a'])
3     True   # match '2' on '23457' (df1.loc[3,'a'])
4    False
5    False
Name: a, dtype: bool
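
To look at all three conditions at once, the sketch below puts the boolean Series side by side, one column per prefix. The name mask_table and the use of pd.concat here are only for illustration and are not part of the answer's code.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['123456', '123457', '23456', '23457', '345678', '345679'],
                    'b': ['e', 'f', 'g', 'h', 'i', 'j']})
df2 = pd.DataFrame({'id': ['2', '123', '3456'],
                    'b1': ['c1', 'c2', 'c3']})

# One boolean Series per prefix in df2['id'], same as list(starts_with_map)
conditions = list(map(df1['a'].str.startswith, df2['id']))

# Rows follow df1, columns are labelled with the prefixes from df2['id']
mask_table = pd.concat(conditions, axis=1, keys=df2['id'].tolist())
print(mask_table)
```

Each row of mask_table has exactly one True here, which is why every row of df1 ends up with exactly one match.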

We also define the choices: we want the matching row position in df2, so just 0, 1, 2, hence: choices = range(len(conditions)).
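
One detail worth knowing about np.select: when a row satisfies more than one condition, the first matching condition wins. The toy arrays below (cond_a, cond_b and picked are made-up names, not from the code above) are a small sketch of that behaviour.

```python
import numpy as np

# Two toy conditions over four rows; row 1 satisfies both of them
cond_a = np.array([True, True, False, False])
cond_b = np.array([False, True, True, False])

# The choices 0 and 1 play the role of row positions in df2;
# rows matching no condition get the default, NaN
picked = np.select([cond_a, cond_b], [0, 1], default=np.nan)
print(picked)
```

So if df2['id'] held overlapping prefixes such as '2' and '23', the order of the rows in df2 would decide which one is chosen. Sorting df2 by prefix length, longest first, would be one way to prefer the most specific match, assuming that is what you want.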
Finally, we add the if/else construction to make sure you don't run into errors if no match is found. E.g. suppose that df2 looked like this:

df2 = pd.DataFrame({'id':['2', '321', '3456'],
                              'b1':['c1','c2','c3']})

In that case select_arr would become array([nan, nan, 0., 0., 2., 2.]) (i.e. no matches for '123456' and '123457' in df1), and we would run into an error trying to access df2 at position nan, which does not exist.
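
As an alternative to the if/else branch, here is a sketch (not the code from the answer above) that uses -1 as the "no match" marker, turns the positions into a key column, and lets a left merge fill unmatched rows with NaN. The names pos and ids are made up for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': ['123456', '123457', '23456', '23457', '345678', '345679'],
                    'b': ['e', 'f', 'g', 'h', 'i', 'j']})
df2 = pd.DataFrame({'id': ['2', '321', '3456'],
                    'b1': ['c1', 'c2', 'c3']})

conditions = list(map(df1['a'].str.startswith, df2['id']))

# Integer positions into df2; -1 marks rows of df1 without any matching prefix
pos = np.select(conditions, list(range(len(conditions))), default=-1)

# Look up the matched prefix for each row, then blank out the unmatched rows
ids = pd.Series(df2['id'].to_numpy()[pos], index=df1.index).where(pos >= 0)

# A left merge keeps every row of df1; rows without a key get NaN in 'b1'
df3 = df1.assign(id=ids).merge(df2, on='id', how='left')
print(df3)
```

With this df2, the first two rows of df3 have NaN in both 'id' and 'b1', and the remaining rows are filled as before.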

1 solution

Solution 1 (2022-07-12 12:12:19)
