How to partially match a list and write matched characters in a data frame in Python
I have two data frames, df1 and df2. I want to match them so that every df2 value that matches a value in one column of df1 shows up in that df1 row.
Here is some sample data I made:
import pandas as pd
# initialize list of lists
data = [["AA", 'ABC_111' ], ["BB", 'ABC_112'], ["CC", 'ABC_113']]
data1= [['ABC_111_12'], ['ABC_112_45'], ['ABC_112_89'],['ABC_113_06'], ['ABC_113_25'], ['ABC_113_89']]
result= [['AA' ,'ABC_111', 'ABC_111_12','ABC_111_19'], ['BB', 'ABC_112', "ABC_112_45",'ABC_112_89' ],
['CC','ABC_113', 'ABC_113_89','ABC_113_06', 'ABC_113_25', 'ABC_113_29']]
# Create the pandas DataFrame
df1= pd.DataFrame(data, columns = [0, 1])
df2= pd.DataFrame(data1, columns = [0])
result_df = pd.DataFrame(result, columns = [0, 1, 2, 3, 4,5])
# print dataframe.
print("df1: \n",df1)
print("df2: \n",df2)
print("expected_result: \n",result_df)
df1:
0 1
0 AA ABC_111
1 BB ABC_112
2 CC ABC_113
df2:
0
0 ABC_111_12
1 ABC_112_45
2 ABC_112_89
3 ABC_113_06
4 ABC_113_25
5 ABC_113_89
So my expected result is something like this:
expected_result:
0 1 2 3 4 5
0 AA ABC_111 ABC_111_12 ABC_111_19 None None
1 BB ABC_112 ABC_112_45 ABC_112_89 None None
2 CC ABC_113 ABC_113_89 ABC_113_06 ABC_113_25 ABC_113_29
I'm going to answer assuming the structure of the text entries is indicative of the real data you want to work with, as that is an important part of how I would go about solving this.
The most important thing here is isolating which bits of text in each value are static and which are variable.
If they are something like:
AAA_NNN_NN
A=alpha
N=numeric
The length of the As or Ns can vary, so long as the number of underscores is static. If there are always two in each entry of the data1 list, then we are in business.
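Before relying on that structure, it may be worth checking that every value really has it. The following is a minimal sketch, assuming the AAA_NNN_NN shape described above; the sample values and the names pattern, well_formed and malformed are made up for illustration and are not part of the question.

```python
import re

# Hypothetical sample values; the last one is deliberately malformed.
values = ["ABC_111_12", "ABC_112_45", "ABC_113_06", "ABC113_06"]

# Letters, underscore, digits, underscore, digits: the AAA_NNN_NN shape.
# Lengths of each part may vary, but there must be exactly two underscores.
pattern = re.compile(r"[A-Za-z]+_[0-9]+_[0-9]+")

well_formed = [v for v in values if pattern.fullmatch(v)]
malformed = [v for v in values if not pattern.fullmatch(v)]

print("well formed:", well_formed)
print("malformed:", malformed)
```

Anything that lands in malformed would need separate handling before the splitting approach is applied.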
There are a few ways to go about this, and the ultimate structure would depend on how well you know the data and what shortcuts you can engineer into the solution, but a slow method would be to do some exact matching on string splits.
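results = []
for i, j in data:
    # start each row with the label and the key, e.g. ['AA', 'ABC_111']
    tmp = [i, j]
    for k in data1:
        # 'ABC_111_12' -> ['ABC', '111', '12']
        h = k[0].split("_")
        if h[1] in j:
            tmp.append(k[0])
        else:
            # placeholder for every data1 entry that does not match
            tmp.append(None)
    results.append(tmp)
for i in results:
    print(i)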
The condition "if h[1] in j:" checks whether '111' is in the string 'ABC_111' in the first case, which it is.
This is very crude, but it should give you an idea of exact matching within structured strings. What matters here is the structure. There are many ways to match things, but a lot of it comes down to the data you are working with.
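As one illustration of such a shortcut, the values can be grouped by their prefix once instead of being compared against every key. This is only a sketch, assuming each data1 value is its key plus one extra "_NN" suffix; the names groups and results are chosen for the example.

```python
from collections import defaultdict

data = [["AA", "ABC_111"], ["BB", "ABC_112"], ["CC", "ABC_113"]]
data1 = [["ABC_111_12"], ["ABC_112_45"], ["ABC_112_89"],
         ["ABC_113_06"], ["ABC_113_25"], ["ABC_113_89"]]

# Group each data1 value under its prefix (everything before the last "_").
groups = defaultdict(list)
for (value,) in data1:
    prefix, _suffix = value.rsplit("_", 1)
    groups[prefix].append(value)

# Build one row per data entry: label, key, then every matching value.
# Keys without a match simply get no extra entries.
results = [[label, key, *groups.get(key, [])] for label, key in data]

for row in results:
    print(row)
```

Unlike the loop above, this compares whole prefixes rather than substrings and does not pad the rows with None, so the rows have different lengths.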
I hope this helps guide you to a solution.
This works for the data provided.
1. Split the data into the 'root' and the remaining value with rsplit()
2. Use groupby and agg to put the remaining values in a list
3. Use apply() to piece the data back together, then expand the lists to columns
4. Concatenate with df1
# split on the last underscore: 'ABC_111_12' -> 'ABC_111', '12'
df2[['cola', 'colb']] = df2[0].str.rsplit('_', n=1, expand=True)
# collect the suffixes of each root into a list
df3 = df2[['cola', 'colb']].groupby('cola').agg(list).reset_index()
# rebuild the full values from root and suffix
df3['colb'] = df3.apply(lambda x: [x.cola + '_' + i for i in x.colb], axis=1)
# expand the lists into columns and attach them to df1
df4 = pd.DataFrame(df3['colb'].tolist(), index=df3.index)
pd.concat([df1, df4], axis=1)
    0        1           0           1           2
0  AA  ABC_111  ABC_111_12        None        None
1  BB  ABC_112  ABC_112_45  ABC_112_89        None
2  CC  ABC_113  ABC_113_06  ABC_113_25  ABC_113_89
Try this using rsplit, groupby, cumcount, set_index and unstack:
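# derive the key from each df2 value and merge it onto df1's key column
dfm = (df2.assign(keystr=df2[0].str.rsplit('_', n=1).str[0])
          .merge(df1, left_on='keystr', right_on=1))
# number the matches within each key, then unstack them into columns
df_out = (dfm.set_index(['0_y',
                         'keystr',
                         dfm.groupby(['0_y', 'keystr'])['0_x'].cumcount()])['0_x']
             .unstack().reset_index())
print(df_out)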
Output:
0_y keystr 0 1 2
0 AA ABC_111 ABC_111_12 NaN NaN
1 BB ABC_112 ABC_112_45 ABC_112_89 NaN
2 CC ABC_113 ABC_113_06 ABC_113_25 ABC_113_89
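One thing to be aware of is that a default (inner) merge drops any df1 row that has no match in df2. The sketch below is a variation, not the code above: it uses a left merge and pivot so unmatched rows are kept. The column names label, key, value and n, and the extra "DD" row, are assumptions added for illustration.

```python
import pandas as pd

# Sample frames from the question, with hypothetical column names and one
# extra row ("DD") that has no match in df2.
df1 = pd.DataFrame(
    [["AA", "ABC_111"], ["BB", "ABC_112"], ["CC", "ABC_113"], ["DD", "ABC_999"]],
    columns=["label", "key"],
)
df2 = pd.DataFrame(
    {"value": ["ABC_111_12", "ABC_112_45", "ABC_112_89",
               "ABC_113_06", "ABC_113_25", "ABC_113_89"]}
)

# Derive the join key by dropping everything after the last underscore.
df2 = df2.assign(key=df2["value"].str.rsplit("_", n=1).str[0])

# A left merge keeps every df1 row, matched or not.
long = df1.merge(df2, on="key", how="left")

# Number the matches within each key, then spread them into columns.
long["n"] = long.groupby("key").cumcount()
wide = long.pivot(index=["label", "key"], columns="n", values="value").reset_index()

print(wide)
```

The "DD" row survives with missing values in every match column, which may or may not be what you want for your real data.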