Compare two dataframes and filter the matched values
Previous question: Pandas Compare two dataframes and determine the matched values
I have two dataframes:
print(a)
   ID                             Value
0  AA12 101 BB101 CC01 DD06       1
1  AA12 101 BB101 CC01 DD06       2
2  AA11 102 BB101 CC01 2341 DD07  2
3  AA10 202 BB101 CC01 3451 DD09  3
4  AA13 103 BB101 CC02 1231       4
5  AA14 203 BB101 CC02 4561       5
print(b)
   ID                             Value
0  AA12 101 BB101 CC01 1351 DD06  1
1  AA12 101 BB101 CC01 1351 DD06  2
2  AA11 102 BB101 CC01 DD07       2
3  AA10 202 BB101 CC01 3451 DD09  3
4  AA13 103 BB101 CC02            4
5  AA14 203 BB101 CC02 4561       6
Desired output:
   ID                             Value  ID Matched?  Value Matched?
0  AA12 101 BB101 CC01 DD06       1      Yes          Yes
1  AA12 101 BB101 CC01 DD06       2      Yes          Yes
2  AA11 102 BB101 CC01 2341 DD07  2      Yes          Yes
3  AA10 202 BB101 CC01 3451 DD09  3      Yes          Yes
4  AA13 103 BB101 CC02 1231       4      No           Yes
5  AA14 203 BB101 CC02 4561       5      Yes          No
Here's the code written by @MaxU from the previous post:
pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')),
         b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')),
         on=['x','Value'],
         how='outer',
         indicator=True)
What I want to achieve:
The result of this code is here. Unfortunately, it doesn't achieve the desired result. Only index 3 gets matched. I was tweaking the code but couldn't figure out the next step.
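To see why so few rows line up, it may help to reproduce the merge on the frames above. The sketch below is an illustration, not part of the original post: it rebuilds `a` and `b` by hand, and the helper name `sorted_tokens` is made up for brevity. Because the merge key is the whole sorted token string plus `Value`, a row is only tagged `both` when the two IDs hold exactly the same tokens and the values are equal.

```python
import pandas as pd

# Frames rebuilt by hand from the printed output in the question.
a = pd.DataFrame({
    "ID": ["AA12 101 BB101 CC01 DD06",
           "AA12 101 BB101 CC01 DD06",
           "AA11 102 BB101 CC01 2341 DD07",
           "AA10 202 BB101 CC01 3451 DD09",
           "AA13 103 BB101 CC02 1231",
           "AA14 203 BB101 CC02 4561"],
    "Value": [1, 2, 2, 3, 4, 5],
})
b = pd.DataFrame({
    "ID": ["AA12 101 BB101 CC01 1351 DD06",
           "AA12 101 BB101 CC01 1351 DD06",
           "AA11 102 BB101 CC01 DD07",
           "AA10 202 BB101 CC01 3451 DD09",
           "AA13 103 BB101 CC02",
           "AA14 203 BB101 CC02 4561"],
    "Value": [1, 2, 2, 3, 4, 6],
})

def sorted_tokens(ids):
    # Hypothetical helper: normalise token order so "B A" and "A B" compare equal.
    return ids.str.split().apply(sorted).str.join(" ")

merged = pd.merge(a.assign(x=sorted_tokens(a.ID)),
                  b.assign(x=sorted_tokens(b.ID)),
                  on=["x", "Value"],
                  how="outer",
                  indicator=True)

# Rows whose full token set AND Value are identical in both frames.
both = merged[merged["_merge"] == "both"]
print(merged["_merge"].value_counts())
print(both[["x", "Value"]])
```

With this data only the `Value == 3` row is tagged `both`; every other row differs either by one token (for example `1351`) or by `Value`, so it ends up as `left_only` or `right_only`.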
Thank you so much for your time and consideration!
Try this:
First, let's split and stack the ID column in both DFs:
In [248]: d1 = df1.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})
...: d2 = df2.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})
In [249]: d1
Out[249]:
Value idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 DD06
5 2 0 AA12
6 2 1 101
7 2 2 BB101
8 2 3 CC01
9 2 4 DD06
10 2 0 AA11
11 2 1 102
12 2 2 BB101
13 2 3 CC01
14 2 4 2341
15 2 5 DD07
16 3 0 AA10
17 3 1 202
18 3 2 BB101
19 3 3 CC01
20 3 4 3451
21 3 5 DD09
22 4 0 AA13
23 4 1 103
24 4 2 BB101
25 4 3 CC02
26 4 4 1231
27 5 0 AA14
28 5 1 203
29 5 2 BB101
30 5 3 CC02
31 5 4 4561
In [250]: d2
Out[250]:
Value idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 1351
5 1 5 DD06
6 2 0 AA12
7 2 1 101
8 2 2 BB101
9 2 3 CC01
10 2 4 1351
11 2 5 DD06
12 2 0 AA11
13 2 1 102
14 2 2 BB101
15 2 3 CC01
16 2 4 DD07
17 3 0 AA10
18 3 1 202
19 3 2 BB101
20 3 3 CC01
21 3 4 3451
22 3 5 DD09
23 4 0 AA13
24 4 1 103
25 4 2 BB101
26 4 3 CC02
27 6 0 AA14
28 6 1 203
29 6 2 BB101
30 6 3 CC02
31 6 4 4561
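A small aside on what `idx` holds in the frames above: `stack()` leaves its new index level unnamed, `reset_index()` turns that level into a column called `level_1`, and the column is the position of each token inside its `ID` string, not the row it came from. The minimal sketch below shows this on a two-row toy frame (the frame `toy` and its contents are invented for illustration).

```python
import pandas as pd

# Hypothetical two-row frame, just to show what stack() + reset_index() produce.
toy = pd.DataFrame({"ID": ["AA12 101", "AA11 102"], "Value": [1, 2]})

long_df = (toy.set_index("Value")["ID"]
              .str.split(expand=True)   # one column per token position: 0, 1
              .stack()                  # index becomes (Value, token position)
              .to_frame("ID")
              .reset_index())           # the unnamed level shows up as 'level_1'

print(long_df)
```

Renaming `level_1` to `idx`, as done above, therefore gives a token-position column that restarts at 0 for every original row.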
Now we can search for 'not matched' IDs:
In [251]: no_match_idx = d1.loc[~d1.ID.isin(d2.ID), 'idx'].unique()
In [252]: no_match_idx
Out[252]: array([4], dtype=int64)
In [253]: df1['Matched_ID'] = ~df1.index.isin(no_match_idx)
...: df1['Matched_Value'] = df1.Value.isin(df2.Value)
Result:
In [254]: df1
Out[254]:
   ID                             Value  Matched_ID  Matched_Value
0  AA12 101 BB101 CC01 DD06       1      True        True
1  AA12 101 BB101 CC01 DD06       2      True        True
2  AA11 102 BB101 CC01 2341 DD07  2      True        True
3  AA10 202 BB101 CC01 3451 DD09  3      True        True
4  AA13 103 BB101 CC02 1231       4      False       True
5  AA14 203 BB101 CC02 4561       5      True        False
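One thing to keep in mind: in `d1`, `idx` is the position of a token within its `ID` string, so `no_match_idx` is a set of token positions that is then compared against row labels. If the goal is to flag the rows of `df1` that contain a token missing from `df2`, a variant that carries the row label along may be closer to the intent. Below is a minimal sketch under that assumption, using `Series.explode` instead of `stack` (a different technique from the code above); the names `tokens1`, `tokens2`, `unmatched_rows` and `result` are made up here.

```python
import pandas as pd

# Frames rebuilt by hand from the printed output in the question.
df1 = pd.DataFrame({
    "ID": ["AA12 101 BB101 CC01 DD06",
           "AA12 101 BB101 CC01 DD06",
           "AA11 102 BB101 CC01 2341 DD07",
           "AA10 202 BB101 CC01 3451 DD09",
           "AA13 103 BB101 CC02 1231",
           "AA14 203 BB101 CC02 4561"],
    "Value": [1, 2, 2, 3, 4, 5],
})
df2 = pd.DataFrame({
    "ID": ["AA12 101 BB101 CC01 1351 DD06",
           "AA12 101 BB101 CC01 1351 DD06",
           "AA11 102 BB101 CC01 DD07",
           "AA10 202 BB101 CC01 3451 DD09",
           "AA13 103 BB101 CC02",
           "AA14 203 BB101 CC02 4561"],
    "Value": [1, 2, 2, 3, 4, 6],
})

# One token per entry; explode() repeats the original row label in the index.
tokens1 = df1["ID"].str.split().explode()
tokens2 = df2["ID"].str.split().explode()

# Row labels of df1 holding at least one token that appears in no df2 ID.
missing = ~tokens1.isin(tokens2).to_numpy()
unmatched_rows = tokens1.index[missing].unique()

result = df1.assign(
    Matched_ID=~df1.index.isin(unmatched_rows),
    Matched_Value=df1["Value"].isin(df2["Value"]),
)
print(result)
```

On this data the sketch flags rows 2 and 4, because neither `2341` nor `1231` appears in any `ID` of `df2`. That differs from the desired table in the question for row 2, so whether this reading is the right one depends on how a match is meant to be defined.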