简体   繁体   English

比较两个数据框并过滤匹配的值

[英]Compare two dataframes and filter the matched values

Previous question: Pandas Compare two dataframes and determine the matched values上一个问题: Pandas 比较两个数据框并确定匹配的值

I have two dataframes:我有两个数据框:

print(a)

    ID                              Value
0   AA12 101 BB101 CC01 DD06        1
1   AA12 101 BB101 CC01 DD06        2
2   AA11 102 BB101 CC01 2341 DD07   2
3   AA10 202 BB101 CC01 3451 DD09   3
4   AA13 103 BB101 CC02 1231        4
5   AA14 203 BB101 CC02 4561        5
print(b)

    ID                              Value
0   AA12 101 BB101 CC01 1351 DD06   1
1   AA12 101 BB101 CC01 1351 DD06   2
2   AA11 102 BB101 CC01 DD07        2
3   AA10 202 BB101 CC01 3451 DD09   3
4   AA13 103 BB101 CC02             4
5   AA14 203 BB101 CC02 4561        6 

Desired output :所需的输出

     ID                              Value   ID Matched?   Value Matched?
0    AA12 101 BB101 CC01 DD06        1       Yes           Yes 
1    AA12 101 BB101 CC01 DD06        2       Yes           Yes 
2    AA11 102 BB101 CC01 2341 DD07   2       Yes           Yes
3    AA10 202 BB101 CC01 3451 DD09   3       Yes           Yes
4    AA13 103 BB101 CC02 1231        4       No            Yes
5    AA14 203 BB101 CC02 4561        5       Yes           No

Here's the code written by @MaxU from the previous post:这是@MaxU 在上一篇文章中编写的代码:

pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')),
             b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')),
             on=['x','Value'],
             how='outer',
             indicator=True)

What I want to achieve :我想要实现的目标

  • If either dataframes do not contain four-digit item from ['ID'] (ie 2341, 3451), I want to exclude it from the matching process.如果任一数据帧不包含来自 ['ID'] 的四位数项目(即 2341、3451),我想将其从匹配过程中排除。
  • If Same ID appears more than once, they can have different values on ['Value'].如果相同 ID 出现多次,则它们可以在 ['Value'] 上具有不同的值。

The result of this code is here .这段代码的结果在这里 Unfortunately, it doesn't achieve the desired result.不幸的是,它没有达到预期的结果。 Only Index 3 gets matched.只有索引 3 匹配。 I was tweaking the code but couldn't figure out the next step.我正在调整代码,但无法弄清楚下一步。

Thanks you so much for your time and consideration!非常感谢您的时间和考虑!

Try this:尝试这个:

first let's split and stack ID column in both DFs:首先让我们在两个 DF 中拆分和堆叠 ID 列:

In [248]: d1 = df1.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})
     ...: d2 = df2.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})

In [249]: d1
Out[249]:
    Value  idx     ID
0       1    0   AA12
1       1    1    101
2       1    2  BB101
3       1    3   CC01
4       1    4   DD06
5       2    0   AA12
6       2    1    101
7       2    2  BB101
8       2    3   CC01
9       2    4   DD06
10      2    0   AA11
11      2    1    102
12      2    2  BB101
13      2    3   CC01
14      2    4   2341
15      2    5   DD07
16      3    0   AA10
17      3    1    202
18      3    2  BB101
19      3    3   CC01
20      3    4   3451
21      3    5   DD09
22      4    0   AA13
23      4    1    103
24      4    2  BB101
25      4    3   CC02
26      4    4   1231
27      5    0   AA14
28      5    1    203
29      5    2  BB101
30      5    3   CC02
31      5    4   4561

In [250]: d2
Out[250]:
    Value  idx     ID
0       1    0   AA12
1       1    1    101
2       1    2  BB101
3       1    3   CC01
4       1    4   1351
5       1    5   DD06
6       2    0   AA12
7       2    1    101
8       2    2  BB101
9       2    3   CC01
10      2    4   1351
11      2    5   DD06
12      2    0   AA11
13      2    1    102
14      2    2  BB101
15      2    3   CC01
16      2    4   DD07
17      3    0   AA10
18      3    1    202
19      3    2  BB101
20      3    3   CC01
21      3    4   3451
22      3    5   DD09
23      4    0   AA13
24      4    1    103
25      4    2  BB101
26      4    3   CC02
27      6    0   AA14
28      6    1    203
29      6    2  BB101
30      6    3   CC02
31      6    4   4561

now we can search for 'not matched' IDs:现在我们可以搜索'not matched' ID:

In [251]: no_match_idx = d1.loc[~d1.ID.isin(d2.ID), 'idx'].unique()

In [252]: no_match_idx
Out[252]: array([4], dtype=int64)

In [253]: df1['Matched_ID'] = ~df1.index.isin(no_match_idx)
     ...: df1['Matched_Value'] = df1.Value.isin(df2.Value)

Result:结果:

In [254]: df1
Out[254]:
                              ID  Value Matched_ID Matched_Value
0       AA12 101 BB101 CC01 DD06      1       True          True
1       AA12 101 BB101 CC01 DD06      2       True          True
2  AA11 102 BB101 CC01 2341 DD07      2       True          True
3  AA10 202 BB101 CC01 3451 DD09      3       True          True
4       AA13 103 BB101 CC02 1231      4      False          True
5       AA14 203 BB101 CC02 4561      5       True         False

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM