
Compare multiple columns with special characters and merge dataframes

I have a dataframe (df1):

df1_ID    Col1_df1      Col2_df1    Col3_df1
ABC-001   a.102_103i    k159*       Test1
DEF-002   a.36-89E      k188        Test2
GHI-003   ab.23<<X      e542m       Test3

df2:

df2_ID1         df2_ID2    Count    Count_A  Count_B    To_Check
                ABC-001    10       0        10         FIRSTLINE:a.102_103i:ANYTHING:EXTRA
DEF-002;GHI-003            20       2        18         SECONDLINE:ab.23<<X:ANYTHING:EXTRA
ABC-001;DEF-002            15       3        12         THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA

Expected result (df3):

df1_ID  Col1_df1    Col2_df1    Col3_df1    df2_ID1 df2_ID2 Count   Count_A Count_B To_Check
ABC-001 a.102_103i  k159*       Test1               ABC-001 10      0       10      FIRSTLINE:a.102_103i:ANYTHING:EXTRA:k159*
DEF-002 a.36-89     k188        Test2       ABC-001;DEF-002 15      3       12      THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA
GHI-003 ab.23<<X    e542m       Test3       DEF-002;GHI-003 20      2       18      SECONDLINE:ab.23<<X:ANYTHING:EXTRA

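I want to check whether the values of Col1_df1 and Col2_df1 appear in the To_Check column of df2. If a value from Col1_df1 or Col2_df1 appears in To_Check, and df1_ID appears in df2_ID1 or df2_ID2, then that row of df2 should be merged into df1. If nothing matches, the df2 columns should be left blank.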

This is an extension of this question:

Vlookup function/merge in Pandas, but not an exact match

In that question, however, we were only dealing with plain strings. My data also contains special characters.
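Whether the special characters matter depends on how the match is done. Python's `in` operator and `str.find` compare text literally, while regex-based matching (including pandas' `Series.str.contains` with its default `regex=True`) treats characters such as `.` and `*` as pattern syntax. A minimal sketch using a value from the question; the `haystack` string is made up for illustration:

```python
import re

needle = "k159*"              # value from Col2_df1, contains a regex metacharacter
haystack = "LINE:k15:EXTRA"   # hypothetical To_Check value that does NOT contain "k159*"

# Literal substring test: "*" is just a character
literal_hit = needle in haystack

# Regex test: "9*" means "zero or more 9s", so the pattern also matches "k15"
regex_hit = re.search(needle, haystack) is not None

# Escaping the value makes the regex behave literally again
escaped_hit = re.search(re.escape(needle), haystack) is not None

print(literal_hit, regex_hit, escaped_hit)  # False True False
```

So a literal comparison is safe with this data, and a regex comparison needs `re.escape` (or `regex=False` in pandas) to avoid false positives.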

This syntax does not seem to work either when I try to find the values that are present in df2:

df1 = df1.assign(result=df1['Col1_df1'].isin(df2['To_Check']))
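`isin` tests whether each value is equal to one of the given values as a whole, so it cannot find `a.102_103i` inside the longer string `FIRSTLINE:a.102_103i:ANYTHING:EXTRA`. Below is a sketch of a substring check instead, assuming a literal match is what is wanted (hence `regex=False`). The frames are trimmed to the relevant columns, and the name `contained` is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"Col1_df1": ["a.102_103i", "a.36-89E", "ab.23<<X"]})
df2 = pd.DataFrame({"To_Check": [
    "FIRSTLINE:a.102_103i:ANYTHING:EXTRA",
    "SECONDLINE:ab.23<<X:ANYTHING:EXTRA",
    "THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA",
]})

# Exact-match test: each value is compared with whole To_Check strings
exact = df1["Col1_df1"].isin(df2["To_Check"])

# Substring test: is the value contained in any To_Check string?
# regex=False keeps characters such as "." and "*" literal
contained = df1["Col1_df1"].apply(
    lambda value: df2["To_Check"].str.contains(value, regex=False).any()
)

df1 = df1.assign(result=contained)
print(df1)
```

With this data `exact` is False for every row, while `result` is True for the first and third rows; `a.36-89E` is not found because df2 only contains `a.36-89D`.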

I also wrote another version, but it does not work either:

output = open("output.csv", "a")
with open("df1.csv", "r") as df1:
    first_line = df1.readline()
    output.write(first_line)
    with open("df2.csv", "r") as df2:
        second_first = df2.readline()
        output.write(second_first)
        for line_df1 in df1:
            df1_names = [x for x in line_df1.split(',')]
            for line_df2 in df2:
                df2_names = [x for x in line_df2.split(',')]
                check1 = any(df1_names[1] in string for string in df2_names[6])
                print(check1)

check1 is always False, even though the value is present.
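One likely reason: the indexed field is a single string, so `for string in ...` loops over its individual characters, and a multi-character value can never be `in` a one-character string. The sketch below assumes CSV lines laid out like the tables above (the two sample lines are made up, and To_Check is taken to be the sixth field, index 5):

```python
# Hypothetical CSV lines shaped like the question's data
line_df1 = "ABC-001,a.102_103i,k159*,Test1\n"
line_df2 = ",ABC-001,10,0,10,FIRSTLINE:a.102_103i:ANYTHING:EXTRA\n"

df1_names = line_df1.rstrip("\n").split(",")
df2_names = line_df2.rstrip("\n").split(",")
to_check = df2_names[5]  # a single string, not a list of strings

# Iterating over a string yields one character at a time,
# so a multi-character value is never found this way
per_character = any(df1_names[1] in string for string in to_check)

# Testing against the whole string, or against its ":"-separated parts, works
as_substring = df1_names[1] in to_check
as_token = df1_names[1] in to_check.split(":")

print(per_character, as_substring, as_token)  # False True True
```

The token test is stricter than the substring test: it only matches a complete `:`-separated part, which may or may not be what is wanted here.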

Thanks in advance for your help.

*Update

data_1={'df1_ID':['ABC-001','DEF-002','GHI-003']
      ,'Col1_df1':['a.102_103i','a.36-89E','ab.23<<X']
      ,'Col2_df1':['k159*','k188','e542m']
      ,'Col3_df1':['Test1','Test2','Test3']}

data_2={'df2_ID1':['','DEF-002;GHI-003','ABC-001;DEF-002']
      ,'df2_ID2':['ABC-001','','']
      ,'Count':['10','20','15']
      ,'Count_A':['0','2','3']
        ,'To_Check':['FIRSTLINE:a.102_103i:ANYTHING:EXTRA','SECONDLINE:ab.23<<X:ANYTHING:EXTRA','THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA']}

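Some clarification is needed on which conditions mean a row of df2 should be merged into df1, and on what the merged result should look like.

This snippet does what I think you are looking for as far as the conditions go, but it only adds one column to df1. That column records which df1 column matched the To_Check value of some df2 row, along with the index of that df2 row.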

import pandas as pd

# data_1 and data_2 are the dictionaries from the question's update
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)

check = []
for df1_ind in df1.index:
    found = ""
    for df2_ind in df2.index:
        col1_check = df1["Col1_df1"][df1_ind] in df2["To_Check"][df2_ind]
        col2_check = df1["Col2_df1"][df1_ind] in df2["To_Check"][df2_ind]
        # Computed but not used below: I wasn't sure whether df1_ID being present
        # should affect the column checks. Note this is an exact comparison, so it
        # does not look inside values such as 'ABC-001;DEF-002'.
        df1_id_present = df1["df1_ID"][df1_ind] in [df2['df2_ID1'][df2_ind], df2['df2_ID2'][df2_ind]]
        if col1_check:
            found += f"Col_1_present(df2_ID={df2_ind})::"
        if col2_check:
            found += f"Col_2_present(df2_ID={df2_ind})::"
        if found:
            # The df1 values we are looking for were found in this df2 row.
            # Leave the inner loop and keep these results, unless you expect
            # the values to appear in several rows of df2.
            break
    if not found:
        found = "false"
    check.append(found)

df1['Checks'] = check
print(df1.head())

Output:

    df1_ID    Col1_df1 Col2_df1 Col3_df1                     Checks
0  ABC-001  a.102_103i    k159*    Test1  Col_1_present(df2_ID=0)::
1  DEF-002    a.36-89E     k188    Test2  Col_2_present(df2_ID=2)::
2  GHI-003    ab.23<<X    e542m    Test3  Col_1_present(df2_ID=1)::

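The code below solves the problem using a join and apply. It uses your dataframes; the new code starts after the comment line # Data Manipulation. The filter condition used in apply is shown here, and you can easily change it to suit other needs: ((Col1_df1 in To_Check) or (Col2_df1 in To_Check)) and ((df1_ID in df2_ID1) or (df1_ID in df2_ID2))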

import pandas as pd
data_1=pd.DataFrame({'df1_ID':['ABC-001','DEF-002','GHI-003']
      ,'Col1_df1':['a.102_103i','a.36-89E','ab.23<<X']
      ,'Col2_df1':['k159*','k188','e542m']
      ,'Col3_df1':['Test1','Test2','Test3']})

data_2=pd.DataFrame({'df2_ID1':['','DEF-002;GHI-003','ABC-001;DEF-002']
      ,'df2_ID2':['ABC-001','','']
      ,'Count':['10','20','15']
      ,'Count_A':['0','2','3']
        ,'To_Check':['FIRSTLINE:a.102_103i:ANYTHING:EXTRA','SECONDLINE:ab.23<<X:ANYTHING:EXTRA','THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA']})

# Data Manipulation
# Constant key so the merge pairs every row of data_1 with every row of data_2
data_1['join_col'] = 1
data_2['join_col'] = 1

df = pd.merge(data_1, data_2, on='join_col')

def is_match(row):
    # (Col1_df1 or Col2_df1 found in To_Check) and (df1_ID found in df2_ID1 or df2_ID2)
    to_check = row['To_Check']
    value_found = row['Col1_df1'].strip() in to_check or row['Col2_df1'].strip() in to_check
    df1_id = row['df1_ID'].strip()
    id_found = df1_id in row['df2_ID1'] or df1_id in row['df2_ID2']
    return value_found and id_found

df['Is_Match'] = df.apply(is_match, axis=1)

df[df['Is_Match']].drop(['join_col', 'Is_Match'], axis=1)
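To get the layout of the expected result, where df1 rows without a match are kept and their df2 columns are blank, the matched pairs can be merged back onto df1 with a left join. The following is only a sketch under assumptions: it uses `how="cross"` (available in pandas 1.2 and later) in place of the `join_col` helper, it splits the ID columns on `;` instead of using a substring test, and the names `pairs`, `matched`, `df3` and the extra row `JKL-004` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "df1_ID": ["ABC-001", "DEF-002", "GHI-003", "JKL-004"],  # JKL-004: made-up unmatched row
    "Col1_df1": ["a.102_103i", "a.36-89E", "ab.23<<X", "zz.1"],
    "Col2_df1": ["k159*", "k188", "e542m", "q000"],
})
df2 = pd.DataFrame({
    "df2_ID1": ["", "DEF-002;GHI-003", "ABC-001;DEF-002"],
    "df2_ID2": ["ABC-001", "", ""],
    "To_Check": [
        "FIRSTLINE:a.102_103i:ANYTHING:EXTRA",
        "SECONDLINE:ab.23<<X:ANYTHING:EXTRA",
        "THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA",
    ],
})

def is_match(row):
    # IDs may be ";"-separated lists, so compare against the individual IDs
    ids = row["df2_ID1"].split(";") + row["df2_ID2"].split(";")
    value_found = (row["Col1_df1"] in row["To_Check"]
                   or row["Col2_df1"] in row["To_Check"])
    return value_found and row["df1_ID"] in ids

# Pair every df1 row with every df2 row, then keep only the matching pairs
pairs = df1.merge(df2, how="cross")
matched = pairs[pairs.apply(is_match, axis=1)]

# Left join keeps every df1 row; rows without a match get empty strings
df3 = df1.merge(matched, on=list(df1.columns), how="left").fillna("")
print(df3)
```

If a df1 row matches more than one df2 row, it appears once per match in `df3`; whether that is desirable depends on the data.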

