Compare multiple columns with special characters and merge dataframes
I have a dataframe (df1):
df1_ID Col1_df1 Col2_df1 Col3_df1
ABC-001 a.102_103i k159* Test1
DEF-002 a.36-89E k188 Test2
GHI-003 ab.23<<X e542m Test3
df2:
df2_ID1          df2_ID2  Count  Count_A  Count_B  To_Check
                 ABC-001  10     0        10       FIRSTLINE:a.102_103i:ANYTHING:EXTRA
DEF-002;GHI-003           20     2        18       SECONDLINE:ab.23<<X:ANYTHING:EXTRA
ABC-001;DEF-002           15     3        12       THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA
Expected result (df3):
df1_ID Col1_df1 Col2_df1 Col3_df1 df2_ID1 df2_ID2 Count Count_A Count_B To_Check
ABC-001 a.102_103i k159* Test1 ABC-001 10 0 10 FIRSTLINE:a.102_103i:ANYTHING:EXTRA:k159*
DEF-002 a.36-89 k188 Test2 ABC-001;DEF-002 15 3 12 THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA
GHI-003 ab.23<<X e542m Test3 DEF-002;GHI-003 20 2 18 SECONDLINE:ab.23<<X:ANYTHING:EXTRA
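I want to check whether the values of Col1_df1 and Col2_df1 are present in the To_Check column of df2. If the values from Col1_df1 AND Col2_df1 are present, and df1_ID is present in df2_ID1 or df2_ID2, then merge that row of df2 into df1. If there is no match, the df2 columns should be left blank.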
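This is an extension of this question:
In that question, however, we were only dealing with plain strings. My data also contains special characters.
This syntax does not seem to work either when trying to find the values that are present in df2: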
df1 = df1.assign(result=df1['Col1_df1'].isin(df2['To_Check']))
I also wrote another version, but it does not work either:
output = open("output.csv", "a")
with open("df1.csv", "r") as df1:
    first_line = df2.readline()
    output.write(first_line)
    with open("df2.csv", "r") as df2:
        second_first = df2.readline()
        output.write(second_first)
        for line_df1 in df1:
            df1_names = [x for x in line_df1.split(',')]
            for line_df2 in df2:
                df2_names = [x for x in line_df2.split(',')]
                check1 = any(df1_names[1] in string for string in df2_names[6])
                print(check1)
Even though the value is present, check1 is always False.
Thanks in advance for your help.
Update:
data_1={'df1_ID':['ABC-001','DEF-002','GHI-003']
,'Col1_df1':['a.102_103i','a.36-89E','ab.23<<X']
,'Col2_df1':['k159*','k188','e542m']
,'Col3_df1':['Test1','Test2','Test3']}
data_2={'df2_ID1':['','DEF-002;GHI-003','ABC-001;DEF-002']
,'df2_ID2':['ABC-001','','']
,'Count':['10','20','15']
,'Count_A':['0','2','3']
,'To_Check':['FIRSTLINE:a.102_103i:ANYTHING:EXTRA','SECONDLINE:ab.23<<X:ANYTHING:EXTRA','THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA']}
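As a side note on the two attempts in the question, both appear to fail for reasons that have little to do with the special characters themselves. `isin` compares whole cell values, so a short value never equals a long To_Check string, and iterating over a single string (as `df2_names[6]` would be) yields one character at a time. The sketch below illustrates both points on the sample values and contrasts them with a literal substring test; the variable names are made up for this illustration, and the CSV layout of the original files is not known.

```python
import pandas as pd

col1 = pd.Series(['a.102_103i', 'a.36-89E', 'ab.23<<X'])
to_check = pd.Series([
    'FIRSTLINE:a.102_103i:ANYTHING:EXTRA',
    'SECONDLINE:ab.23<<X:ANYTHING:EXTRA',
    'THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA',
])

# isin() tests equality against whole cell values, so nothing matches here.
whole_value_match = col1.isin(to_check)

# Literal substring test per value; regex=False keeps '.', '*' and '<' literal.
substring_match = col1.apply(
    lambda v: to_check.str.contains(v, regex=False).any()
)

# Iterating over a string yields single characters, so a multi-character
# value is never "in" any of them and any(...) stays False.
field = to_check[0]
char_level = any('a.102_103i' in ch for ch in field)

print(whole_value_match.tolist())  # [False, False, False]
print(substring_match.tolist())    # [True, False, True]
print(char_level)                  # False
```

The second value, a.36-89E, is not found because To_Check only contains a.36-89D.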
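Some clarification is needed on which conditions mean that a row should be merged from df2 into df1, and on what the merged result should look like.
This snippet does what I think you are looking for as far as the conditions go, but it only adds a column to df1 that records which column of df1 matched To_Check in some row of df2, together with the index of that df2 row.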
import pandas as pd

df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
check = []
for df1_ind in df1.index:
    found = ""
    for df2_ind in df2.index:
        # plain substring tests, so special characters are treated literally
        col1_check = df1["Col1_df1"][df1_ind] in df2["To_Check"][df2_ind]
        col2_check = df1["Col2_df1"][df1_ind] in df2["To_Check"][df2_ind]
        # computed but not used below: exact comparison against the two ID cells
        df1_id_present = df1["df1_ID"][df1_ind] in [df2['df2_ID1'][df2_ind], df2['df2_ID2'][df2_ind]]
        # wasn't sure if that first conditional meant the df1_id being present affected the column checks
        if col1_check:
            found += f"Col_1_present(df2_ID={df2_ind})::"
        if col2_check:
            found += f"Col_2_present(df2_ID={df2_ind})::"
        if found != "":
            # this means the cols from df1 we are looking for in df2 were found at some row.
            # leave the inner for loop and save these results
            # unless you expect the row contents to appear in multiple rows of df2
            break
    if found == "":
        found = "false"
    check.append(found)
df1['Checks'] = check
print(df1.head())
Output:
df1_ID Col1_df1 Col2_df1 Col3_df1 Checks
0 ABC-001 a.102_103i k159* Test1 Col_1_present:(df2_ID=0)::
1 DEF-002 a.36-89E k188 Test2 Col_2_present:
2 GHI-003 ab.23<<X e542m Test3 Col_1_present:
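One possible refinement, sketched here under assumptions that the question does not confirm: if To_Check is always a ':'-separated list and the ID columns are ';'-separated lists, the values can be compared token by token instead of as substrings. That avoids a short value matching inside a longer one, and it folds the ID condition into the check. The helper name `find_match` is hypothetical, and the rule used (Col1 or Col2 found, and df1_ID found in either ID column) is only one reading of the requirement.

```python
import pandas as pd

df1 = pd.DataFrame({
    'df1_ID': ['ABC-001', 'DEF-002', 'GHI-003'],
    'Col1_df1': ['a.102_103i', 'a.36-89E', 'ab.23<<X'],
    'Col2_df1': ['k159*', 'k188', 'e542m'],
})
df2 = pd.DataFrame({
    'df2_ID1': ['', 'DEF-002;GHI-003', 'ABC-001;DEF-002'],
    'df2_ID2': ['ABC-001', '', ''],
    'To_Check': [
        'FIRSTLINE:a.102_103i:ANYTHING:EXTRA',
        'SECONDLINE:ab.23<<X:ANYTHING:EXTRA',
        'THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA',
    ],
})


def find_match(row1, df2):
    """Return the index label of the first df2 row matching row1, or None."""
    for idx, row2 in df2.iterrows():
        # Assumed separators: ':' inside To_Check, ';' inside the ID columns.
        tokens = set(row2['To_Check'].split(':'))
        ids = set(row2['df2_ID1'].split(';')) | set(row2['df2_ID2'].split(';'))
        value_found = row1['Col1_df1'] in tokens or row1['Col2_df1'] in tokens
        if value_found and row1['df1_ID'] in ids:
            return idx
    return None


matched = [find_match(row, df2) for _, row in df1.iterrows()]
print(matched)  # [0, 2, 1]
```

Set membership is an exact comparison, so characters such as `*`, `.` and `<` need no escaping here.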
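The code below uses join & apply to solve the problem above. It uses your dataframes, and the new code starts after the comment line # Data Manipulation. The filter condition used in apply is shown below; you can easily change it to suit other needs. ((Col1_df1 in To_Check) or (Col2_df1 in To_Check)) and ((df1_ID in df2_ID1) or (df1_ID in df2_ID2))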
import pandas as pd
data_1=pd.DataFrame({'df1_ID':['ABC-001','DEF-002','GHI-003']
,'Col1_df1':['a.102_103i','a.36-89E','ab.23<<X']
,'Col2_df1':['k159*','k188','e542m']
,'Col3_df1':['Test1','Test2','Test3']})
data_2=pd.DataFrame({'df2_ID1':['','DEF-002;GHI-003','ABC-001;DEF-002']
,'df2_ID2':['ABC-001','','']
,'Count':['10','20','15']
,'Count_A':['0','2','3']
,'To_Check':['FIRSTLINE:a.102_103i:ANYTHING:EXTRA','SECONDLINE:ab.23<<X:ANYTHING:EXTRA','THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA']})
# Data Manipulation
# a constant key on both sides turns the merge into a cross join
data_1['join_col'] = 1
data_2['join_col'] = 1
df = pd.merge(data_1, data_2, on='join_col')
# access fields by label; integer positions on a labelled row are deprecated in recent pandas
df['Is_Match'] = df.apply(
    lambda x: ((x['Col1_df1'].strip() in x['To_Check']) or (x['Col2_df1'].strip() in x['To_Check']))
    and ((x['df1_ID'].strip() in x['df2_ID1']) or (x['df1_ID'].strip() in x['df2_ID2'])),
    axis=1)
df[df['Is_Match']].drop(['join_col', 'Is_Match'], axis=1)
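The filtered frame above keeps only the matching pairs. If the df1 rows without a match should also stay in the result, with blank df2 columns as the question asks, one way to sketch it is a left merge of df1 against the matched pairs. This is an illustration under the same filter condition, not a definitive implementation: the helper `is_match`, the temporary key `_k`, and the fourth df1 row (JKL-004) are hypothetical and were added only to show the blank case.

```python
import pandas as pd

df1 = pd.DataFrame({
    # The last row is hypothetical (not in the original data); it has no match.
    'df1_ID': ['ABC-001', 'DEF-002', 'GHI-003', 'JKL-004'],
    'Col1_df1': ['a.102_103i', 'a.36-89E', 'ab.23<<X', 'zz.1'],
    'Col2_df1': ['k159*', 'k188', 'e542m', 'q000'],
    'Col3_df1': ['Test1', 'Test2', 'Test3', 'Test4'],
})
df2 = pd.DataFrame({
    'df2_ID1': ['', 'DEF-002;GHI-003', 'ABC-001;DEF-002'],
    'df2_ID2': ['ABC-001', '', ''],
    'Count': ['10', '20', '15'],
    'Count_A': ['0', '2', '3'],
    'To_Check': [
        'FIRSTLINE:a.102_103i:ANYTHING:EXTRA',
        'SECONDLINE:ab.23<<X:ANYTHING:EXTRA',
        'THIRDLINE:a.105:a.36-89D:ANYTHING:k188:EXTRA',
    ],
})


def is_match(r):
    """Same condition as the filter above, written with plain substring tests."""
    value_found = r['Col1_df1'] in r['To_Check'] or r['Col2_df1'] in r['To_Check']
    id_found = r['df1_ID'] in r['df2_ID1'] or r['df1_ID'] in r['df2_ID2']
    return value_found and id_found


# Cross join through a temporary constant key, without modifying df1 or df2.
pairs = df1.assign(_k=1).merge(df2.assign(_k=1), on='_k').drop(columns='_k')
matches = pairs[pairs.apply(is_match, axis=1)]

# Left merge keeps every df1 row; rows without a match get empty strings.
df3 = df1.merge(matches, on=list(df1.columns), how='left').fillna('')
print(df3[['df1_ID', 'To_Check']])
```

Note that a df1 row matching several df2 rows would appear once per match in df3; whether that is desired depends on the data.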