Partial word match between two columns of different pandas dataframes

I have two dataframes, df1 (columns termID and term) and df2 (columns textID and text); the code below builds both.

I am trying to match any term against the text.

My code:

import pandas as pd

# Data
data1 = {'termID': [1, 55, 341, 41, 5685],
         'term': ['Cardic Arrest', 'Headache', 'Chest Pain', 'Muscle Pain', 'Knee Pain']}
data2 = {'textID': [25, 12, 52, 35],
         'text': ['Hello Mike, Good Morning!!',
                  'Oops!! My Knee pains!!',
                  'Stop Music!! my head pains',
                  'Arrest Innocent!!']}

# Dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Matching logic: keep a (text, term) pair only when the whole
# lowercased term occurs as a substring of the lowercased text
matchList = []
for index_b, row_b in df2.iterrows():
    for index_a, row_a in df1.iterrows():
        if row_a.term.lower() in row_b.text.lower():
            matchList.append([row_b.textID, row_b.text, row_a.term, row_a.termID])

cols = ['textID', 'text', 'term', 'termID']
d = pd.DataFrame(matchList, columns=cols)
print(d)

Which gave me only a single row as output (the match for 'Knee Pain').


I have two issues to fix:

  1. I am not sure how to get output for any partial match between a term and a text.

  2. DF1 and DF2 are large: around 0.4M and 13M records, respectively.

What is the best way to fix these two issues?

I have a quick fix for problem 1, but not an optimisation. You only get one match because "Knee Pain" is the only term that appears in full in one of the df2 texts. I've modified the if statement to split each text from df2 into words and check whether any of those words occurs in the term. I agree with @jakub that there are libraries that will do this quicker.

# Matching logic: keep a (text, term) pair when any whitespace-separated
# word of the lowercased text is a substring of the lowercased term
matchList = []
for index_b, row_b in df2.iterrows():
    for index_a, row_a in df1.iterrows():
        if any(word in row_a.term.lower() for word in row_b.text.lower().split()):
            matchList.append([row_b.textID, row_b.text, row_a.term, row_a.termID])

cols = ['textID', 'text', 'term', 'termID']
d = pd.DataFrame(matchList, columns=cols)
print(d)

Output:

   textID                        text           term  termID
0      12      Oops!! My Knee pains!!      Knee Pain    5685
1      52  Stop Music!! my head pains       Headache      55
2      35           Arrest Innocent!!  Cardic Arrest       1
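
For problem 2, the nested iterrows loops compare every text with every term, which is unlikely to be practical at 0.4M x 13M rows. One possible direction, sketched below, is to build a lookup table from the terms once and then scan each text a single time. Note that this swaps in a different matching rule from the code above: a text word counts as a partial match when it is a prefix of a term word, rather than any substring of the term. That rule, the MIN_LEN cutoff, and the names tokenize and prefix_index are assumptions made for illustration, not something from the question. The cutoff is there because, with a plain substring test, a short word such as "in" would match every term containing "Pain".

```python
import re
from collections import defaultdict

import pandas as pd

df1 = pd.DataFrame({
    'termID': [1, 55, 341, 41, 5685],
    'term': ['Cardic Arrest', 'Headache', 'Chest Pain', 'Muscle Pain', 'Knee Pain'],
})
df2 = pd.DataFrame({
    'textID': [25, 12, 52, 35],
    'text': ['Hello Mike, Good Morning!!', 'Oops!! My Knee pains!!',
             'Stop Music!! my head pains', 'Arrest Innocent!!'],
})

MIN_LEN = 3  # assumed cutoff: words shorter than this (e.g. "my") never match
WORD_RE = re.compile(r'[a-z0-9]+')


def tokenize(s):
    """Lowercase a string and return its alphanumeric words (punctuation dropped)."""
    return WORD_RE.findall(s.lower())


# Built once: map every prefix (length >= MIN_LEN) of every term word
# to the row positions in df1 of the terms that contain that word.
prefix_index = defaultdict(set)
for pos, term in enumerate(df1['term']):
    for word in tokenize(term):
        for end in range(MIN_LEN, len(word) + 1):
            prefix_index[word[:end]].add(pos)

# Single pass over the texts: each word costs one dictionary lookup.
rows = []
for text_id, text in zip(df2['textID'], df2['text']):
    hits = set()
    for token in tokenize(text):
        hits |= prefix_index.get(token, set())
    for pos in sorted(hits):
        rows.append((text_id, text, df1['term'].iat[pos], df1['termID'].iat[pos]))

matches = pd.DataFrame(rows, columns=['textID', 'text', 'term', 'termID'])
print(matches)
```

On the sample data this yields the same three pairs as the loop above (textID 12 with Knee Pain, 52 with Headache, 35 with Cardic Arrest). On the full data the index trades memory for speed, so it would need to be measured; the texts could also be processed in chunks, since each text is handled independently of the others.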
