
Fuzzy Wuzzy String Matching on 2 Large Data Sets Based on a Condition - python

I have 2 large data sets that I have read into Pandas DataFrames (~20K rows and ~40K rows respectively). When I try merging these two DFs outright using pandas.merge on the address field, I get a paltry number of matches compared to the number of rows. So I thought I would try fuzzy string matching to see if it improves the number of output matches.

I approached this by trying to create a new column in DF1 (20K rows) that holds the result of applying the fuzzywuzzy extractOne function to each DF1[addressline] value against all of DF2[addressline]. I soon realized that this would take forever, since it means close to 1 billion comparisons (roughly 20K × 40K = 800 million).

Both of these datasets have "County" fields, and my question is this: is there a way to do a fuzzy string match on the "addressline" fields in both DFs only where the "county" fields are the same? While researching questions similar to mine I stumbled upon this discussion: Fuzzy logic on big datasets using Python

However, I am still fuzzy (no pun intended) on how to go about grouping/blocking the fields by county. Any advice would be greatly appreciated!

import pandas as pd
from fuzzywuzzy import fuzz, process  # fuzz supplies the scorer (fuzz.ratio) used below

def fuzzy_match(x, choices, scorer, cutoff):
  # extractOne returns None when no choice scores at or above the cutoff
  match = process.extractOne(x, choices = choices, scorer = scorer, score_cutoff= cutoff)
  return match[0] if match else None

test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'ID':['X','U','X','Y']})
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'ID':['X','U','X','Y']})
test['Address1'] = test['Address1'].str.lower()
test2['Address1'] = test2['Address1'].str.lower()
test['FuzzyAddress1'] = test['Address1'].apply(fuzzy_match, args = (test2['Address1'], fuzz.ratio, 80))

I've added 2 images that are sample sets of the 2 different DFs imported into Excel. Not all the fields have been included since they aren't important to my question. To reiterate my end goal: I want a new column in one of the DFs that holds the top result from fuzzy matching each address line against the address lines in the 2nd DF, but only against those rows where the counties match between both DFs. From there I plan to merge the two DFs on the fuzzy matched address column in the first DF and the address line column in the 2nd DF. Hopefully this doesn't sound confusing.

You could adapt your fuzzy_match function to take the ID as a variable and use it to subset your choices before doing the fuzzy search (in the sample data, ID plays the role of the county field). Note that this requires applying the function over the whole dataframe rather than just the address column:

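from fuzzywuzzy import fuzz, process

def fuzzy_match(x, choices, scorer, cutoff):
    # x is a whole row; only search the choices that share this row's ID
    match = process.extractOne(x['Address1'], 
                               choices=choices.loc[choices['ID'] == x['ID'], 
                                                   'Address1'], 
                               scorer=scorer, 
                               score_cutoff=cutoff)
    # extractOne returns None when nothing reaches the cutoff
    if match:
        return match[0]
    return None

# axis=1 passes each row (not each column) of test to fuzzy_match
test['FuzzyAddress1'] = test.apply(fuzzy_match, 
                                   args=(test2, fuzz.ratio, 80), 
                                   axis=1)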
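
For comparison, here is a rough sketch of the same blocking idea that groups the second data set by county once up front, rather than filtering the whole dataframe for every row. To keep it dependency-free it swaps in the standard library's difflib.SequenceMatcher as the scorer instead of fuzzywuzzy, and uses plain tuples instead of DataFrames. The county names, the sample rows, and the helper names (ratio, build_blocks, best_match_in_block) are all made up for illustration and are not part of either library.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity score from 0 to 100 (assumed stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def build_blocks(records):
    """Group lowercased addresses by county so each lookup scans one block only."""
    blocks = defaultdict(list)
    for county, address in records:
        blocks[county].append(address.lower())
    return blocks


def best_match_in_block(county, address, blocks, cutoff=80):
    """Return the best-scoring address from the same county, or None."""
    best, best_score = None, -1
    for candidate in blocks.get(county, []):
        score = ratio(address.lower(), candidate)
        if score >= cutoff and score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical sample rows as (county, address) pairs.
df1_rows = [("Kent", "123 Cheese Way"), ("Essex", "234 Cookie Place")]
df2_rows = [
    ("Kent", "123 chese wy"),
    ("Essex", "234 kookie Pl"),
    ("Kent", "345 Pizzza DR"),
    ("Essex", "123 cheese way"),  # exact text, but in a different county
]

blocks = build_blocks(df2_rows)
matches = {
    (county, address): best_match_in_block(county, address, blocks)
    for county, address in df1_rows
}
```

With this layout each address is compared only against the addresses in its own county, so the total number of comparisons becomes the sum over counties of (rows in DF1 for that county × rows in DF2 for that county) instead of the full 20K × 40K. The same grouping could presumably be done in pandas with groupby on the county column before calling extractOne on each group.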

