Matching 2 large CSV files by fuzzy string matching in Python

Question

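I am trying to match approximately 600,000 individuals' names (full names) against another database that has over 87 million observations (full names)!

My first attempt with the fuzzywuzzy library was far too slow, so I decided to use the fuzzyset module, which is much faster. Assuming I have a computer powerful enough to load the whole dataset into memory, I am doing the following with a test file of 964 observations to be matched against 50,000 observations: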

import time
import pandas as pd
from cfuzzyset import cFuzzySet as FuzzySet

df1 = pd.read_csv(file1, delimiter='|')  # test file with 964 observations
df2 = pd.read_csv(file2, delimiter='|')  # test file with 50,000 observations to be matched against

a = FuzzySet()  # allocate the FuzzySet object
for row in df2['name']:
    a.add(str(row))  # fill the FuzzySet object with all names from file2

start_time = time.time()  # start recording the time

dicto = {'index': [], 'name': []}  # dictionary where I store the output

# a.get() returns a list of (score, matched name) pairs, best match first
for names in df1['f_ofulln']:
    dicto['index'].append(a.get(names)[0][0])
    dicto['name'].append(a.get(names)[0][1])

print("--- %s seconds ---" % (time.time() - start_time))

>>> --- 39.68284249305725 seconds ---

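For the smaller datasets (964 observations matched against 50,000 observations), the run time was 39 seconds.

However, this is far too slow if I want to run this method on the full dataset.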
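
To put a rough number on "too slow", here is a back-of-the-envelope sketch. It assumes that run time grows in proportion to the number of (query, candidate) pairs, which is a crude simplification and not a measured property of fuzzyset, so the result should be read as an order of magnitude only.

```python
# Rough extrapolation from the test run to the full problem.
# Assumption: time scales linearly with the number of (query, candidate) pairs.
test_queries, test_candidates = 964, 50_000
full_queries, full_candidates = 600_000, 87_000_000
test_seconds = 39.68  # measured on the test files above

scale = (full_queries * full_candidates) / (test_queries * test_candidates)
estimated_seconds = test_seconds * scale
estimated_days = estimated_seconds / 86_400  # seconds per day

print(f"scale factor: {scale:,.0f}")
print(f"estimated run time: {estimated_days:,.0f} days")
```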

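Does anyone have an idea of how to improve the run time? I think Cython is not an option, since I am already importing the Cython version of the fuzzyset module.

Many thanks,

Adrien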

Answer 1

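So I am going to answer my own question, since I found a way that is pretty fast.

I saved both databases in HDF5 format using pandas.HDFStore and DataFrame.to_hdf, with one dataframe for each first letter of the last name. Then I wrote a function that finds the closest match, based on the python-Levenshtein module (very fast, since it is written in C).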
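
The storage step could look roughly like the sketch below. It is not my actual code: the helper names `split_by_initial` and `save_blocks`, the `block_<letter>` key scheme, and the rule that the last name is the last whitespace-separated token are all assumptions made for illustration. Writing HDF5 from pandas also requires the PyTables package.

```python
import pandas as pd


def split_by_initial(df, name_col):
    """Return {initial: sub-dataframe}, keyed by the first letter of the last name.

    Assumption: the last name is the last whitespace-separated token of name_col.
    """
    last_names = df[name_col].astype(str).str.strip().str.split().str[-1]
    initials = last_names.str[0].str.upper()
    return {initial: block for initial, block in df.groupby(initials)}


def save_blocks(blocks, path):
    """Write each block under its own key in a single HDF5 file (needs PyTables)."""
    with pd.HDFStore(path, mode="w") as store:
        for initial, block in blocks.items():
            store.put(f"block_{initial}", block)


people = pd.DataFrame({"name": ["John Smith", "Ann Stone", "Bob Adams"]})
blocks = split_by_initial(people, "name")
print(sorted(blocks))
```

A single block can then be loaded on its own with `pd.read_hdf(path, key="block_S")`, so a job only needs to read the letter it is responsible for.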

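Finally, I sent 26 batch jobs at once, one for each first letter of the last name. This means I only match people whose last names start with the same letter.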
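
For readers without a batch system, the same idea can be sketched on one machine with a process pool. This is a swapped-in technique, not what I ran: `ProcessPoolExecutor` replaces the batch jobs, and `difflib.SequenceMatcher` from the standard library stands in for python-Levenshtein so the sketch has no extra dependencies. All function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher


def last_initial(full_name):
    """First letter of the last whitespace-separated token (an assumption)."""
    parts = full_name.split()
    return parts[-1][0].upper() if parts else ""


def block_by_initial(names):
    """Group names by the first letter of their last name."""
    blocks = defaultdict(list)
    for name in names:
        blocks[last_initial(name)].append(name)
    return blocks


def similarity(a, b):
    # Stand-in scorer; python-Levenshtein's ratio could be used here instead.
    return SequenceMatcher(None, a, b).ratio()


def match_block(job):
    """Unit of work for one job: match each query against one block of candidates."""
    queries, candidates = job
    results = []
    for query in queries:
        best = max(candidates, key=lambda c: similarity(query, c), default=None)
        results.append((query, best))
    return results


def match_all(queries, candidates, max_workers=None):
    """Run one job per initial in parallel; names are only compared within a block."""
    query_blocks = block_by_initial(queries)
    candidate_blocks = block_by_initial(candidates)
    jobs = [(qs, candidate_blocks.get(letter, []))
            for letter, qs in query_blocks.items()]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return [pair for block in pool.map(match_block, jobs) for pair in block]
```

On platforms that start worker processes by spawning, `match_all` should be called from under an `if __name__ == "__main__":` guard. The trade-off of blocking is that a typo in the first letter of the last name puts the two records in different blocks, so they are never compared.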

Note that I also programmed the function to look for the closest match only among people whose birth year differs by no more than 1 year.
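
A minimal sketch of that birth-year restriction follows. The candidate layout (a list of `(name, birth_year)` tuples), the function name, and the `max_gap` parameter are assumptions for illustration; the scorer is passed in as `fun`, and a `difflib`-based one is used here only so the example runs without python-Levenshtein.

```python
from difflib import SequenceMatcher


def closest_match_within_year(name, birth_year, candidates, fun, max_gap=1):
    """Return (best_name, best_score) among candidates born within max_gap years.

    candidates: iterable of (name, birth_year) tuples (hypothetical layout).
    fun: similarity function returning a score between 0 and 1.
    """
    best_name, best_score = None, 0.0
    for cand_name, cand_year in candidates:
        if abs(cand_year - birth_year) > max_gap:
            continue  # birth year too far off, skip without scoring
        score = fun(name, cand_name)
        if score > best_score:
            best_name, best_score = cand_name, score
    return best_name, best_score


def difflib_ratio(a, b):
    # Stand-in scorer from the standard library.
    return SequenceMatcher(None, a, b).ratio()


people = [("John Smith", 1950), ("John Smith", 1980), ("Joan Smyth", 1951)]
result = closest_match_within_year("Jon Smith", 1951, people, difflib_ratio)
print(result)
```

Checking the year before calling `fun` also saves time, since candidates that fail the cheap integer comparison are never scored.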

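EDIT: Since it was requested, I am providing a summary of my function below. The main function that merges the two dataframes is unfortunately too long to post here.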

# Needed imports:
from Levenshtein import ratio, jaro_winkler
import pandas as pd

# Function that gets the closest match of a word in a big list:

def get_closest_match(x, list_strings, fun):
    # fun: the matching method: ratio, jaro_winkler, ... (cf. python-Levenshtein module)
    best_match = None
    highest_ratio = 0
    for current_string in list_strings.values.tolist():
        if highest_ratio == 1:
            break  # exact match already found, no need to look further
        current_score = fun(x, current_string)
        if current_score > highest_ratio:
            highest_ratio = current_score
            best_match = current_string
    return (best_match, highest_ratio)

# The function that matches 2 dataframes (only the idea behind it, since it is too long to write out in full)
dicto = {'Index': [], 'score': [], ...}
def LevRatioMerge(df1, df2, fun, colname, required=[], approx=[], eps=[]):
    # Basically loop over df1 with:
    for name in df1.itertuples():
        result = get_closest_match(name[YourColnumber], df2[YourColname], fun)
        dicto['score'].append(result[1])
        dicto['Index'].append(name[0])
        ...

That is the idea. I hope it is inspiring enough for your work.

Solution 1
3 Accepted 2016-08-13 06:04:52
