
Is there a way to speed up matching addresses and level of confidence per match between two data frames for large datasets?

I have a script below that checks a column of addresses in my dataframe against a column of addresses in another dataframe, to see whether they match and how well they match.

I am using RapidFuzz, which I heard is faster than fuzzywuzzy. However, it is still taking a very long time to do the matching and calculations. Here are the CSV files: main_dataset.csv contains about 3 million records, and reference_dataset.csv contains about 10 records.

Below is the time it took for each record.

start time: Thu Oct  6 10:51:18 2022
end time: Thu Oct  6 10:51:23 2022
start time: Thu Oct  6 10:51:23 2022
end time: Thu Oct  6 10:51:28 2022
start time: Thu Oct  6 10:51:28 2022
end time: Thu Oct  6 10:51:32 2022
start time: Thu Oct  6 10:51:32 2022
end time: Thu Oct  6 10:51:36 2022
start time: Thu Oct  6 10:51:36 2022
end time: Thu Oct  6 10:51:41 2022
start time: Thu Oct  6 10:51:41 2022
end time: Thu Oct  6 10:51:45 2022
start time: Thu Oct  6 10:51:45 2022
end time: Thu Oct  6 10:51:50 2022
start time: Thu Oct  6 10:51:50 2022
end time: Thu Oct  6 10:51:54 2022
start time: Thu Oct  6 10:51:54 2022
end time: Thu Oct  6 10:51:59 2022

My script is here:

import pandas as pd
from rapidfuzz import process, fuzz
import time
from dask import dataframe as dd

ref_df = pd.read_csv('reference_dataset.csv')
df = dd.read_csv('main_dataset.csv', low_memory=False)

contacts_addresses = list(df.address)
ref_addresses = list(ref_df.ref_address.unique())

def scoringMatches(x, s):
    # Note: this helper is defined but never used below.
    o = process.extract(x, s, score_cutoff=60)
    if o is not None:
        return o[1]

def match_addresses(add, contacts_addresses, min_score=0):
    response = process.extract(add, contacts_addresses, scorer=fuzz.token_sort_ratio)
    return response


def get_highest_score(scores):
    total_scores = []
    for val in scores:
        total_scores.append(val[1])
    max_value = max(total_scores)
    max_index = total_scores.index(max_value)
    return scores[max_index]


scores_list = []
names = []
for x in ref_addresses:
    # start = time.time()
    # print("start time:", time.ctime(start))
    scores = match_addresses(x, contacts_addresses, 75)
    match = get_highest_score(scores)
    name = (str(x), str(match[0]))
    names.append(name)
    score = int(match[1])
    scores_list.append(score)
    # end = time.time()
    # print("end time:", time.ctime(end))
name_dict = dict(names)

match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
scores_df = pd.DataFrame(scores_list)

merged_results_01 = pd.concat([match_df, scores_df], axis=1)

merged_results_02 = pd.merge(ref_df, merged_results_01, how='right', on='ref_address')
merged_results_02.to_csv('results.csv')

Right now it is recommended to use process.cdist, which compares two lists of strings and returns a similarity matrix, instead of process.extract / process.extractOne, since a lot of the newer performance improvements have so far only been added to that function.

Namely, those improvements are:

  1. support for multithreading using the workers argument
  2. support for comparing multiple short sequences (<= 64 characters) in parallel using SIMD on x64.

Both of these improvements will be added to process.extract and process.extractOne at some point, but at this point (rapidfuzz==v2.11.1) they only exist in process.cdist.
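As a minimal sketch of what process.cdist returns (the example addresses below are made up), it takes one list of query strings and one list of choice strings and produces a score matrix, computed across all CPU cores when workers=-1:

from rapidfuzz import process, fuzz

queries = ["12 high street", "3 station road"]            # made-up example data
choices = ["12 High St", "3 Station Rd", "7 Park Lane"]   # made-up example data

# One row per query, one column per choice; scores below the cutoff are set to 0.
matrix = process.cdist(queries, choices, scorer=fuzz.token_sort_ratio,
                       score_cutoff=75, workers=-1)
print(matrix.shape)           # (2, 3)
print(matrix.argmax(axis=1))  # index of the best choice for each query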

There are a couple of relevant issues tracking future improvements on this front.

This could, for example, be implemented in the following way:

from itertools import islice

# Reuses ref_addresses and contacts_addresses from the script above.
names = []
scores_list = []

chunk_size = 100
ref_addr_iter = iter(ref_addresses)
# Process the reference addresses in chunks so the score matrix stays small.
while ref_addr_chunk := list(islice(ref_addr_iter, chunk_size)):
    # cdist returns a len(ref_addr_chunk) x len(contacts_addresses) score matrix,
    # computed in parallel across all CPU cores (workers=-1).
    scores = process.cdist(ref_addr_chunk, contacts_addresses,
                           scorer=fuzz.token_sort_ratio, score_cutoff=75, workers=-1)
    # Index of the best-matching contact address for each reference address.
    max_scores_idx = scores.argmax(axis=1)
    for ref_addr_idx, score_idx in enumerate(max_scores_idx):
        names.append((ref_addr_chunk[ref_addr_idx], contacts_addresses[score_idx]))
        scores_list.append(scores[ref_addr_idx, score_idx])
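The names and scores_list collected above can then be turned into the same results.csv the original script produces. A sketch, assuming pd, ref_df and the column names from the question's script:

# Hypothetical continuation, reusing ref_df and the question's column names.
match_df = pd.DataFrame(names, columns=['ref_address', 'matched_address'])
match_df['score'] = scores_list
merged_results = pd.merge(ref_df, match_df, how='right', on='ref_address')
merged_results.to_csv('results.csv', index=False)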
