Is there a way to speed up matching addresses and level of confidence per match between two data frames for large datasets?
I have a script below that checks a column of addresses in one of my dataframes against a column of addresses in another dataframe, to see whether they match and how closely they match.
I am using rapidfuzz, which I've heard is faster than fuzzywuzzy, but the matching and scoring still takes a very long time. These are the CSV files: main_dataset.csv contains about 3 million records and reference_dataset.csv contains about 10 records.
Here is the time taken for each record:
start time: Thu Oct 6 10:51:18 2022
end time: Thu Oct 6 10:51:23 2022
start time: Thu Oct 6 10:51:23 2022
end time: Thu Oct 6 10:51:28 2022
start time: Thu Oct 6 10:51:28 2022
end time: Thu Oct 6 10:51:32 2022
start time: Thu Oct 6 10:51:32 2022
end time: Thu Oct 6 10:51:36 2022
start time: Thu Oct 6 10:51:36 2022
end time: Thu Oct 6 10:51:41 2022
start time: Thu Oct 6 10:51:41 2022
end time: Thu Oct 6 10:51:45 2022
start time: Thu Oct 6 10:51:45 2022
end time: Thu Oct 6 10:51:50 2022
start time: Thu Oct 6 10:51:50 2022
end time: Thu Oct 6 10:51:54 2022
start time: Thu Oct 6 10:51:54 2022
end time: Thu Oct 6 10:51:59 2022
Here is my script:
import pandas as pd
from rapidfuzz import process, fuzz
import time
from dask import dataframe as dd

ref_df = pd.read_csv('reference_dataset.csv')
df = dd.read_csv('main_dataset.csv', low_memory=False)

contacts_addresses = list(df.address)
ref_addresses = list(ref_df.ref_address.unique())

def scoringMatches(x, s):
    o = process.extract(x, s, score_cutoff=60)
    if o != None:
        return o[1]

def match_addresses(add, contacts_addresses, min_score=0):
    response = process.extract(add, contacts_addresses, scorer=fuzz.token_sort_ratio)
    return response

def get_highest_score(scores):
    total_scores = []
    for val in scores:
        total_scores.append(val[1])
    max_value = max(total_scores)
    max_index = total_scores.index(max_value)
    return scores[max_index]

scores_list = []
names = []

for x in ref_addresses:
    # start = time.time()
    # print("start time:", time.ctime(start))
    scores = match_addresses(x, contacts_addresses, 75)
    match = get_highest_score(scores)
    name = (str(x), str(match[0]))
    names.append(name)
    score = int(match[1])
    scores_list.append(score)
    # end = time.time()
    # print("end time:", time.ctime(end))

name_dict = dict(names)
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
scores_df = pd.DataFrame(scores_list)

merged_results_01 = pd.concat([match_df, scores_df], axis=1)
merged_results_02 = pd.merge(ref_df, merged_results_01, how='right', on='ref_address')
merged_results_02.to_csv('results.csv')
It is currently recommended to use process.cdist, which compares two sequences and returns a similarity matrix, rather than process.extract / process.extractOne, because many of the recent performance improvements have so far only been added to that algorithm, namely multithreading support via the workers argument. These improvements will be added to process.extract and process.extractOne at some point, but as of rapidfuzz==v2.11.1 they only exist in process.cdist.
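To make the similarity matrix concrete, here is a tiny standalone sketch (my own illustration, not part of the original answer; the addresses are made up):

from rapidfuzz import process, fuzz

# Toy data, only to show the shape of the result.
queries = ["12 Main Street", "45 Oak Avenue"]
choices = ["12 Main St", "99 Elm Road", "45 Oak Ave"]

# cdist returns a len(queries) x len(choices) NumPy array of scores;
# workers=-1 spreads the comparisons over all CPU cores.
matrix = process.cdist(queries, choices, scorer=fuzz.token_sort_ratio, workers=-1)
print(matrix.shape)  # (2, 3); matrix[i, j] is the score of queries[i] vs choices[j]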
A couple of open issues track these planned improvements.
For example, this could be implemented in the following way:
from itertools import islice

# Process the reference addresses in chunks of 100; ref_addresses,
# contacts_addresses, names and scores_list are the same variables as in the
# question's script.
chunk_size = 100
ref_addr_iter = iter(ref_addresses)
while ref_addr_chunk := list(islice(ref_addr_iter, chunk_size)):
    # cdist returns a len(ref_addr_chunk) x len(contacts_addresses) score matrix,
    # computed in parallel across all CPU cores (workers=-1).
    scores = process.cdist(ref_addr_chunk, contacts_addresses,
                           scorer=fuzz.token_sort_ratio, score_cutoff=75, workers=-1)
    # For each reference address, keep the best-scoring contact address.
    max_scores_idx = scores.argmax(axis=1)
    for ref_addr_idx, score_idx in enumerate(max_scores_idx):
        names.append((ref_addr_chunk[ref_addr_idx], contacts_addresses[score_idx]))
        scores_list.append(scores[ref_addr_idx, score_idx])
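The names and scores_list collected by this loop can then be turned into the final results file in the same way as in the question's script. A minimal continuation sketch (my addition, assuming pandas and ref_df are loaded as in the question):

# Build the output frame from the (ref_address, matched_address) pairs and
# scores gathered above, mirroring the question's post-processing.
match_df = pd.DataFrame(names, columns=['ref_address', 'matched_address'])
match_df['score'] = scores_list
merged_results = pd.merge(ref_df, match_df, how='right', on='ref_address')
merged_results.to_csv('results.csv', index=False)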