What is the fastest method in Python of searching for the highest percent Levenshtein distance between a string and a list of strings?

Question

I'm writing a program that compares a smaller list of game titles against a master list of many games, to see which titles in the smaller list match titles in the master list more closely than others. To do this, I compute the Levenshtein distance in percent form (that is, a similarity score) between each game in the smaller list and every game in the master list, and take the maximum of these values (the lower the maximum percentage, the more unique the game is). I have tried this with both the difflib and the fuzzywuzzy modules. The problem is that a typical search using either process.extractOne() or difflib.get_close_matches() takes 5+ seconds per game (with 38000+ strings in the master list), and I have about 4500 games to search through (5 s * 4500 = 22500 s, or about 6 hours and 15 minutes, which I don't have time for).

Hoping to find a better and faster way to search through a list of strings, I'm asking here: what is the fastest method in Python of finding the highest percent Levenshtein distance between a string and a list of strings? If there is no better way than using the two functions above or writing some other looping code, then please say so.

Specifically, these are the two calls I used to find the highest score:

import difflib
from fuzzywuzzy import fuzz, process
metric = process.extractOne(name, master_names)[1] / 100
metric = fuzz.ratio(name, difflib.get_close_matches(name, master_names, 1, 0)[0]) / 100

Answer 1

Through experimentation and further research, I found that the fastest way to compute the Levenshtein ratio is to use the python-Levenshtein library directly. The function Levenshtein.ratio() is significantly faster than any function in fuzzywuzzy or difflib (for one game, the entire search takes only 0.05 seconds on average), likely because of its simplicity and C implementation. I call it in a for loop over every name in the master list and keep the best score:

from Levenshtein import ratio

metric = 0  # best ratio seen so far
for master_name in master_names:
    new_metric = ratio(name, master_name)
    if new_metric > metric:
        metric = new_metric

In conclusion, the fastest method of finding the highest percent Levenshtein distance between a string and a list of strings is to iterate over the list, use Levenshtein.ratio() to score each string against the query string, and keep the highest ratio seen.
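
The loop above can also be written with max() and wrapped so it scores every name in the smaller list. What follows is only a sketch under stated assumptions: best_ratio, score_all and indel_ratio are hypothetical helper names, not part of any library, and indel_ratio is a slow pure-Python stand-in included only so the example runs without third-party packages. It assumes a ratio of the form 2 * LCS / (len(a) + len(b)), which may not match Levenshtein.ratio() exactly in every version.

```python
from typing import Callable, Dict, Iterable


def indel_ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: 2 * LCS / total length)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    # Longest common subsequence length, keeping one DP row at a time.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return 2 * prev[-1] / total


def best_ratio(
    name: str,
    master_names: Iterable[str],
    scorer: Callable[[str, str], float] = indel_ratio,
) -> float:
    """Highest score of `name` against any master name (0.0 if there are none)."""
    return max((scorer(name, m) for m in master_names), default=0.0)


def score_all(
    names: Iterable[str],
    master_names: Iterable[str],
    scorer: Callable[[str, str], float] = indel_ratio,
) -> Dict[str, float]:
    """Map each name in the smaller list to its best score in the master list."""
    master = list(master_names)  # materialise once so it can be re-scanned
    return {name: best_ratio(name, master, scorer) for name in names}


if __name__ == "__main__":
    print(score_all(["halo", "zork"], ["halo 2", "doom"]))
```

If python-Levenshtein is installed, passing scorer=ratio (imported from Levenshtein as in the answer) in place of the stand-in should keep the speed of the C implementation; the pure-Python function is far too slow for a 38000-entry master list.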

Answer 1
3 | Accepted | 2020-03-01 18:55:17
