
What is the fastest method in Python of searching for the highest percent Levenshtein distance between a string and a list of strings?

I'm writing a program that compares a smaller list of game titles against a master list of many games, to see which titles in the smaller list match titles in the master list more closely than others. To do this, I compute the Levenshtein distance (in percent form) between each game in the smaller list and every game in the master list and take the maximum of those values; the lower that maximum, the more unique the game is. I've tried this with both the difflib and fuzzywuzzy modules. The problem is that a typical search using either process.extractOne() or difflib.get_close_matches() takes 5+ seconds per game (with 38000+ strings in the master list), and I have about 4500 games to search through. At 5 seconds each, 4500 games comes to about 6 hours and 15 minutes, which I don't have time for.
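To make "Levenshtein distance in percent form" concrete, here is a minimal pure-Python sketch. The function names and the normalization by the longer string's length are assumptions chosen for illustration; fuzzywuzzy and python-Levenshtein each use their own normalization, so their numbers will generally differ from this one.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (each edit costs 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

This version is far too slow for 4500 x 38000 comparisons; it only shows what the score means.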

Hoping to find a better and faster way to search a list of strings, I'm asking: what is the fastest method in Python of finding the highest percent Levenshtein distance between a string and a list of strings? If there is no better way than using the two functions above or writing some other looping code, please say so.

The two calls I used to find the highest value are these:

import difflib
from fuzzywuzzy import fuzz, process
metric = process.extractOne(name, master_names)[1] / 100
metric = fuzz.ratio(name, difflib.get_close_matches(name, master_names, 1, 0)[0]) / 100
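For reference, the difflib variant can be written with the standard library alone, roughly as below. This is a sketch, not the code from the question: the helper name `best_difflib_score` is hypothetical, and it scores the match with `SequenceMatcher.ratio()` instead of `fuzz.ratio()`. Note that difflib's ratio is based on matching blocks, not on Levenshtein distance.

```python
import difflib


def best_difflib_score(name, master_names):
    """Return (closest_title, score) using only difflib; score is in [0, 1]."""
    # n=1 keeps only the top match; cutoff=0.0 disables the default 0.6 threshold.
    matches = difflib.get_close_matches(name, master_names, n=1, cutoff=0.0)
    if not matches:
        return None, 0.0  # empty master list
    closest = matches[0]
    score = difflib.SequenceMatcher(None, name, closest).ratio()
    return closest, score
```

With `cutoff=0.0`, every string in the master list is fully scored, which fits the slow timings reported above.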

Through experimentation and further research, I found that the fastest way to compute the Levenshtein ratio is to use the python-Levenshtein library directly. Levenshtein.ratio() is significantly faster than any function in fuzzywuzzy or difflib (for one game, the entire search takes only 0.05 seconds on average), likely because of its simplicity and C implementation. I used it in a for loop over every name in the master list to get the best score:

from Levenshtein import ratio

# Track the best similarity ratio seen so far (0 means nothing matched).
metric = 0
for master_name in master_names:
    new_metric = ratio(name, master_name)
    if new_metric > metric:
        metric = new_metric

In conclusion, the fastest method I found for getting the highest percent Levenshtein distance between a string and a list of strings is to iterate over the list, call Levenshtein.ratio() to score each string against the query string, and keep the highest ratio seen so far.
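The loop above handles one game; applying it to the whole smaller list could look something like the sketch below. The helper names (`stdlib_ratio`, `max_ratio`, `rank_by_uniqueness`) are hypothetical, and the default scorer is a standard-library stand-in so the example runs without extra packages. To use the approach from this answer, pass `Levenshtein.ratio` as `scorer`.

```python
from difflib import SequenceMatcher


def stdlib_ratio(a, b):
    """Stand-in scorer from the standard library; it is not a Levenshtein ratio."""
    return SequenceMatcher(None, a, b).ratio()


def max_ratio(name, master_names, scorer=stdlib_ratio):
    """Highest score between `name` and any master title (0.0 for an empty list)."""
    return max((scorer(name, m) for m in master_names), default=0.0)


def rank_by_uniqueness(names, master_names, scorer=stdlib_ratio):
    """Pair each name with its best score, most unique (lowest score) first."""
    scored = [(name, max_ratio(name, master_names, scorer)) for name in names]
    return sorted(scored, key=lambda pair: pair[1])


# With python-Levenshtein installed, the scorer can be swapped in like this:
#     from Levenshtein import ratio
#     ranking = rank_by_uniqueness(names, master_names, scorer=ratio)
```

Keeping the scorer as a parameter makes it easy to time different libraries on the same data.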
