
Fuzzy string matching in Python

I have 2 lists of over a million names each, with slightly different naming conventions. The goal here is to match the records that are similar, with 95% confidence.

I am aware that there are libraries I can leverage, such as the FuzzyWuzzy module in Python.

However, in terms of processing it seems it would take too many resources to compare every string in one list against the other, which in this case appears to require 1 million multiplied by another million iterations.

Are there any other more efficient methods for this problem?

UPDATE:

So I created a bucketing function and applied a simple normalization: removing whitespace and symbols, converting the values to lowercase, etc...

from fuzzywuzzy import process
from tqdm import tqdm

for n in list(dftest['YM'].unique()):
    n = str(n)
    frame = dftest['Name'][dftest['YM'] == n]
    print(len(frame))
    print(n)
    for names in tqdm(frame):
        closest = process.extractOne(names, frame)

Using Python's pandas, the data is loaded into smaller buckets grouped by year-month, and then the FuzzyWuzzy module's process.extractOne is used to get the best match.
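An equivalent way to express that bucketing with a pandas groupby, assuming the same dftest frame with 'YM' and 'Name' columns (a sketch only; the work inside each bucket is still quadratic):

from fuzzywuzzy import process
from tqdm import tqdm

# Group once by year-month instead of re-filtering the frame on every pass.
for ym, frame in dftest.groupby('YM')['Name']:
    print(ym, len(frame))
    # Every name is still compared against all names in its own YM bucket.
    matches = {name: process.extractOne(name, frame) for name in tqdm(frame)}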

Results are still somewhat disappointing. During testing the code above is run on a test data frame containing only 5 thousand names, and it takes up almost a whole hour.

The test data is split up by:

  • Name
  • Year-Month of Date of Birth

And I am comparing them in buckets, where names whose YMs match are in the same bucket.

Could the problem be the FuzzyWuzzy module I am using? I appreciate any help.

There are several levels of optimization possible here to turn this problem from O(n^2) into a lesser time complexity.

  • Preprocessing: Sort your list in a first pass, creating an output map for each string; the key for the map can be the normalized string. Normalizations may include:

    • lowercase conversion,
    • removal of whitespace and special characters,
    • transforming unicode to ascii equivalents where possible (use unicodedata.normalize or the unidecode module).

    This would result in "Andrew H Smith", "andrew h. smith", and "ándréw h. smith" all generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique, similarly grouped names.

You can use this utility method to normalize your strings (it does not include the unicode part though):

import re

def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=[]):
    """ Processes string for similarity comparisons , cleans special characters and extra whitespaces
        if normalized is True and removes the substrings which are in ignore_list)
    Args:
        input_str (str) : input string to be processed
        normalized (bool) : if True , method removes special characters and extra whitespace from string,
                            and converts to lowercase
        ignore_list (list) : the substrings which need to be removed from the input string
    Returns:
       str : returns processed string
    """
    for ignore_str in ignore_list:
        input_str = re.sub(r'{0}'.format(ignore_str), "", input_str, flags=re.IGNORECASE)

    if normalized is True:
        input_str = input_str.strip().lower()
        #clean special chars and extra whitespace
        input_str = re.sub(r"\W", "", input_str).strip()

    return input_str
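For example, a quick doctest-style check of the key generation (assuming the function above on Python 3):

>>> process_str_for_similarity_cmp("Andrew H Smith", normalized=True)
'andrewhsmith'
>>> process_str_for_similarity_cmp("andrew h. smith", normalized=True)
'andrewhsmith'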
  • Now similar strings will already lie in the same bucket if their normalized key is the same.

  • For further comparison, you will need to compare only the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since this kind of name similarity will need fuzzy string matching on top of the normalized comparison done above.

  • Bucketing: Do you really need to compare a 5-character key with a 9-character key to see whether they are a 95% match? No, you do not. So you can create buckets for matching your strings: e.g. 5-character names are matched against 4-6 character names, 6-character names against 5-7 character names, and so on. A range of n-1 to n+1 characters for an n-character key is a reasonably good bucket for most practical matching.

  • Beginning match: Most variations of a name will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively). They will usually not differ in the first character, so you can match keys starting with a only against other keys that start with a and fall within the length buckets. This would greatly reduce your matching time. There is no need to match a key andrewhsmith against bndrewhsmith, as such a name variation in the first letter will rarely exist. A sketch of this bucketing follows the list.
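A minimal sketch of the length-and-first-character bucketing described above, assuming the process_str_for_similarity_cmp helper from earlier and an illustrative list of names:

from collections import defaultdict

names = ["Andrew H Smith", "andrew h. smith", "Andrew H. Smeeth"]  # illustrative data

buckets = defaultdict(list)
for name in names:
    key = process_str_for_similarity_cmp(name, normalized=True)
    if key:
        # Bucket by (first character, key length); neighbouring lengths stay reachable.
        buckets[(key[0], len(key))].append((key, name))

def candidate_entries_for(key):
    """Yield (other_key, other_name) entries whose keys share the first character
    and whose lengths fall within n-1 .. n+1 of the given key."""
    for length in (len(key) - 1, len(key), len(key) + 1):
        yield from buckets.get((key[0], length), [])

Only the entries yielded by candidate_entries_for need to go through the fuzzy comparison step below.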

Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage; you may drop either jaro_winkler or difflib to trade speed against result quality:

import difflib
import jellyfish

def find_string_similarity(first_str, second_str, normalized=False, ignore_list=[]):
    """ Calculates matching ratio between two strings
    Args:
        first_str (str) : First String
        second_str (str) : Second String
        normalized (bool) : if True ,method removes special characters and extra whitespace
                            from strings then calculates matching ratio
        ignore_list (list) : list has some characters which has to be substituted with "" in string
    Returns:
       Float Value : Returns a matching ratio between 1.0 ( most matching ) and 0.0 ( not matching )
                     using difflib's SequenceMatcher and jellyfish's jaro_winkler algorithms with
                    equal weightage to each
    Examples:
        >>> find_string_similarity("hello world","Hello,World!",normalized=True)
        1.0
        >>> find_string_similarity("entrepreneurship","entreprenaurship")
        0.95625
        >>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized= True,ignore_list=["the","of"])
        1.0
    """
    first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
    second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
    # Average of difflib's SequenceMatcher ratio and jellyfish's Jaro-Winkler similarity
    # (newer jellyfish versions rename jaro_winkler to jaro_winkler_similarity).
    match_ratio = (difflib.SequenceMatcher(None, first_str, second_str).ratio() + jellyfish.jaro_winkler(first_str, second_str)) / 2.0
    return match_ratio
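Putting the pieces together, a hedged sketch of the final comparison loop, reusing the illustrative buckets and candidate_entries_for helper from the bucketing sketch above and the 95% threshold from the question:

matches = []
for entries in buckets.values():
    for key, name in entries:
        for other_key, other_name in candidate_entries_for(key):
            if other_name == name:
                continue  # skip comparing a record with itself
            if find_string_similarity(key, other_key) >= 0.95:
                matches.append((name, other_name))

Note that each matching pair shows up twice, once in each direction; deduplicate if that matters for your use case.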

You have to index or normalize the strings to avoid the O(n^2) run. Basically, you have to map each string to a normal form, and build a reverse dictionary with all the words linked to their corresponding normal forms.

Let's say that the normal forms of 'world' and 'word' are the same. So, first build a reversed dictionary of Normalized -> [word1, word2, word3], e.g. going from:

"world" <-> Normalized('world')
"word"  <-> Normalized('wrd')

to:

Normalized('world') -> ["world", "word"]

There you go: all the items (lists) in the Normalized dict which have more than one value are the matched words.

The normalization algorithm depends on the data, i.e. the words. Consider one of the many (a sketch using one of them follows the list):

  • Soundex
  • Metaphone
  • Double Metaphone
  • NYSIIS
  • Caverphone
  • Cologne Phonetic
  • MRA codex
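For instance, a minimal sketch using jellyfish's Soundex as the normal form (any of the algorithms above could be swapped in; the name list is illustrative):

from collections import defaultdict
import jellyfish

names = ["Smith", "Smyth", "Andrew"]  # illustrative data

normalized = defaultdict(list)
for name in names:
    normalized[jellyfish.soundex(name)].append(name)

# Groups with more than one name are the candidate matches;
# here "Smith" and "Smyth" share the same Soundex code.
matched_groups = [group for group in normalized.values() if len(group) > 1]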

Specific to fuzzywuzzy, note that currently process.extractOne defaults to WRatio, which is by far the slowest of their algorithms, and the processor defaults to utils.full_process. If you pass in, say, fuzz.QRatio as your scorer it will go much quicker, though it is less powerful depending on what you're trying to match. It may be just fine for names though. I personally have good luck with token_set_ratio, which is at least somewhat quicker than WRatio. You can also run utils.full_process() on all your choices beforehand and then run it with fuzz.ratio as your scorer and processor=None to skip the processing step (see below). If you're just using the basic ratio function, fuzzywuzzy is probably overkill though. FWIW I have a JavaScript port (fuzzball.js) where you can pre-calculate the token sets too and use those instead of recalculating each time.

This doesn't cut down the sheer number of comparisons, but it helps. (A BK-tree for this, possibly? I've been looking into the same situation myself.)

Also be sure to have python-Levenshtein installed so you use the faster calculation.

**The behavior below may change, with open issues under discussion, etc.**

fuzz.ratio doesn't run full_process, and the token_set and token_sort functions accept a full_process=False param, but if you don't set processor=None the extract functions will try to run full_process anyway. You can use functools' partial to, say, pass in fuzz.token_set_ratio with full_process=False as your scorer, and run utils.full_process on your choices beforehand, as sketched below.
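A minimal sketch of that setup, with an illustrative query and list of choices (exact scores depend on the fuzzywuzzy version installed):

from functools import partial
from fuzzywuzzy import fuzz, process, utils

choices = ["Andrew H Smith", "Andrew H. Smeeth", "John Doe"]  # illustrative data

# Pre-process every choice once instead of on every comparison.
processed_choices = [utils.full_process(c) for c in choices]

# Scorer that skips per-call full_process; processor=None skips it in extract too.
scorer = partial(fuzz.token_set_ratio, full_process=False)

query = utils.full_process("andrew h smith")
best_match = process.extractOne(query, processed_choices, scorer=scorer, processor=None)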
