
What is a simple fuzzy string matching algorithm in Python?

I'm trying to find a good fuzzy string matching algorithm. Direct matching doesn't work for me: unless my strings are 100% similar, the match fails. The Levenshtein method doesn't work too well either, because it operates at the character level. I'm looking for something that matches at the word level, e.g.

String A: The quick brown fox.

String B: The quick brown fox jumped over the lazy dog.

These should match as all words in string A are in string B.

Now, this is an oversimplified example, but does anyone know a good fuzzy string matching algorithm that works at the word level?

I like Drew's answer.

You can use difflib to find the longest match:

>>> a = 'The quick brown fox.'
>>> b = 'The quick brown fox jumped over the lazy dog.'
>>> import difflib
>>> s = difflib.SequenceMatcher(None, a, b)
>>> s.find_longest_match(0, len(a), 0, len(b))
Match(a=0, b=0, size=19)  # returns a NamedTuple (new in Python 2.6)

Or pick some minimum matching threshold. Example:

>>> difflib.SequenceMatcher(None, a, b).ratio()
0.61538461538461542
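Since SequenceMatcher accepts any sequences of hashable items, not just strings, you can also pass it lists of words to get exactly the word-level comparison the question asks for. A small sketch (not part of the original answer):

```python
import difflib

a = 'The quick brown fox.'
b = 'The quick brown fox jumped over the lazy dog.'

# Compare lists of words instead of individual characters
s = difflib.SequenceMatcher(None, a.split(), b.split())
print(s.ratio())  # ≈ 0.4615 ('fox.' vs 'fox' counts as a mismatch)
```

Note the trailing period keeps 'fox.' from matching 'fox'; stripping punctuation before splitting would raise the score.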

Take a look at this python library, which SeatGeek open-sourced yesterday. Obviously most of these kinds of problems are very context dependent, but it might help you.

from fuzzywuzzy import fuzz

s1 = "the quick brown fox"
s2 = "the quick brown fox jumped over the lazy dog"
s3 = "the fast fox jumped over the hard-working dog"

fuzz.partial_ratio(s1, s2)
> 100

fuzz.token_set_ratio(s2, s3)
> 73

SeatGeek website and GitHub repo

If all you want to do is test whether or not all the words in one string appear in another, that's a one-liner:

if not [word for word in a.split(' ') if word not in b.split(' ')]:
    print('Match!')

If you want to score them instead of a binary test, why not just do something like:

((# of matching words) / (# of words in bigger string)) * ((# of words in smaller string) / (# of words in bigger string))

?
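That formula can be sketched in a few lines (word_overlap_score is a hypothetical name; lowercasing and treating words as a set are my own assumptions):

```python
def word_overlap_score(a, b):
    """Score two strings by shared words, per the formula above (a sketch)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    smaller, bigger = sorted([words_a, words_b], key=len)
    matching = len(words_a & words_b)
    return (matching / len(bigger)) * (len(smaller) / len(bigger))

print(word_overlap_score('The quick brown fox',
                         'The quick brown fox jumped over the lazy dog'))
# (4/8) * (4/8) = 0.25
```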

If you wanted to, you could get fancier and do fuzzy match on each string.

You can try this Python package, which does fuzzy name matching with machine learning.

pip install hmni

Initialize a Matcher Object

import hmni
matcher = hmni.Matcher(model='latin')

Single Pair Similarity

matcher.similarity('Alan', 'Al')
# 0.6838303319889133

matcher.similarity('Alan', 'Al', prob=False)
# 1

matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133

Note: I have not built this package. I'm sharing it here because it was quite useful for my use case. GitHub

You could modify the Levenshtein algorithm to compare words rather than characters. It's not a very complex algorithm and the source is available in many languages online.

Levenshtein works by comparing two arrays of chars. There is no reason that the same logic could not be applied against two arrays of strings.

I did this some time ago in C#; my previous question is here. There is a starter algorithm there for your interest; you can easily translate it to Python.

Ideas to use when writing your own algorithm:

  • Have a list with the original "titles" (the words/sentences you want to match against).
  • Each title item should have a minimal match score at the word/sentence level; if it isn't met, ignore that title.
  • You should also have a global minimal match percentage for the final result.
  • Calculate the Levenshtein distance for each word-word pair.
  • Increase the total match weight when words appear in the same order ("quick brown" vs. "quick brown" should definitely weigh more than "quick brown" vs. "brown quick").
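A rough sketch of these ideas in Python (all names are hypothetical, and the per-word fuzzy comparison uses difflib.SequenceMatcher as a stand-in for a word-word Levenshtein check):

```python
import difflib

def words_match(w1, w2, threshold=0.8):
    # Fuzzy per-word comparison (stand-in for a word-word Levenshtein check)
    return difflib.SequenceMatcher(None, w1, w2).ratio() >= threshold

def title_match_score(title, candidate, min_score=0.5):
    """Order-aware fuzzy word matching: candidate words are consumed left to
    right, so in-order matches score higher than shuffled ones."""
    t_words = title.lower().split()
    c_words = candidate.lower().split()
    matched = 0
    pos = 0  # only look forward, rewarding words that appear in order
    for w in t_words:
        for i in range(pos, len(c_words)):
            if words_match(w, c_words[i]):
                matched += 1
                pos = i + 1
                break
    score = matched / len(t_words)
    return score if score >= min_score else 0.0

print(title_match_score('quick brown', 'the quick brown fox'))  # 1.0
print(title_match_score('brown quick', 'the quick brown fox'))  # 0.5
```

The second call scores lower because "quick" is not found after "brown", which is one simple way to encode the in-order bonus from the last bullet.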

You can try the FuzzySearchEngine from https://github.com/frazenshtein/fastcd/blob/master/search.py .

This fuzzy search only supports searching for words, and each word has a fixed admissible error (only one substitution or one transposition of two adjacent characters).

Still, you can try something like:

import search

string = "Chapter I. The quick brown fox jumped over the lazy dog."
substr = "the qiuck broqn fox."

def fuzzy_search_for_sentences(substr, string):  
    start = None
    pos = 0
    for word in substr.split(" "):
        if not word:
            continue
        match = search.FuzzySearchEngine(word).search(string, pos=pos)
        if not match:
            return None
        if start is None:
            start = match.start()
        pos = match.end()
    return start

print(fuzzy_search_for_sentences(substr, string))

This prints 11, the position in the string where the fuzzy match of the sentence starts.

Levenshtein should work OK if you compare words (strings separated by sequences of stop characters) instead of individual letters.

def ld(s1, s2):  # Levenshtein distance
    len1 = len(s1) + 1
    len2 = len(s2) + 1
    # lt - Levenshtein table
    lt = [[0] * len2 for _ in range(len1)]
    lt[0] = list(range(len2))      # first row: cost of inserting i2 items
    for i1 in range(len1):         # first column: cost of deleting i1 items
        lt[i1][0] = i1
    for i1 in range(1, len1):
        for i2 in range(1, len2):
            v = 0 if s1[i1-1] == s2[i2-1] else 1
            lt[i1][i2] = min(lt[i1][i2-1] + 1,       # insertion
                             lt[i1-1][i2] + 1,       # deletion
                             lt[i1-1][i2-1] + v)     # substitution
    return lt[-1][-1]

str1 = "The quick brown fox"
str2 = "The quick brown fox jumped over the lazy dog"

print("{} words need to be added, deleted or replaced to convert string 1 into string 2".format(ld(str1.split(),str2.split())))
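If you want a similarity score rather than a raw edit count, one possible normalisation divides the distance by the longer word list. A sketch (word_ld is just a compact single-row version of the same word-level distance; the names are my own):

```python
def word_ld(s1, s2):
    # Word-level Levenshtein distance, single-row dynamic programming
    w1, w2 = s1.split(), s2.split()
    prev = list(range(len(w2) + 1))
    for i, a in enumerate(w1, 1):
        cur = [i]
        for j, b in enumerate(w2, 1):
            cur.append(min(cur[j-1] + 1, prev[j] + 1, prev[j-1] + (a != b)))
        prev = cur
    return prev[-1]

def word_similarity(s1, s2):
    # Normalise the distance into a 0..1 similarity score
    n1, n2 = len(s1.split()), len(s2.split())
    return 1 - word_ld(s1, s2) / max(n1, n2)

print(word_similarity("The quick brown fox",
                      "The quick brown fox jumped over the lazy dog"))
# 5 edits over 9 words -> ≈ 0.444
```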
