
Can I do a “string contains X” with a percentage accuracy in python?

I need to do some OCR on a large chunk of text and check whether it contains a certain string. Because the OCR is inaccurate, I need to check whether the text contains something like a ~85% match for the string.

For example, I may OCR a chunk of text to make sure it doesn't contain "no information available", but the OCR might see "n0 inf0rmation available" or misinterpret any number of characters.

Is there an easy way to do this in Python?

As posted by gauden, SequenceMatcher in difflib is an easy way to go. Its ratio() method returns a value between 0 and 1 corresponding to the similarity between the two strings. From the docs:

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

example:

>>> import difflib
>>> difflib.SequenceMatcher(None,'no information available','n0 inf0rmation available').ratio()
0.91666666666666663
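
To see where that number comes from, here is a small sketch that recomputes the ratio by hand from get_matching_blocks(), following the 2.0*M / T formula quoted above. The variable names are mine and only for illustration.

```python
import difflib

a = 'no information available'
b = 'n0 inf0rmation available'
matcher = difflib.SequenceMatcher(None, a, b)

# M: total number of matched characters across all matching blocks
# (the last block is a zero-length sentinel, so it adds nothing).
matches = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of elements in both sequences.
total = len(a) + len(b)

manual_ratio = 2.0 * matches / total
print(matches, total, manual_ratio)  # 22 matched characters out of 48
```

Only the two misread characters ("o" read as "0") fail to match, which is why the score stays well above an 85% threshold.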

There is also get_close_matches, which might be useful to you. You can specify a cutoff (a minimum similarity ratio between 0 and 1, not a distance) and it returns the best matches from a list that score at least that cutoff:

>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny', 
                              'house'], cutoff=0.8)
['uncorn']
>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
                              'house'], cutoff=0.5)
['uncorn', 'corny', 'unicycle']
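
Note that get_close_matches also takes an n argument (default 3) that caps how many results come back, ordered best match first. A minimal sketch reusing the list above, in case you only want the single closest candidate:

```python
import difflib

candidates = ['unicycle', 'uncorn', 'corny', 'house']

# n caps how many matches are returned; results are sorted best-first.
best_only = difflib.get_close_matches('unicorn', candidates, n=1, cutoff=0.5)
print(best_only)  # ['uncorn']
```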

Update: to find a partial sub-sequence match

To find close matches to a three-word sequence, I would split the text into words, group them into three-word sequences, and then apply difflib.get_close_matches, like this:

import difflib

text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
words = text.split()
# Build every run of three consecutive words.
three = [' '.join([i, j, k]) for i, j, k in zip(words, words[1:], words[2:])]
print(difflib.get_close_matches('no information available', three, cutoff=0.9))
# Output:
# ['n0 inf0rmation available']

The SequenceMatcher object in the difflib standard library module will give you the ratio directly.
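
As a minimal sketch of using that ratio for the ~85% check from the question: the helper name is_close and the default threshold are my own choices for illustration, not part of difflib.

```python
import difflib

def is_close(a, b, threshold=0.85):
    """Return True if the similarity ratio of a and b is at least threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

print(is_close('no information available', 'n0 inf0rmation available'))   # True
print(is_close('no information available', 'completely different text'))  # False
```

This compares two whole strings, so the OCR text still has to be cut into candidate pieces of roughly the right length first.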

You could compute the Levenshtein distance. Here is one Python implementation: http://pypi.python.org/pypi/python-Levenshtein/
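
If you would rather not install anything, the distance is short to write in pure Python. The sketch below is the standard dynamic-programming version, not the code of the linked package. Turning the distance into a 0..1 score by dividing by the longer length is just one common convention that I am assuming here.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalize the distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

distance = levenshtein('no information available', 'n0 inf0rmation available')
score = similarity('no information available', 'n0 inf0rmation available')
print(distance, score)  # distance is 2: two substituted characters
```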

I don't know of any available Python library that would do that out of the box, but you might find one (or find a C or C++ library and write a Python wrapper for it).

You can also try to roll your own solution. One option is a "brute force" character-by-character comparison, with rules defining the "proximity" between two given characters and computing the "accuracy" from these rules (e.g. "o" => "0": 90% accuracy, "o" => "w": 1% accuracy, etc.). Another is playing with more involved AI techniques (if you're not familiar with AI, the book "Programming Collective Intelligence" could get you started, despite its somewhat poor implementation examples).
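
A rough sketch of the character-by-character idea follows. The CONFUSION table and its weights are made up for illustration and would need tuning against real OCR output. The comparison is position by position, so it tolerates misread characters but not characters that the OCR drops or inserts.

```python
# Hypothetical weights: how plausible it is that OCR read the expected
# character (first) as the observed one (second). Values are invented.
CONFUSION = {
    ('o', '0'): 0.9,
    ('l', '1'): 0.9,
    ('i', '1'): 0.8,
    ('s', '5'): 0.8,
}

def char_score(expected, observed):
    """1.0 if equal, the table weight for a known confusion, otherwise 0.0."""
    if expected == observed:
        return 1.0
    return CONFUSION.get((expected, observed), 0.0)

def window_score(expected, observed):
    """Average per-character score of two equal-length, non-empty strings."""
    return sum(char_score(e, o) for e, o in zip(expected, observed)) / len(expected)

def best_window(text, phrase):
    """Slide phrase over text; return (best score, best matching substring)."""
    if not phrase:
        return (0.0, '')
    size = len(phrase)
    windows = (text[i:i + size] for i in range(len(text) - size + 1))
    return max(((window_score(phrase, w), w) for w in windows), default=(0.0, ''))

score, found = best_window('page 3: n0 inf0rmation available here',
                           'no information available')
print(score, found)
```

You would then compare the returned score against your own threshold, such as 0.85.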

Just to expand on fraxel's answer, this finds a word sequence of any length. Sorry for the poor formatting, SO is hard. The accuracy is the cutoff value passed to get_close_matches in findWords.

import difflib

def joinAllInTupleList(toupe):
    # joinAllInTupleList([("hello", "world"), ("face", "book")]) == ['hello world', 'face book']
    result = []
    for i in toupe:
        # i is the tuple itself; join its elements with single spaces
        result.append(" ".join(i).strip())
    return result

def findWords(text, wordSequence):
    # setup
    words = text.split(" ")

    # get a list of sub-lists based on the length of wordSequence,
    # i.e. get all wordSequence-length sub-sequences in text
    result = []
    numberOfWordsInSequence = len(wordSequence.strip().split(" "))
    for i in range(numberOfWordsInSequence):
        result.append(words[i:])

    # zip the shifted lists so each tuple is one run of consecutive words
    c = zip(*result)

    # join each tuple to a string
    joined = joinAllInTupleList(c)

    return difflib.get_close_matches(wordSequence, joined, cutoff=0.72389)
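
To get the yes/no answer the question asks for, the same word-window idea can be wrapped in a small predicate. This is a self-contained sketch. The name fuzzy_contains and the 0.85 default are my own assumptions, chosen to mirror the ~85% figure in the question.

```python
import difflib

def fuzzy_contains(text, phrase, threshold=0.85):
    """True if some run of consecutive words in text matches phrase
    with a difflib similarity ratio of at least threshold."""
    words = text.split()
    size = len(phrase.split())
    # Every run of `size` consecutive words, rejoined with single spaces.
    candidates = [' '.join(words[i:i + size])
                  for i in range(len(words) - size + 1)]
    return bool(difflib.get_close_matches(phrase, candidates, n=1, cutoff=threshold))

ocr_text = ("Here is the text we are trying to match across to find the three word "
            "sequence n0 inf0rmation available I wonder if we will find it?")
print(fuzzy_contains(ocr_text, 'no information available'))  # True
print(fuzzy_contains(ocr_text, 'payment overdue notice'))    # False
```

Because the windows are built from whole words, this assumes the OCR keeps word boundaries intact. If it merges or splits words, a character-level window would be the safer choice.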
