Can I do a "string contains X" check with a percentage accuracy in Python?

Question

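I need to do some OCR on a large chunk of text and check whether it contains a certain string, but because of the OCR's inaccuracy I need to check whether it contains something like a ~85% match for that string.

For example, I may OCR a chunk of text to make sure it doesn't contain no information available, but the OCR might see n0 inf0rmation available or misread a number of characters.

Is there an easy way to do this in Python?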

Answer 1

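As gauden posted, SequenceMatcher in difflib is a simple way to go. Its ratio() method returns a value between 0 and 1 corresponding to the similarity between two strings. From the docs:

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

Example: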

>>> import difflib
>>> difflib.SequenceMatcher(None,'no information available','n0 inf0rmation available').ratio()
0.91666666666666663
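
To see where that number comes from, here is a small sketch that recomputes the ratio by hand from the 2.0*M / T formula. The variable names (a, b, M, T, manual_ratio) are mine, picked to mirror the notation in the docs; get_matching_blocks() is the standard SequenceMatcher method that lists the matching runs.

```python
import difflib

a = 'no information available'
b = 'n0 inf0rmation available'
matcher = difflib.SequenceMatcher(None, a, b)

# M: total length of all matching blocks (the final block is a size-0 sentinel).
M = sum(block.size for block in matcher.get_matching_blocks())
# T: total number of elements in both sequences.
T = len(a) + len(b)

manual_ratio = 2.0 * M / T
print(M, T, manual_ratio)
```

Here 22 of the 24 characters line up in each string, so M is 22 and T is 48, which is comfortably above an 85% threshold.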

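There is also get_close_matches, which might be useful to you. You can specify a similarity cutoff (a ratio between 0 and 1), and it returns the matches from a list that score at least that cutoff (up to 3 by default, best first):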

>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny', 
                              'house'], cutoff=0.8)
['uncorn']
>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
                              'house'], cutoff=0.5)
['uncorn', 'corny', 'unicycle']
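
If you want to see why a given cutoff keeps or drops a candidate, one option is to print the ratio for each candidate yourself. This is only a sketch: the names word, candidates and scores are made up for illustration, and it calls SequenceMatcher directly instead of reproducing the internals of get_close_matches.

```python
import difflib

word = 'unicorn'
candidates = ['unicycle', 'uncorn', 'corny', 'house']

# Score every candidate against the word we are looking for.
scores = {}
for candidate in candidates:
    scores[candidate] = difflib.SequenceMatcher(None, candidate, word).ratio()

# Show the candidates from most to least similar.
for candidate in sorted(scores, key=scores.get, reverse=True):
    print(candidate, round(scores[candidate], 3))
```

Only 'uncorn' clears 0.8, while 'corny' and 'unicycle' fall between 0.5 and 0.8, which matches the two results above.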

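Update: finding a partial sub-sequence match

To find close matches to a three-word sequence, I would split the text into words, group them into three-word sequences, and then apply difflib.get_close_matches, like this: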

import difflib
text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
words = text.split()
# Build every run of three consecutive words.
three = [' '.join([i, j, k]) for i, j, k in zip(words, words[1:], words[2:])]
print(difflib.get_close_matches('no information available', three, cutoff=0.9))
# Output:
# ['n0 inf0rmation available']
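
The same idea can be wrapped into a yes/no "contains" check, which is closer to what the question asks for. The following is a minimal sketch under my own assumptions: fuzzy_contains is a hypothetical helper name, the 0.85 default comes from the question's ~85%, and matching is done on whole-word windows, so it will miss cases where the OCR splits or merges words.

```python
import difflib

def fuzzy_contains(text, phrase, threshold=0.85):
    """Return True if some run of words in text matches phrase with ratio >= threshold."""
    words = text.split()
    n = len(phrase.split())
    # Slide a window of n words across the text and compare each one to the phrase.
    for start in range(len(words) - n + 1):
        window = ' '.join(words[start:start + n])
        if difflib.SequenceMatcher(None, phrase, window).ratio() >= threshold:
            return True
    return False

ocr_text = "the three word sequence n0 inf0rmation available I wonder if we will find it?"
print(fuzzy_contains(ocr_text, 'no information available'))
```

Returning as soon as one window passes keeps it cheap for the common case. If you need the matched text as well, return the window instead of True.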

Answer 2

The SequenceMatcher object in the difflib standard library module will give you a ratio directly.

Answer 3

You could compute the Levenshtein distance. Here is a Python implementation: http://pypi.python.org/pypi/python-Levenshtein/
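
If you would rather not install a package, a pure-Python sketch of the distance might look like the following. This is the textbook dynamic-programming version, not the python-Levenshtein API; the function names levenshtein and similarity are made up for illustration, and dividing by the longer string's length is just one possible way to turn the distance into a percentage.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def similarity(a, b):
    """Map the distance onto a 0..1 score (assumed normalisation: longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein('no information available', 'n0 inf0rmation available'))
print(similarity('no information available', 'n0 inf0rmation available'))
```

For the OCR example the distance is 2 (two substituted characters out of 24), so the score again lands above 85%.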

Answer 4

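I don't know of any available Python lib that would do this out of the box, but you may find one (or find a C or C++ lib and write a Python wrapper for it).

You can also try to roll your own solution, based on a "brute force" char-by-char comparison, with rules defining the "proximity" between two given chars and computing an "accuracy" from those rules (i.e. "o" => "0": 90% accuracy, "o" => "w": 1% accuracy, etc.), or play with more involved AI stuff (if you're not familiar with AI, the book "Programming Collective Intelligence" could get you started, despite its somewhat poor implementation examples).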
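
A rough sketch of the roll-your-own idea could look like this. Everything in it is an assumption for illustration: the CONFUSION table and its scores are invented (only the "o"/"0" and "o"/"w" values come from the text above), the names char_score and accuracy are hypothetical, and it only handles strings of equal length, so inserted or dropped characters would need something closer to an edit distance.

```python
# Hypothetical proximity rules: (expected char, char the OCR produced) -> accuracy.
CONFUSION = {
    ('o', '0'): 0.9,
    ('l', '1'): 0.9,   # invented example, not from the text above
    ('o', 'w'): 0.01,
}

def char_score(expected, seen):
    """Score one character pair: 1.0 if identical, else the table value (0.0 if unknown)."""
    if expected == seen:
        return 1.0
    return CONFUSION.get((expected, seen), 0.0)

def accuracy(expected, seen):
    """Average the per-character scores; only defined here for equal-length strings."""
    if len(expected) != len(seen) or not expected:
        return 0.0
    total = sum(char_score(e, s) for e, s in zip(expected, seen))
    return total / len(expected)

print(accuracy('no information available', 'n0 inf0rmation available'))
```

With these made-up weights, the two "o" to "0" misreads cost very little, so the OCR example scores close to 1.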

Answer 5

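Just to expand on fraxel's answer, this allows finding a word sequence of any arbitrary length. Sorry about the poor formatting. The accuracy is the cutoff value in findWords.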

import difflib

def joinAllInTupleList(toupe):
    # joinAllInTupleList([("hello", "world"), ("face", "book")]) == ['hello world', 'face book']
    result = []
    for i in toupe:
        # i is the tuple itself
        carry = " "
        for z in i:
            # z is an element of i
            carry += " " + z

        result.append(carry.strip())
    return result

def findWords(text, wordSequence):

    # setup
    words = text.split(" ")

    # get a list of sublists based on the length of wordSequence
    # i.e. get all wordSequence-length sub-sequences in text!

    result = []
    numberOfWordsInSequence = len(wordSequence.strip().split(" "))
    for i in range(numberOfWordsInSequence):
        result.append(words[i:])

    # print('result', result)
    c = zip(*result)

    # join each tuple to a string
    joined = joinAllInTupleList(c)

    return difflib.get_close_matches(wordSequence, joined, cutoff=0.72389)


Answer 1: score 27, accepted, 2012-06-01 11:31:59

Answer 2: score 6, 2012-06-01 11:19:49

Answer 3: score 4, 2012-06-01 11:16:21

Answer 4: score 0, 2012-06-01 11:28:58

Answer 5: score 0, 2014-01-09 06:29:33
