A replace method in Python for whole words with boundaries (like regular expressions)

Question

I am looking for a more robust replace method in Python, because I am building a spell checker to fix words in an OCR context.

Suppose we have the following text in Python:

text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""

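It is easy to see that the correct phrase should be "this is a text" rather than "his is a text". If I run text.replace('his','this'), every occurrence of 'his' is replaced, including the one inside 'this', so I end up with errors like "tthis is a text". When I do the replacement, I want to replace only the whole word 'his', not the 'his' inside 'this'. Why not try this?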

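import re

word_to_replace = 'his'
corrected_word = 'this'
# Raw strings keep \b as a regex word boundary rather than a backspace character
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text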

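Great, that works, but the problem is: what if the word to correct contains a special character such as "|"? For example, "|ights are on" instead of "lights are on". Believe me, it happened to me, and in that case re.sub is a disaster. Have you run into the same problem? Is there a way to solve it? A plain string replace looks like the most robust option. I tried text.replace(' '+word_to_replace+' ',' '+corrected_word+' '), which solves a lot of cases, but it still fails on phrases like "his is a text": 'his' sits at the start of the sentence with no space before it, so nothing is replaced.

Is there any replace method in Python that treats the input as a whole word, the way the regular expression \b word_to_correct \b does?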
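To make the failure concrete, here is a minimal sketch of what seems to go wrong with a word like "|ights". The sample string and the variable names `naive` and `escaped` are assumptions made for illustration; they are not from the original post.

```python
import re

word_to_replace = "|ights"
text = "|ights are on"

# Unescaped, "|" is the regex alternation operator, so this pattern means
# "a word boundary" OR "ights followed by a word boundary".
# The literal "|" is never matched and stays in the result.
naive = re.sub(r"\b" + word_to_replace + r"\b", "lights", text)

# re.escape makes "|" literal, but \b needs a word character on one side.
# "|" is not a word character, so there is no boundary before it at the
# start of the string and the pattern matches nothing.
escaped = re.sub(r"\b" + re.escape(word_to_replace) + r"\b", "lights", text)

print(repr(naive))
print(repr(escaped))  # the text comes back unchanged
```

So there are two separate issues: the word has to be escaped before it goes into a pattern, and \b is the wrong kind of boundary when the word starts or ends with punctuation.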

Answer 1

A few days later, I solved the problem I was having. I hope this helps others. If you have any questions or run into other issues, let me know.


text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""


import re

# Assume the misread word has already been corrected by the OCR spell checker,
# so all that is left is to replace it in the text (I did it with my OCR spell checker).
# word2correct is the word as found; corrected_word is the word after spell checking.
word2correct = 'his'
corrected_word = 'this'


# Replace the word together with its context
def context_replace(old_word, new_word, text):
    # Match the word between \b boundaries, plus up to 10 characters on each side.
    # This captures 'his' and its context, but not the 'his' inside 'this'.
    matches = re.findall('.{1,10}\\b' + re.escape(old_word) + '\\b.{1,10}', text)
    if not matches:
        return text  # nothing to replace
    phrase2correct = matches[0]
    # Once the context is matched, put the new word into it
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Replace the old phrase (phrase2correct) with the new one (phrase_corrected)
    return text.replace(phrase2correct, phrase_corrected)

Test whether the function works...

print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))

Output:

this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with.

It works for my purposes.
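As an alternative to matching the context first, a whole-word replacement that also copes with words such as "|ights" can be sketched with re.escape and lookarounds instead of \b. This is only a sketch under assumptions: the function name `replace_whole_word` is hypothetical, and "whole word" is taken to mean "not directly preceded or followed by a letter, digit or underscore".

```python
import re


def replace_whole_word(text, old_word, new_word):
    """Replace old_word only where it is not glued to other word characters."""
    # re.escape turns characters such as "|" into literals.
    # (?<!\w) and (?!\w) check the neighbours without requiring old_word
    # itself to start or end with a word character, unlike \b.
    pattern = r"(?<!\w)" + re.escape(old_word) + r"(?!\w)"
    # A function as the replacement keeps backslashes in new_word literal.
    return re.sub(pattern, lambda _match: new_word, text)


print(replace_whole_word("his text. this is his.", "his", "this"))
# → this text. this is this.
print(replace_whole_word("|ights are on", "|ights", "lights"))
# → lights are on
```

Unlike the context-based function above, this replaces every whole-word occurrence rather than only the first, and it does not touch the 'his' inside 'this' even when both appear close together.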

Solution 1
0 Accepted 2020-05-20 14:29:31
