Replacing method for words with boundaries in Python (like with regex)
I am looking for a more robust replace method in Python, because I am building a spell checker for words in an OCR context.
Suppose we have the following text in Python:
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
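It is easy to see that the correct phrase should be "this is a text" rather than "his is a text". If I run text.replace('his','this'), every 'his' gets replaced, including the one inside 'this', so I end up with errors like "tthis is a text". When I make a replacement, I want to replace only the whole word 'his', not the 'his' inside 'this'. Why not try this?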
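import re

word_to_replace = 'his'
corrected_word = 'this'
# Raw strings keep \b as a regex word boundary instead of a backspace character
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text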
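Great, we did it, but the problem is... what if the word to correct contains a special character such as '|'? For example, "|ights are on" instead of "lights are on". Believe me, it has happened to me, and in that case re.sub is a disaster. The question is, have you run into the same problem? Is there any way to solve it? Plain replace is the most robust option. I tried text.replace(' '+word_to_replace+' ',' '+corrected_word+' '), which solves a lot of cases, but it still fails on phrases like "his is a text": the replacement does nothing there because 'his ' sits at the beginning of the sentence, so there is no ' his ' with a space on both sides to match.
Is there any replace method in Python that takes the whole word as input, the way the regex \b word_to_correct \b does?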
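One possible way to get whole-word replacement that also survives characters like '|' is sketched below. It is not from the original post: the helper name replace_whole_word and the sample string are made up for illustration. The sketch assumes two things. First, the word is passed through re.escape so that '|' is treated literally rather than as regex alternation. Second, \b is swapped for the lookarounds (?<!\w) and (?!\w), because \b needs a word character on one side and therefore cannot match between a space (or line start) and '|'.

```python
import re

def replace_whole_word(text, old_word, new_word):
    """Replace old_word only where it is not glued to other word characters.

    Hypothetical helper for illustration, not part of the original post.
    """
    # re.escape makes regex metacharacters such as "|" match literally.
    # The lookarounds stand in for \b, which cannot match between two
    # non-word characters (for example a newline followed by "|").
    pattern = r'(?<!\w)' + re.escape(old_word) + r'(?!\w)'
    # A function replacement inserts new_word verbatim, so backslashes in
    # it are not interpreted as group references.
    return re.sub(pattern, lambda match: new_word, text)

sample = "his is a text.\n|ights are on, and this is fine."
fixed = replace_whole_word(sample, 'his', 'this')
fixed = replace_whole_word(fixed, '|ights', 'lights')
print(fixed)
```

With this sample, the standalone 'his' at the start is corrected while the 'his' inside 'this' is left alone, and '|ights' becomes 'lights'. Whether "not next to a word character" is the right notion of a word boundary for a given OCR corpus is a judgment call.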
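A few days later, I solved the problem I was having. I hope this helps someone else. If you have any questions or run into other problems, let me know.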
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
import re

# Assume the word has already been corrected by the OCR spell checker,
# so all that is left is to replace it in the text.
# word2correct is the OCR output; corrected_word is the spell checker's result.
word2correct = 'his'
corrected_word = 'this'

# Replace the word together with its surrounding context
def context_replace(old_word, new_word, text):
    # Match the whole word between \b boundaries plus up to 10 characters of
    # context on each side. This captures "his" but not the "his" inside "this".
    matches = re.findall('.{1,10}\\b' + old_word + '\\b.{1,10}', text)
    if not matches:
        # Nothing to correct, so return the text unchanged
        return text
    phrase2correct = matches[0]
    # Put the new word into the matched context
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Replace the old phrase (phrase2correct) with the new one (phrase_corrected)
    text = text.replace(phrase2correct, phrase_corrected)
    return text
Testing whether the function works...
print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))
Output:
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with.
It works for my purposes. I hope this helps someone else.