
Replacing method for words with boundaries in Python (like with regex)

I am looking for a more robust replace method in Python, because I am building a spell checker for words that come out of OCR.

Suppose we have the following text in Python:

text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""

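It is easy to see that the correct phrase should be "this is a text" and not "his is a text". If I run text.replace('his','this'), every 'his' gets replaced, including the one inside 'this', so I end up with errors like "tthis is a text". When I do the replacement, I want to replace only the whole word 'his', not the 'his' inside 'this'. Why not try this?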

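import re

word_to_replace = 'his'
corrected_word = 'this'
# Raw strings keep \b as a regex word boundary instead of a backspace character
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text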

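Great, we did it. But the problem is: what if the word to correct contains a special character such as '|'? For example, "|ights are on" instead of "lights are on". Trust me, it happened to me, and in that case re.sub is a disaster. The question is, have you run into the same problem? Is there any way to solve it? str.replace seemed like the most robust option. I tried text.replace(' '+word_to_replace+' ',' '+corrected_word+' '), which solves a lot of cases, but phrases like "his is a text" are still a problem: the replacement does not work there because 'his ' is at the beginning of the sentence, so there is no leading space and ' his ' is never found to be swapped for ' this '.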

Is there any replacement method in Python that takes the whole word as input, the way the regex \b word_to_correct \b does?
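To make the failure described above concrete, here is a small sketch of what seems to happen when the misread word starts with `|`. The string `sample` and the names `wrong` and `right` are made up for illustration and are not part of the OCR text in the question. Unescaped, `|` is the regex alternation operator; escaped with `re.escape`, the pattern is literal, but `\b` still cannot match between a space and `|`, because both are non-word characters.

```python
import re

sample = "the |ights are on"   # made-up OCR line with "|" misread for "l"
wrong, right = "|ights", "lights"

# Unescaped: "|" means alternation, so the pattern is "\b" OR "ights\b".
# The empty "\b" alternative matches at word boundaries and inserts
# the replacement in places where it does not belong.
unescaped = re.sub(r"\b" + wrong + r"\b", right, sample)

# Escaped: the pattern is literal, but there is no word boundary between
# the space and "|", so nothing matches and the text comes back unchanged.
escaped = re.sub(r"\b" + re.escape(wrong) + r"\b", right, sample)

print(unescaped)
print(escaped)
```

Neither call produces "the lights are on", which is why `\b` alone is not enough for words that begin or end with a symbol.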

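After a few days, I solved the problem I had. I hope this is helpful to someone else. Let me know if you have any questions or run into other problems.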


text =  """
this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with. 
"""


import re

# Assume the word has already been corrected by the OCR spellchecker,
# so only the replacement in the text remains (I did it with my OCR spellchecker).
# word2correct is the misread word, corrected_word is the spellchecker's result.
word2correct = 'his'
corrected_word = 'this'

# Replace the word together with its surrounding context
def context_replace(old_word, new_word, text):
    # Match the whole word between \b boundaries plus 1 to 10 characters of
    # context on each side. This captures "his" and its context, but not "this".
    # re.escape keeps special characters in old_word from being read as regex syntax.
    matches = re.findall('.{1,10}\\b' + re.escape(old_word) + '\\b.{1,10}', text)
    if not matches:
        return text  # nothing matched, leave the text unchanged
    # Only the first matched context is corrected
    phrase2correct = matches[0]
    # Once the context is matched, put the new word into it
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Now replace the old phrase (phrase2correct) with the new one (phrase_corrected)
    text = text.replace(phrase2correct, phrase_corrected)
    return text

Testing whether the function works...

print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))

Output:

this is a text, generated using optical character recognition. 
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with. 

It works for my purposes. I hope this is helpful to someone else.
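As a complement, here is a minimal sketch of a different technique, not the method used in the answer above: escape the word with `re.escape` and swap `\b` for the lookarounds `(?<!\w)` and `(?!\w)`, which only require that no word character touches the match. The function name `replace_whole_word` and the sample strings are assumptions made up for illustration. Under those assumptions it should also handle words that begin or end with a symbol such as `|`, work at the start of a sentence, and replace every whole-word occurrence rather than only the first.

```python
import re

def replace_whole_word(old_word, new_word, text):
    # (?<!\w) : no word character immediately before the match
    # (?!\w)  : no word character immediately after the match
    pattern = r"(?<!\w)" + re.escape(old_word) + r"(?!\w)"
    # Passing a function as the replacement keeps any backslashes
    # in new_word literal instead of treating them as escapes
    return re.sub(pattern, lambda m: new_word, text)

print(replace_whole_word("his", "this", "his text, not this text"))
# this text, not this text
print(replace_whole_word("|ights", "lights", "the |ights are on"))
# the lights are on
```

The trade-off is that a word glued to a letter or digit (for example "a|ights") is deliberately left alone, since it is not a whole word by this definition.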
