
Python: using re.sub to replace multiple substrings multiple times

I am trying to correct a text that has some very typical scanning errors (l mistaken for I and vice versa). Basically, I would like the replacement string in re.sub to depend on the number of times the 'I' is detected, something like this:

re.sub("(\\w+)(I+)(\\w*)", "\\g<1>l+\\g<3>", "I am stiII here.")

What's the best way to achieve this?

Pass a function as the replacement argument, as described in the docs. Your function can identify the mistake and create the best substitution based on that.

import re

def replacement(match):
    if "I" in match.group(2):
        return match.group(1) + "l" * len(match.group(2)) + match.group(3)
    # Add additional cases here and as ORs in your regex
    return match.group(0)  # fall back to the unchanged text

print(re.sub(r"(\w+)(II+)(\w*)", replacement, "I am stiII here."))
# I am still here.

(Note that I modified your regex so the repeated Is appear in one group.)
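
As a compact variant of the same idea, the replacement callable can be a lambda that sizes its output from the length of the match. This is only a sketch: the lookbehind `(?<=\w)` and the helper name `fix_inner_I` are assumptions introduced here, not part of the answer above, and the pattern will also rewrite an I inside an all-caps acronym.

```python
import re

# Hypothetical helper: replace every run of "I" that follows a word character
# with the same number of "l" characters.
def fix_inner_I(text):
    return re.sub(r"(?<=\w)I+", lambda m: "l" * len(m.group(0)), text)

print(fix_inner_I("I am stiII here."))  # I am still here.
print(fix_inner_I("iIIegaI"))           # illegal
```

A standalone "I" at the start of a word is left alone because nothing precedes it inside the word.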

You can use a lookaround to replace any I that is followed or preceded by another I:

import re

print(re.sub(r"(?<=I)I|I(?=I)", "l", "I am stiII here."))
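
To see what the lookaround pattern does and does not catch, here is a small sketch that runs it over a few inputs. The function name `fix_doubled_I` and the sample strings are assumptions for illustration. Note that an isolated I inside a word is left alone because it has no neighbouring I.

```python
import re

# An "I" counts as a scanning error only when it touches another "I".
DOUBLED_I = re.compile(r"(?<=I)I|I(?=I)")

def fix_doubled_I(text):
    return DOUBLED_I.sub("l", text)

for sample in ["I am stiII here.", "iIIegaI", "I wiII teII"]:
    print(sample, "->", fix_doubled_I(sample))
# I am stiII here. -> I am still here.
# iIIegaI -> illegaI
# I wiII teII -> I will tell
```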

It seems to me that you could do something like:

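import re

def replace_L(match):
    # \w+ is greedy, so group 1 usually holds a single "I";
    # str.replace then rewrites every "I" in the matched word.
    return match.group(0).replace(match.group(1), 'l' * len(match.group(1)))

string_I_want = re.sub(r'\w+(I+)\w*', replace_L, 'I am stiII here.')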

Based on the answer proposed by DNS, I built something a bit more complicated to catch all the cases (or at least most of them), while trying not to introduce too many new errors:

import re

def Irepl(matchobj):
    # Catch acronyms
    if matchobj.group(0).isupper():
        return matchobj.group(0)
    else:
        # Replace Group2 with 'l's
        return matchobj.group(1) + 'l'*len(matchobj.group(2)) + matchobj.group(3)


# Impossible to know if first letter is correct or not (possibly a name)
I_FOR_l_PATTERN = r"([a-zA-HJ-Z]+?)(I+)(\w*)"
# `lines` is expected to be an iterable of text lines read elsewhere
for line in lines:
    tmp_line = line.replace("l'", "I'").replace("'I", "'l").replace(" l ", " I ")
    tmp_line = re.sub(r"^l ", "I ", tmp_line)

    cor_line = re.sub(I_FOR_l_PATTERN, Irepl, tmp_line)

    # Loop to catch all errors in a word (iIIegaI for example)
    while cor_line != tmp_line:
        tmp_line = cor_line
        cor_line = re.sub(I_FOR_l_PATTERN, Irepl, tmp_line)
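
For anyone who wants to try this out directly, the loop can be wrapped in a function that returns the corrected line. This is a sketch under assumptions: the names `correct_line`, `_repl`, `I_FOR_L` and `sample_lines` are made up here, and the sample input is invented for illustration.

```python
import re

I_FOR_L = re.compile(r"([a-zA-HJ-Z]+?)(I+)(\w*)")

def _repl(m):
    # Leave all-caps tokens (likely acronyms) untouched.
    if m.group(0).isupper():
        return m.group(0)
    return m.group(1) + "l" * len(m.group(2)) + m.group(3)

def correct_line(line):
    # Fix the common "l written for I" cases first.
    line = line.replace("l'", "I'").replace("'I", "'l").replace(" l ", " I ")
    line = re.sub(r"^l ", "I ", line)
    # Repeat the substitution until a pass changes nothing.
    while True:
        fixed = I_FOR_L.sub(_repl, line)
        if fixed == line:
            return fixed
        line = fixed

sample_lines = ["l am stiII here.", "That is iIIegaI."]  # invented sample input
for line in sample_lines:
    print(correct_line(line))
# I am still here.
# That is illegal.
```

The repeated passes matter because each pass fixes only the first run of I's in a word, so "iIIegaI" needs two passes.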

Hope this helps somebody else!


 