Matching the same number of repetitions of a character as repetitions of a captured group

Question

I want to clean up some keylogged input using Python and regex, in particular where the backspace key was used to fix mistakes.

Example 1:

[in]:  'Helloo<BckSp> world'
[out]: 'Hello world'

This can be done with

re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

Example 2:
But when there are several backspaces, I don't know how to delete exactly the same number of characters:

[in]:  'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'

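(Here I want to remove the 'l' and the 'o' before the two backspaces.)

I could simply apply re.sub(r'[^>]<BckSp>', '', line) repeatedly until no <BckSp> is left, but I would like to find a more elegant/faster solution.

Does anyone know how to do this?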

Answer 1

It looks like Python's re module does not support recursive regular expressions. If you can use another language, you could try this:

.(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

Answer 2

It's not very efficient, but you can do this with the re module:

(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1

Demo

This way you don't have to count; the pattern relies on repetition alone.

(?: 
    [^<] # a character to remove
    (?=  # lookahead to reach the corresponding <BckSp>
        [^<]* # skip characters until the first <BckSp>
        (  # capture group 1: contains the <BckSp>s
            (?=(\1?))\2 # emulate an atomic group in place of \1?+
                        # The idea is to add the <BckSp>s already matched in the
                        # previous repetitions if any to be sure that the following
                        # <BckSp> isn't already associated with a character
            <BckSp> # corresponding <BckSp>
        )
    )
)+ # each time the group is repeated, the capture group 1 is growing with a new <BckSp>

\1 # matches all the consecutive <BckSp> and ensures that there's no more character
   # between the last character to remove and the first <BckSp>
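
The "emulate an atomic group" step in the commented pattern relies on a general trick: a lookahead captures text, and a backreference then consumes exactly that capture, so the engine cannot give characters back. A minimal sketch of the trick in isolation follows; the sample string and the variable names are illustrative assumptions, not part of the answer above.

```python
import re

text = "aaa"

# Plain greedy quantifier: a+ first takes "aaa", then gives one "a" back
# so that the trailing literal "a" can still match.
backtracking = re.search(r"a+a", text)

# Emulated atomic group: the lookahead captures the longest run of "a",
# \1 consumes exactly that capture, and nothing can be given back,
# so the trailing literal "a" never finds a character to match.
atomic = re.search(r"(?=(a+))\1a", text)

print(backtracking.group())  # aaa
print(atomic)                # None
```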

You can do the same thing with the regex module, and this time you don't need to emulate the possessive quantifier:

(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1

Demo

With the regex module you can also use recursion (as @Fallenhero noted):

[^<](?R)?<BckSp>

Demo

Answer 3

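Since recursion/subroutine calls and atomic groups/possessive quantifiers are not supported by Python's re, you can remove the characters that are followed by a backspace in a loop: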

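import re

s = "Helllo\b\bo world"
# In a regular (non-raw) string literal, "\b" is the backspace character itself.
# Match leading backspaces, or any non-backspace character followed by a backspace.
r = re.compile("^\b+|[^\b]\b")
# Each pass removes one character per run of backspaces; repeat until none match.
while r.search(s):
    s = r.sub("", s)
print(s)  # Hello world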

See the Python demo

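The "^\b+|[^\b]\b" pattern finds one or more backspace characters at the start of the string (with ^\b+), while [^\b]\b finds all non-overlapping occurrences of any character other than a backspace that is followed by a backspace.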
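
As a variation on the loop, re.subn returns the number of substitutions it made, so each pass scans the string once instead of calling search and then sub. This is only a sketch under the same assumption that backspaces are the single character "\b"; the function name strip_backspaces_subn is hypothetical.

```python
import re

# "\b" in a non-raw string literal is the backspace character (0x08).
BACKSPACE_STEP = re.compile("^\b+|[^\b]\b")


def strip_backspaces_subn(text):
    """Hypothetical helper: repeat the substitution until nothing changes."""
    count = 1
    while count:
        # subn returns (new_string, number_of_substitutions_made)
        text, count = BACKSPACE_STEP.subn("", text)
    return text


print(strip_backspaces_subn("Helllo\b\bo world"))  # Hello world
```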

If the backspace is represented as some entity/tag (such as a literal <BckSp>), the same approach applies:

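import re

s = "Helllo<BckSp><BckSp>o world"
# re.S lets "." also match a newline typed before a backspace.
# The lookbehind stops "." from matching the ">" that closes a previous <BckSp>,
# which would otherwise corrupt runs of three or more consecutive tags.
r = re.compile("^(?:<BckSp>)+|.(?<!<BckSp>)<BckSp>", flags=re.S)
while r.search(s):
    s = r.sub("", s)
print(s)  # Hello world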

See another Python demo

Answer 4

If the marker is a single character, you can simply use a stack, which gives you the result in a single pass:

s = "Helllo\b\bo world"
res = []

for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)

print(''.join(res)) # Hello world

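If the marker is literally '<BckSp>' or some other string longer than one character, you can use replace to substitute it with '\b' and use the solution above. This only works if you know that '\b' does not occur in the input. If you cannot designate a substitute character, you can use split and process the result: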

s = 'Helllo<BckSp><BckSp>o world'
res = []

for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)

print(''.join(res)) # Hello world
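
The replace-based variant described earlier in this answer could look roughly like the sketch below. The function name apply_backspaces_via_sentinel and its parameters are hypothetical, and raising an error when the substitute character already occurs in the input is one assumed way to handle that case.

```python
def apply_backspaces_via_sentinel(text, marker="<BckSp>", sentinel="\b"):
    """Hypothetical helper: map the marker to one character, then use a stack."""
    if sentinel in text:
        # The substitution would be ambiguous, so refuse instead of guessing.
        raise ValueError("sentinel character already occurs in the input")
    kept = []
    for ch in text.replace(marker, sentinel):
        if ch == sentinel:
            if kept:          # ignore a backspace with nothing left to delete
                kept.pop()
        else:
            kept.append(ch)
    return "".join(kept)


print(apply_backspaces_via_sentinel("Helllo<BckSp><BckSp>o world"))  # Hello world
```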

Answer 5

It's slightly verbose, but you can use this lambda function to count the <BckSp> occurrences and use string slicing to get the final output.

>>> import re
>>> bk = '<BckSp>'

>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello word

>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello work
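
The same counting idea can be wrapped in a named function to make the steps easier to follow. This is a sketch, not the answer's original code: the name strip_by_count is hypothetical, and clamping the slice at zero and passing re.S are added assumptions. Note that each run of backspaces can only delete characters from the text captured in the same match, not from text kept by an earlier match.

```python
import re


def strip_by_count(text, marker="<BckSp>"):
    """Hypothetical helper: drop as many trailing characters as there are markers."""
    # Group 1: text typed before a run of markers; group 2: the run itself.
    pattern = re.compile(r"(.*?)((?:" + re.escape(marker) + r")+)", flags=re.S)

    def drop_tail(match):
        typed, run = match.group(1), match.group(2)
        presses = len(run) // len(marker)      # number of markers in the run
        keep = max(0, len(typed) - presses)    # never slice with a negative end
        return typed[:keep]

    return pattern.sub(drop_tail, text)


print(strip_by_count("Helllo<BckSp><BckSp>o world<BckSp><BckSp>k"))  # Hello work
```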

5 solutions

Solution 1
2 2016-12-27 10:41:08

Solution 2
2 2016-12-27 10:44:22

Solution 3
1 2016-12-27 10:39:54

Solution 4
1 2016-12-27 10:55:25

Solution 5
1 2016-12-27 11:01:21
