
Python: Regex to remove more than N consecutive letters

Let's say I have this string: Sayy Hellooooooo

If N = 2, I want the result (using regex) to be: Sayy Helloo

Thank you in advance.

You could build the regex dynamically for a given n, and then call sub without a callback:

import re

n = 2
regex = re.compile(rf"((.)\2{{{n-1}}})\2+")

s = "Sayy Hellooooooo"
print(regex.sub(r"\1", s))  # Sayy Helloo

Explanation:

  • {{ : this double brace represents a literal brace in an f-string
  • {n-1} injects the value of n-1, so together with the surrounding doubled braces, {{{n-1}}} produces {2} when n is 3.
  • The outer capture group captures the maximum allowed repetition of a character
  • The additional \2+ captures more subsequent occurrences of that same character, so these are the characters that need removal.
  • The replacement with \1 thus reproduces the allowed repetition, but omits the additional repetition of that same character.
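To see the pieces of this explanation working together, here is a minimal sketch that wraps the same pattern in a helper. The function name limit_repeats and the extra calls with n = 1 and n = 3 are illustrative assumptions, not part of the answer above.

```python
import re


def limit_repeats(text: str, n: int) -> str:
    """Cap every run of one repeated character at n copies (n >= 1)."""
    # For n = 2 the f-string yields ((.)\2{1})\2+ :
    #   group 2 is one character, group 1 is that character repeated n times,
    #   and the trailing \2+ is the surplus that gets dropped.
    pattern = re.compile(rf"((.)\2{{{n-1}}})\2+")
    return pattern.sub(r"\1", text)


print(limit_repeats("Sayy Hellooooooo", 2))  # Sayy Helloo
print(limit_repeats("Sayy Hellooooooo", 1))  # Say Helo
print(limit_repeats("Sayy Hellooooooo", 3))  # Sayy Hellooo
```

Note that . does not match a newline by default, so runs of blank lines are left alone in this sketch.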

Another option is to use re.sub with a callback:

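import re

N = 2
# m.group(0) is the whole run of one repeated character; keep only its first N characters
result = re.sub(r'(.)\1+', lambda m: m.group(0)[:N], your_string)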
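Since the question asks about letters specifically, the callback approach can be narrowed to letters only. The following is a rough sketch under that assumption. The helper name cap_letter_runs and the sample strings are made up for illustration.

```python
import re


def cap_letter_runs(text: str, n: int) -> str:
    """Truncate runs of a repeated ASCII letter to at most n characters."""
    # m.group(0) is the whole run, so slicing keeps its first n characters.
    # Digits and punctuation never match [A-Za-z], so they are left untouched.
    return re.sub(r"([A-Za-z])\1+", lambda m: m.group(0)[:n], text)


print(cap_letter_runs("Wooow!!! 1000000", 2))  # Woow!!! 1000000
print(cap_letter_runs("Sayy Hellooooooo", 2))  # Sayy Helloo
```

One advantage of the callback is that the pattern itself does not depend on N, so it never needs to be rebuilt.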

You could use backreferences to match the previous character. So (a|b)\1 would match aa or bb. In your case you probably want any letter followed by n or more copies of itself, so ([a-zA-Z])\1{n,}. Then substitute the match with one occurrence using \1 again. So putting it all together:

import re

n=2

expression = r"([a-zA-Z])\1{"+str(n)+",}"
print(re.sub(expression,r"\1","hellooooo friiiiiend"))
# Outputs hello friend


Note that this only matches runs of N+1 or more identical letters, like your test cases: one letter followed by at least N copies of it. If you also want to match runs of exactly N, subtract 1 from the quantifier.
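The off-by-one behaviour described in this note can be checked directly. This is a small sketch. The variable names at_least_n_plus_1 and at_least_n are only illustrative.

```python
import re

text = "hellooooo friiiiiend"
n = 2

# {n,}: one letter plus at least n copies, so only runs of n + 1 or more match.
at_least_n_plus_1 = r"([a-zA-Z])\1{" + str(n) + ",}"
# {n-1,}: one letter plus at least n - 1 copies, so runs of exactly n match too.
at_least_n = r"([a-zA-Z])\1{" + str(n - 1) + ",}"

print(re.sub(at_least_n_plus_1, r"\1", text))  # hello friend
print(re.sub(at_least_n, r"\1", text))         # helo friend
```

With the second pattern the double l in hello is collapsed as well, which is why the original quantifier fits the test cases better.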

Remember to use r in front of regular expressions so you don't need to double escape backslashes.
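A quick way to see why the r prefix matters for backreferences is to compare the string literals themselves. This is a minimal illustration and the sample word is arbitrary.

```python
import re

# Without the r prefix, "\1" is the single control character U+0001.
print("\1" == "\x01")      # True
# The raw string r"\1" is a backslash followed by the digit 1.
print(r"\1" == "\\1")      # True

pattern = r"([a-zA-Z])\1{2,}"
escaped = "([a-zA-Z])\\1{2,}"  # the same pattern with the backslash doubled
print(pattern == escaped)  # True
print(re.sub(pattern, r"\1", "friiiiiend"))  # friend
```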

Learn more about backreferences: https://www.regular-expressions.info/backref.html
Learn more about repetition: https://www.regular-expressions.info/repeat.html

You need a regex that searches for multiple occurrences of the same character. That is done with (.)\1, where \1 matches whatever group 1 (the part in parentheses) captured.

To match:

  • 2 occurrences: (.)\1
  • 3 occurrences: (.)\1\1 or (.)\1{2}
  • 4 occurrences: (.)\1\1\1 or (.)\1{3}
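The equivalence between the spelled-out and the quantified forms in the list above can be checked with a short sketch. The helper first_match and the sample strings are assumptions made for the illustration.

```python
import re


def first_match(pattern: str, text: str):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# (.)\1\1 and (.)\1{2} both mean: one character followed by two copies of it.
for sample in ["aaa", "aab", "xxxx"]:
    print(sample, first_match(r"(.)\1\1", sample), first_match(r"(.)\1{2}", sample))
```

Both patterns find the same three-character run, or nothing when the sample has no such run.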

So you can build it with an f-string and the value you want. This is a bit ugly, because the literal braces of the quantifier have to be escaped by doubling them, and inside those goes the single pair of braces that injects the value itself.

import re


def remove_letters(value: str, count: int):
    # Deletes every block of count+1 identical characters; what is left of a run stays
    return re.sub(rf"(.)\1{{{count}}}", "", value)


print(remove_letters("Sayy Hellooooooo", 1))  # Sa Heo
print(remove_letters("Sayy Hellooooooo", 2))  # Sayy Hello
print(remove_letters("Sayy Hellooooooo", 3))  # Sayy Hellooo

The pattern construction may be easier to follow when written as plain string concatenation:

r"(.)\1{" + str(count) + "}"
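As a sanity check, the two ways of building the pattern can be compared directly. This is a small sketch. The variable names from_fstring and from_concat are illustrative.

```python
count = 2

# {{ and }} are literal braces; {count} in the middle injects the number.
from_fstring = rf"(.)\1{{{count}}}"
# The same pattern assembled piece by piece.
from_concat = r"(.)\1{" + str(count) + "}"

print(from_fstring)                 # (.)\1{2}
print(from_fstring == from_concat)  # True
```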

This seems to work:

  • When N=2 : the regex pattern is compiled to: ((\w)\2{2,})
  • When N=3 : the regex pattern is compiled to: ((\w)\2{3,})

Code:

import re
N = 2
p = re.compile(r"((\w)\2{" + str(N) + r",})")

text = "Sayy Hellooooooo"
# Replace every over-long run with N copies of its character, in a single pass
text = p.sub(lambda m: m.group(2) * N, text)

print(text)

Output:

Sayy Helloo

Note:

Also tested with N=3, N=4 and other text inputs.
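A sketch of how such checks for other values of N might look, using the same ((\w)\2{N,}) pattern applied through a substitution callback so that each run is replaced where it was matched. The helper name cap_word_runs and the extra inputs are assumptions for illustration.

```python
import re


def cap_word_runs(text: str, n: int) -> str:
    """Cap runs of a repeated word character at n copies in one pass."""
    # Group 2 is the repeated character; group 1 is the whole over-long run.
    pattern = re.compile(r"((\w)\2{" + str(n) + r",})")
    return pattern.sub(lambda m: m.group(2) * n, text)


for n in (2, 3, 4):
    print(n, cap_word_runs("Sayy Hellooooooo", n))
# 2 Sayy Helloo
# 3 Sayy Hellooo
# 4 Sayy Helloooo

print(cap_word_runs("ooo ooooo", 2))   # oo oo
print(cap_word_runs("1000000___", 2))  # 100__
```

Because \w also matches digits and underscores, runs of those are capped too. Use [a-zA-Z] in place of \w if only letters should be affected.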
