
Match a backreference repeated exactly a given number of times and no more

In a simplified case, I want to extract a digit that is repeated exactly 3 times in a row in an input string: 3 times and no more.

# match a digit (\d), then its backreference (\1) 2 more times
# 11222(333)34445 is matched and consumed,
# then the current position moves to 11222333^34445
In [3]: re.findall(r'(\d)\1{2}','1122233334445')
Out[3]: ['2', '3', '4']

# try to exclude 11222(333)34445 by adding (?!\1) as a negative
# lookahead assertion; it skips the match at
# 11222^(333)34445, but the run is still captured at the next position,
# 112223^(333)4445
In [4]: re.findall(r'(\d)\1{2}(?!\1)','1122233334445')
Out[4]: ['2', '3', '4']
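
To see where each of these matches actually starts, the spans can be listed with re.finditer. This is only a small sketch; the variable names are made up for illustration:

```python
import re

s = '1122233334445'
# collect each match of the lookahead attempt with its (start, end) span
spans = [(m.group(), m.span()) for m in re.finditer(r'(\d)\1{2}(?!\1)', s)]
print(spans)
# [('222', (2, 5)), ('333', (6, 9)), ('444', (9, 12))]
```

The '333' match starts at index 6, the second 3 of the run, which is why the lookahead alone does not exclude it.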

# a backreference cannot appear before the group it refers to
In [5]: re.findall(r'(?!\1)(\d)\1{2}(?!\1)','1122233334445')
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-5-a5837badf5bb> in <module>()
----> 1 re.findall(r'(?!\1)(\d)\1{2}(?!\1)','1122233334445')

/usr/lib/python2.7/re.pyc in findall(pattern, string, flags)
    179 
    180     Empty matches are included in the result."""
--> 181     return _compile(pattern, flags).findall(string)
    182 
    183 if sys.hexversion >= 0x02020000:

/usr/lib/python2.7/re.pyc in _compile(*key)
    249         p = sre_compile.compile(pattern, flags)
    250     except error, v:
--> 251         raise error, v # invalid expression
    252     if not bypass_cache:
    253         if len(_cache) >= _MAXCACHE:

error: bogus escape: '\\1'

But what I expect is ['2', '4'].

Thank you.

You'd need a backreference inside a lookbehind to find the border between different digits before matching the sequence, without consuming anything. Few regex flavors support this. Something like (\d)(?<!\1.)\1{2}(?!\1) works in .NET but not in the Python 2.7 re module used in the question.
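
On newer Python versions the situation may be different: as far as I can tell from the re documentation, fixed-length group references inside a lookbehind are accepted from Python 3.5 on, so the same pattern may run unchanged there. A minimal sketch under that assumption (not tested on the 2.7 setup from the question):

```python
import re

s = '1122233334445'
# ASSUMPTION: Python 3.5+, where a fixed-length group reference
# such as \1 is allowed inside a lookbehind.
# (?<!\1.) looks back two characters: it rejects the position when the
# character before the captured digit is that same digit.
pattern = r'(\d)(?<!\1.)\1{2}(?!\1)'
result = re.findall(pattern, s)
print(result)  # ['2', '4']
```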

One idea is to use "The Great Trick", as @hwnd commented: match the unwanted longer runs in one alternative and capture the wanted runs in another. It performs very well, with the downside of returning some dispensable elements. Another idea, which makes the boundary between two different digits a requirement, is to capture inside a lookbehind:

(?:^|(?<=(\d))(?!\1))(\d)\2{2}(?!\2)
  • (?:^|(?<=(\d))(?!\1)) The part with the lookbehind, which finds the boundary between different digits (or the start of the string).
  • (\d)\2{2}(?!\2) The 2nd capture group captures a digit to \2. It is followed by the same digit exactly 2 more times, and a negative lookahead makes sure the same digit does not follow again.

This should give accurate matches but requires more steps from the regex engine. See the test at regex101.
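
In Python the pattern above has two capture groups, so re.findall would return tuples. A minimal sketch that picks only the second group (variable names are just for illustration):

```python
import re

s = '1122233334445'
pattern = r'(?:^|(?<=(\d))(?!\1))(\d)\2{2}(?!\2)'
# group 1 is the digit seen by the lookbehind, group 2 is the digit
# that is repeated exactly 3 times, so only group 2 is collected
digits = [m.group(2) for m in re.finditer(pattern, s)]
print(digits)  # ['2', '4']
```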

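import re
x = "1122233334445"
# the first alternative swallows runs of 4 or more; keep only hits from the second group
print([j for i, j in re.findall(r"(\d)\1{3,}|(\d)\2{2}", x) if not i])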

Try this. It gives ['2', '4'].
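
The same alternation idea can be wrapped in a function for an arbitrary run length. This is only a sketch; exact_runs and its parameter n are hypothetical names, not part of the re module:

```python
import re

def exact_runs(s, n=3):
    """Return the digits that occur in runs of exactly n (n >= 1).

    Hypothetical helper for illustration only.
    """
    # first alternative consumes runs longer than n,
    # second alternative captures runs of exactly n
    pattern = r'(\d)\1{%d,}|(\d)\2{%d}' % (n, n - 1)
    return [wanted for too_long, wanted in re.findall(pattern, s) if not too_long]

print(exact_runs('1122233334445'))     # ['2', '4']
print(exact_runs('1122233334445', 2))  # ['1']
```

Because the longer runs are consumed by the first alternative, the second one can never start in the middle of them.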

This may work:

>>> re.findall(r'(\d)\1{2}', re.sub(r'(\d)\1{3,}', '',  '1122233334445'))
['2', '4']

Remove every run in which a digit is repeated more than 3 times, then find the digits that are repeated exactly 3 times.
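
One thing to watch for: with an empty replacement, the digits on either side of a removed run become adjacent and can form a new run. A hedged variant that substitutes a space instead (exact_triples is a hypothetical name, and it assumes the input contains only digits):

```python
import re

def exact_triples(s):
    # replace runs of 4 or more with a space (a non-digit) so that the
    # neighbours of a removed run cannot merge into a new run
    cleaned = re.sub(r'(\d)\1{3,}', ' ', s)
    return re.findall(r'(\d)\1{2}', cleaned)

print(exact_triples('1122233334445'))  # ['2', '4']
print(exact_triples('11333311'))       # [] (an empty replacement would give ['1'])
```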
