In a simplified case, I want to extract every digit that is repeated exactly 3 times in the input string, and no more than 3 times.
#match a digit (\d), then its backreference (\1) 2 more times
#11222(333)34445 gets matched and consumed,
#then the current position moves to 11222333^34445
In [3]: re.findall(r'(\d)\1{2}','1122233334445')
Out[3]: ['2', '3', '4']
#try to exclude 11222(333)34445 by adding (?!\1)
#as a negative lookahead assertion; it skips the match at
#11222^(333)34445, but the digit gets captured at the next position
#112223^(333)4445
In [4]: re.findall(r'(\d)\1{2}(?!\1)','1122233334445')
Out[4]: ['2', '3', '4']
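To make the positions described in the comments above visible, the same two patterns can be run through re.finditer, which reports the span of each match. This is only a small sketch using the input string from the session; the variable names are made up for illustration.

```python
import re

s = '1122233334445'

# Plain pattern: the run of four 3s is matched on its first three digits.
plain = [(m.group(1), m.span()) for m in re.finditer(r'(\d)\1{2}', s)]

# With the trailing lookahead: the attempt at index 5 fails, but the engine
# retries at index 6, where "333" is followed by "4", so '3' still shows up.
guarded = [(m.group(1), m.span()) for m in re.finditer(r'(\d)\1{2}(?!\1)', s)]

print(plain)
print(guarded)
```

The only difference between the two results is where the '3' match starts, which is why the lookahead alone does not remove it.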
#backreference cannot go before the referenced group
In [5]: re.findall(r'(?!\1)(\d)\1{2}(?!\1)','1122233334445')
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-5-a5837badf5bb> in <module>()
----> 1 re.findall(r'(?!\1)(\d)\1{2}(?!\1)','1122233334445')
/usr/lib/python2.7/re.pyc in findall(pattern, string, flags)
179
180 Empty matches are included in the result."""
--> 181 return _compile(pattern, flags).findall(string)
182
183 if sys.hexversion >= 0x02020000:
/usr/lib/python2.7/re.pyc in _compile(*key)
249 p = sre_compile.compile(pattern, flags)
250 except error, v:
--> 251 raise error, v # invalid expression
252 if not bypass_cache:
253 if len(_cache) >= _MAXCACHE:
error: bogus escape: '\\1'
But what I expect is ['2','4'].
Thank you.
You'd need a backreference inside a lookbehind to find the border between different digits without consuming anything before the sequence is matched, and only a few regex flavors support that. Something like (\d)(?<!\1.)\1{2}(?!\1)
works in .NET but not in Python's re module as used in the question (Python 2.7).
One idea is to use The Great Trick, as @hwnd commented. It also performs very well, with the downside of returning some dispensable elements that have to be filtered out. Another idea, which treats the boundary between two different digits as a requirement, is to capture inside a lookbehind:
(?:^|(?<=(\d))(?!\1))(\d)\2{2}(?!\2)
(?:^|(?<=(\d))(?!\1))
The part with the lookbehind. It finds the boundary between two different digits, or the start of the string.
(\d)\2{2}(?!\2)
The 2nd capture group captures a digit into \2, which must be followed by the same digit exactly 2 more times. The negative lookahead ensures the run is not followed by the same digit again.
This should give accurate matches but requires more steps from the regex engine. See the test at regex101.
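As a rough check of the lookbehind pattern in Python, here is a minimal sketch. It assumes that Python's re module keeps a capture made inside a fixed-width lookbehind, and it reads the digit from group 2, since group 1 only holds the preceding digit.

```python
import re

# Group 1: the digit just before the run (captured inside the lookbehind).
# Group 2: the digit that is repeated exactly three times.
pattern = re.compile(r'(?:^|(?<=(\d))(?!\1))(\d)\2{2}(?!\2)')

def triples(s):
    # Illustrative helper: collect group 2 from every match.
    return [m.group(2) for m in pattern.finditer(s)]

print(triples('1122233334445'))
```

With re.findall the same pattern returns (group 1, group 2) tuples, so the second element of each tuple would be the one to keep.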
import re
x = "1122233334445"
# the first branch swallows runs of 4 or more; the second captures runs of exactly 3
print([j for i, j in re.findall(r"(\d)\1{3,}|(\d)\2{2}", x) if not i])
Try this. It gives ['2', '4'].
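The same trick can be parameterised by run length. The sketch below is an illustration only: exact_runs is a hypothetical helper name, and it assumes the input is scanned for digit runs exactly as in the one-liner above.

```python
import re

def exact_runs(s, n=3):
    """Return the digits that occur in a run of exactly n (hypothetical helper)."""
    # Branch 1 consumes runs longer than n so they cannot be matched in part;
    # branch 2 captures runs of exactly n.
    pattern = r'(\d)\1{%d,}|(\d)\2{%d}' % (n, n - 1)
    # Keep only the matches where branch 1 did not participate.
    return [keep for discard, keep in re.findall(pattern, s) if not discard]

print(exact_runs('1122233334445'))
```

The order of the alternation matters: the "too long" branch has to come first, so that a long run is consumed whole before the exact-length branch gets a chance to match part of it.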
This may work:
>>> re.findall(r'(\d)\1{2}', re.sub(r'(\d)\1{3,}', '', '1122233334445'))
['2', '4']
Remove every run of a digit repeated more than 3 times, then find the digits repeated exactly 3 times.
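One thing to watch with this approach: deleting a long run can join the digits on either side of it into a new run of three. A possible variation, assuming a non-digit placeholder such as a space is acceptable, replaces the long run instead of deleting it. The helper name exact_triples is made up for this sketch.

```python
import re

def exact_triples(s):
    # Replace runs of four or more with a space (an assumed placeholder)
    # so that the neighbours on either side cannot merge into a new run.
    cleaned = re.sub(r'(\d)\1{3,}', ' ', s)
    return re.findall(r'(\d)\1{2}', cleaned)

print(exact_triples('1122233334445'))
print(exact_triples('2333322'))
```

For an input like '2333322', deleting '3333' outright would leave '222' and report a '2' that never appeared three times in a row; with the placeholder the two parts stay separate.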