I have a string:
string = u'11a2ee22b333c44d5e66e777e8888'
I want to find all chunks of k consecutive digits where n <= k <= m, using a regular expression only. Say, for example, n=2 and m=3, using (?:\D|^)(\d{2,3})(?:\D|$):
re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
Gives this output:
['11', '333', '66']
Desired output:
['11', '22', '333', '44', '66', '777']
I know there are alternate solutions like:
filter(lambda x: re.match('^\d{2,3}$', x), re.split(u'\D',r'11a2ee22b333c44d5e66e777e8888'))
which gives the desired output, but I want to know what's wrong with the first approach?
It seems re.findall goes through the string in sequence and skips the part that has already been matched, so what can be done?
Note: The result you show in your question is not what I'm getting:
>>> import re
>>> re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'44', u'66']
It's still missing some of the matches you want, but not the same ones.
The problem is that even though non-capturing groups like (?:\D|^) and (?:\D|$) don't capture what they match, they still consume it.
This means that the match which yields '22' has actually consumed:
- e, with (?:\D|^): not captured (but still consumed)
- 22, with (\d{2,3}): captured
- b, with (?:\D|$): not captured (but still consumed)
… so that b is no longer available to be matched before 333.
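To see the consumption directly, here is a small sketch (written for Python 3, so the results carry no u prefix) that uses re.finditer to print the whole text of each match rather than only the captured group. The variable names are my own, not from the question.

```python
import re

text = '11a2ee22b333c44d5e66e777e8888'
pattern = r'(?:\D|^)(\d{2,3})(?:\D|$)'

# group(0) is everything the match consumed; group(1) is only the captured digits
consumed = [(m.group(0), m.group(1), m.span()) for m in re.finditer(pattern, text)]
for whole, digits, span in consumed:
    print(repr(whole), repr(digits), span)
```

Each whole match includes the delimiter on both sides (for example 'e22b' for '22'), and the next search starts after it. That is why 333 and 777 have no leading non-digit left to match.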
You can get the result you want with lookbehind and lookahead syntax:
>>> re.findall(u'(?<!\d)\d{2,3}(?!\d)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'333', u'44', u'66', u'777']
Here, (?<!\d) is a negative lookbehind, checking that the match is not preceded by a digit, and (?!\d) is a negative lookahead, checking that the match is not followed by a digit. Crucially, these constructions do not consume any of the string.
The various lookahead and lookbehind constructions are described in the Regular Expression Syntax section of Python's re documentation.
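A quick way to check the "do not consume" claim is to look at the match spans. This is only a sketch in Python 3 with variable names of my own choosing: if the lookarounds are zero-width, every span should cover the digits and nothing else.

```python
import re

text = '11a2ee22b333c44d5e66e777e8888'
pattern = r'(?<!\d)\d{2,3}(?!\d)'

# Lookarounds are zero-width, so each span covers the digits and nothing else
matches = [(m.group(0), m.span()) for m in re.finditer(pattern, text)]
for digits, span in matches:
    print(digits, span)
```

Since no separator is swallowed, the b after 22 is still there when the engine checks what precedes 333.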
Use a lookaround regex: \d{2,3} means 2 or 3 digits, and (?=[a-z]) means a letter must follow the digits.
In [136]: re.findall(r'(\d{2,3})(?=[a-z])',string)
Out[136]: ['11', '22', '333', '44', '66', '777']
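One caveat, as far as I can tell: the lookahead-only pattern works for this particular string, but it has no check on what comes before the digits and it requires a letter afterwards. The sketch below compares it with the lookbehind/lookahead pattern on two made-up inputs (they are hypothetical, not from the question).

```python
import re

lookahead_only = re.compile(r'(\d{2,3})(?=[a-z])')
both_sides = re.compile(r'(?<!\d)\d{2,3}(?!\d)')

samples = ['1234a', 'a22']  # hypothetical inputs, not from the question
results = {s: (lookahead_only.findall(s), both_sides.findall(s)) for s in samples}
for s, (la, both) in results.items():
    print(s, la, both)
```

On '1234a' the lookahead-only pattern picks the tail of a four-digit run, and on 'a22' it misses a run at the end of the string, so which pattern fits depends on what your real data looks like.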
You could even generalize it with a function:
import re
string = "11a2ee22b333c44d5e66e777e8888"
def numbers(n,m):
    # Builds e.g. (?<!\d)(\d{2,3})(?!\d); both literal pieces are raw strings
    rx = re.compile(r'(?<!\d)(\d{' + '{},{}'.format(n,m) + r'})(?!\d)')
    return rx.findall(string)
print(numbers(2,3))
# ['11', '22', '333', '44', '66', '777']
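If you would rather not depend on a global string, a possible variant takes the text as an argument. This is only a sketch: find_digit_runs is a name I made up, and the n <= m check is my own assumption about sensible input.

```python
import re

def find_digit_runs(text, n, m):
    """Return all runs of digits in text whose length k satisfies n <= k <= m."""
    if n > m:
        # Assumption: treat an inverted range as a caller error
        raise ValueError('n must not exceed m')
    # %d substitution keeps the literal braces of the quantifier readable
    pattern = r'(?<!\d)\d{%d,%d}(?!\d)' % (n, m)
    return re.findall(pattern, text)

print(find_digit_runs('11a2ee22b333c44d5e66e777e8888', 2, 3))
```

The same call with (1, 1) or (4, 4) picks out the single digits or the four-digit run instead.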