Match a specific number of digits not preceded or followed by digits

I have a string:

string = u'11a2ee22b333c44d5e66e777e8888'

I want to find every run of consecutive digits whose length k satisfies n <= k <= m.

I want to do this with a regular expression only. For example, with n=2 and m=3, I tried (?:\D|^)(\d{2,3})(?:\D|$):

re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')

Gives this output:

['11', '333', '66']

Desired output:

['11', '22', '333', '44', '66', '777']

I know there are alternate solutions like:

filter(lambda x: re.match(r'^\d{2,3}$', x), re.split(r'\D', u'11a2ee22b333c44d5e66e777e8888'))

which gives the desired output, but I want to know what's wrong with the first approach.

It seems re.findall scans the string from left to right and skips whatever an earlier match has already covered, so what can be done?

Note: The result you show in your question is not what I'm getting:

>>> import re
>>> re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'44', u'66']

It's still missing some of the matches you want, but not the same ones.

The problem is that even though non-capturing groups like (?:\D|^) and (?:\D|$) don't capture what they match, they still consume it.

This means that the match which yields '22' has actually consumed:

  1. e, with (?:\D|^) – not captured (but still consumed)
  2. 22, with (\d{2,3}) – captured
  3. b, with (?:\D|$) – not captured (but still consumed)

… so that b is no longer available to be matched before 333.
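
To see the consumption directly, here is a small sketch that uses re.finditer instead of re.findall, so the full extent of each match is visible. The variable names (text, pattern, consumed) are only illustrative, and the output shown is for Python 3:

```python
import re

text = "11a2ee22b333c44d5e66e777e8888"
pattern = re.compile(r"(?:\D|^)(\d{2,3})(?:\D|$)")

# m.group(0) is everything the match consumed; m.group(1) is the captured digits.
consumed = [(m.span(), m.group(0), m.group(1)) for m in pattern.finditer(text)]
for span, whole, digits in consumed:
    print(span, repr(whole), repr(digits))
# (0, 3) '11a' '11'
# (5, 9) 'e22b' '22'
# (12, 16) 'c44d' '44'
# (17, 21) 'e66e' '66'
```

The match 'e22b' ends at index 9, so the next search starts at the first 3 of 333. There is no non-digit left in front of it, and ^ does not match there either, so 333 is skipped.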

You can get the result you want with lookbehind and lookahead syntax:

>>> re.findall(u'(?<!\d)\d{2,3}(?!\d)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'333', u'44', u'66', u'777']

Here, (?<!\d) is a negative lookbehind, checking that the match is not preceded by a digit, and (?!\d) is a negative lookahead, checking that the match is not followed by a digit. Crucially, these constructions do not consume any of the string.
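
As a quick check that the lookarounds are zero-width, the sketch below prints the span of each match. The variable names and the second sample string are made up for illustration:

```python
import re

text = "11a2ee22b333c44d5e66e777e8888"
pattern = re.compile(r"(?<!\d)\d{2,3}(?!\d)")

# With lookarounds, each match spans only the digits themselves.
matches = [(m.span(), m.group(0)) for m in pattern.finditer(text)]
print(matches)
# [((0, 2), '11'), ((6, 8), '22'), ((9, 12), '333'), ((13, 15), '44'), ((18, 20), '66'), ((21, 24), '777')]

# Negative lookarounds also succeed at the start and end of the string,
# so no ^ or $ alternatives are needed.
boundary_hits = pattern.findall("12x3456x789")
print(boundary_hits)
# ['12', '789']
```

Because nothing outside the digits is consumed, the b after 22 is still available when the engine checks the lookbehind for 333.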

The various lookahead and lookbehind constructions are described in the Regular Expression Syntax section of Python's re documentation.

You can use a lookaround regex: \d{2,3} means 2 or 3 digits, and (?=[a-z]) means the digits must be followed by a lowercase letter.

In [136]: re.findall(r'(\d{2,3})(?=[a-z])',string)
Out[136]: ['11', '22', '333', '44', '66', '777']
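
Note that this lookahead-only pattern works for the sample string but is not equivalent to the lookbehind/lookahead version. A short comparison on two made-up inputs (the names lookahead_only and both_sides are just labels for this sketch):

```python
import re

lookahead_only = re.compile(r"(\d{2,3})(?=[a-z])")
both_sides = re.compile(r"(?<!\d)(\d{2,3})(?!\d)")

# "1234a": a run of four digits; "x12": a valid run at the end of the string.
for sample in ["1234a", "x12"]:
    print(sample, lookahead_only.findall(sample), both_sides.findall(sample))
# 1234a ['234'] []
# x12 [] ['12']
```

Without a lookbehind, the tail of a longer run is accepted, and because the lookahead requires a letter, a run at the very end of the string is missed.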

You could even generalize it with a function:

import re

string = "11a2ee22b333c44d5e66e777e8888"

def numbers(n, m):
    # Builds (?<!\d)(\d{n,m})(?!\d); every fragment containing \d is a raw string.
    rx = re.compile(r'(?<!\d)(\d{' + '{},{}'.format(n, m) + r'})(?!\d)')
    return rx.findall(string)

print(numbers(2,3))
# ['11', '22', '333', '44', '66', '777']
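
A possible variant, if you would rather pass the text in than rely on a global: the function name find_digit_runs and the bounds check are assumptions of this sketch, not part of the answer above.

```python
import re

def find_digit_runs(text, n, m):
    """Return all maximal digit runs in text whose length is between n and m."""
    if not 1 <= n <= m:
        raise ValueError("expected 1 <= n <= m")
    # Same lookbehind/lookahead pattern, with the bounds filled in.
    pattern = re.compile(r"(?<!\d)\d{%d,%d}(?!\d)" % (n, m))
    return pattern.findall(text)

text = "11a2ee22b333c44d5e66e777e8888"
print(find_digit_runs(text, 2, 3))  # ['11', '22', '333', '44', '66', '777']
print(find_digit_runs(text, 1, 1))  # ['2', '5']
print(find_digit_runs(text, 4, 4))  # ['8888']
```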
