
Python regex - optional match

I have a bunch of strings that come in this form:

#q1_a1
#q7

Basically, # is a marker that should be ignored. After the # comes a single letter followed by a number. Optionally, another letter + number combination can follow after an _ (underscore).

Here's what I came up with:

>>> import re
>>> pat = re.compile(r"#(.*)_?(.+)?")
>>> pat.match('#q1').groups()
('q1', None)

The problem is with strings in the #q1_a1 format. When I apply the pattern to such a string:

>>> pat.findall('#q1_f1')
[('q1_f1', '')]

Any suggestions?

As the others have said, the more specific your regex, the less likely it is to match something it shouldn't:

In [13]: re.match(r'#([A-Za-z][0-9])(?:_([A-Za-z][0-9]))?', '#q1_a1').groups()
Out[13]: ('q1', 'a1')

In [14]: re.match(r'#([A-Za-z][0-9])(?:_([A-Za-z][0-9]))?', '#q1').groups()
Out[14]: ('q1', None)

Notes:

  1. If you need to match only the entire string, surround the regex with ^ and $.
  2. You say "some number" but your example only contains a single digit. If your regex needs to accept more than one digit, change the [0-9] to [0-9]+.
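
Both notes can be combined in a small sketch. The helper name parse_label is made up for illustration, and re.fullmatch (available since Python 3.4) is used here in place of explicit ^ and $ anchors; it is one way to require a whole-string match, not the only one:

```python
import re

# [0-9]+ accepts one or more digits; the second part stays optional.
PATTERN = re.compile(r'#([A-Za-z][0-9]+)(?:_([A-Za-z][0-9]+))?')

def parse_label(text):
    """Hypothetical helper: return (first, second) or None if text does not match entirely."""
    m = PATTERN.fullmatch(text)
    return m.groups() if m else None

print(parse_label('#q12_a3'))    # ('q12', 'a3')
print(parse_label('#q7'))        # ('q7', None)
print(parse_label('#q7_'))       # None, a trailing underscore is rejected
print(parse_label('#q7 extra'))  # None, trailing text is rejected
```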

Your ".*" also matches the underscore, because the match is greedy. It is better to write a more specific regex that excludes the underscore from the first group.

A more suitable regex could look like this:

#([a-z][0-9])_?([a-z][0-9])?

but you need to check whether it works for all the data you expect.

P.S. Being more specific in regular expressions is better, as you get fewer false positives.
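
As a quick check of that kind, here is a sketch comparing the pattern above with a variant that puts the underscore inside an optional non-capturing group. The names loose, strict and accepts are illustrative only. Because `_?` and the second group are independently optional in the first pattern, it also accepts inputs such as #q1a1 and #q1_, which may or may not be what you want:

```python
import re

loose = re.compile(r'#([a-z][0-9])_?([a-z][0-9])?')
strict = re.compile(r'#([a-z][0-9])(?:_([a-z][0-9]))?')

def accepts(pattern, text):
    """True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None

for s in ('#q1_a1', '#q1', '#q1a1', '#q1_'):
    print(s, accepts(loose, s), accepts(strict, s))
# #q1_a1 True True
# #q1 True True
# #q1a1 True False
# #q1_ True False
```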

When you use .*, it matches greedily, consuming as many characters as possible (including the underscore). Try:

>>> pat = re.compile(r"#([^_]*)_?(.+)?")
>>> pat.findall('#q1_f1')
[('q1', 'f1')]

It's also better to write a more specific expression:

#([a-z][0-9])(?:_([a-z][0-9]))?
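
To see the greedy behaviour directly, the following sketch inspects what each group captured. The variable names are arbitrary. With `.*` the first group spans the whole remainder, so `_?` and `(.+)?` are left to match nothing; with `[^_]*` the first group has to stop at the underscore:

```python
import re

greedy = re.compile(r'#(.*)_?(.+)?')
negated = re.compile(r'#([^_]*)_?(.+)?')

m_greedy = greedy.match('#q1_f1')
print(m_greedy.groups())   # ('q1_f1', None)
print(m_greedy.span(1))    # (1, 6), group 1 covers everything after '#'

m_negated = negated.match('#q1_f1')
print(m_negated.groups())  # ('q1', 'f1')
```

findall reports the unmatched second group as '' rather than None, which is why the question shows [('q1_f1', '')].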

A simple alternative without using regex:

s = '#q7'
print(s[1:].split('_'))
# ['q7']

s = '#q1_a1'
print(s[1:].split('_'))
# ['q1', 'a1']

This is assuming all of your strings start with #. If that's not the case, then you could easily do some validation:

s = '#q1_a1'
if s.startswith('#'):
    print(s[1:].split('_'))
# ['q1', 'a1']

s = 'q1_a1'
if s.startswith('#'):
    print(s[1:].split('_'))  # Nothing is printed
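
The same idea can be wrapped in a function. This is only a sketch: split_label is a made-up name, and it uses str.partition instead of split so the result always has two slots, like the regex groups. It does not check that each part is a letter followed by digits:

```python
def split_label(s):
    """Hypothetical helper: return (first, second) or None if s lacks the leading '#'."""
    if not s.startswith('#'):
        return None
    # partition splits at the first underscore only; the second part is '' if there is none.
    first, sep, second = s[1:].partition('_')
    return first, (second or None)

print(split_label('#q1_a1'))  # ('q1', 'a1')
print(split_label('#q7'))     # ('q7', None)
print(split_label('q1_a1'))   # None
```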
