python regex - optional match

Question

I have a bunch of strings that come in this flavor:

#q1_a1
#q7

Basically, # is a marker that has to be ignored. After the #, there is a single letter followed by a number. Optionally, another letter + number combination can follow after an _ (underscore).

Here's what I came up with:

>>> import re
>>> pat = re.compile(r"#(.*)_?(.+)?")
>>> pat.match('#q1').groups()
('q1', None)

The problem is with strings in the #q1_a1 format. When I apply my pattern to such a string:

>>> pat.findall('#q1_f1')
[('q1_f1', '')]

Any suggestions?

Answer 1

As the others have said, the more specific your regex, the less likely it is to match something it shouldn't:

In [13]: re.match(r'#([A-Za-z][0-9])(?:_([A-Za-z][0-9]))?', '#q1_a1').groups()
Out[13]: ('q1', 'a1')

In [14]: re.match(r'#([A-Za-z][0-9])(?:_([A-Za-z][0-9]))?', '#q1').groups()
Out[14]: ('q1', None)

Notes:

If you need to match only the entire string, surround the regex with ^ and $.
You say "some number" but your example only contains a single digit. If your regex needs to accept more than one digit, change [0-9] to [0-9]+.
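
To make the two notes above concrete, here is a minimal sketch that combines them. It assumes Python 3.4+, where re.fullmatch is available as an alternative to writing ^ and $ by hand; the name TAG is only an illustration and not part of the original answer.

```python
import re

# Hypothetical name: the pattern from this answer, widened to accept
# one or more digits after each letter.
TAG = re.compile(r'#([A-Za-z][0-9]+)(?:_([A-Za-z][0-9]+))?')

for s in ['#q1', '#q12_a345', '#q1_a1 trailing', '#q1_']:
    # fullmatch succeeds only if the whole string fits the pattern.
    m = TAG.fullmatch(s)
    print(s, '->', m.groups() if m else None)
```

One reason to prefer fullmatch here: $ also matches just before a trailing newline, so '#q1\n' would pass an ^...$ check but is rejected by fullmatch.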

Answer 2

Your ".*" also matches the underscore, because the match is greedy. It is better to write a more specific regex that excludes the underscore from the first group.

A proper regex could look like this:

#([a-z][0-9])_?([a-z][0-9])?

but you need to check whether it works for all the data you expect.

P.S. Being more specific in regular expressions is better, as you get fewer false positives.
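
As one way of doing that check, the sketch below runs the pattern from this answer against a few sample strings, next to a variant that puts the underscore and the second group inside one optional non-capturing group. The names loose, strict, and samples are hypothetical, and the sample strings are assumptions about what the data might contain.

```python
import re

# The pattern from this answer: the underscore and the second group
# are optional independently of each other.
loose = re.compile(r'#([a-z][0-9])_?([a-z][0-9])?')

# Variant: the underscore and the second group are optional together.
strict = re.compile(r'#([a-z][0-9])(?:_([a-z][0-9]))?')

samples = ['#q1', '#q1_a1', '#q1a1', '#q1_']
for s in samples:
    print(s, bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
```

Both patterns accept '#q1' and '#q1_a1'. The first one additionally accepts '#q1a1' and '#q1_', which may or may not be acceptable for your data.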

Answer 3

When you use .*, it matches greedily, consuming as many characters as possible. Try:

>>> pat = re.compile(r"#([^_]*)_?(.+)?")
>>> pat.findall('#q1_f1')
[('q1', 'f1')]

Also, it's better to write a more specific expression:

#([a-z][0-9])(?:_([a-z][0-9]))?
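
A minimal sketch of how the more specific expression might be wrapped for everyday use. The helper name parse_tag and the group names first and second are hypothetical. The None check matters because a failed match returns None, so calling .groups() on it directly would raise AttributeError.

```python
import re

# The specific pattern from above, with named groups added for readability.
_TAG = re.compile(r'#(?P<first>[a-z][0-9])(?:_(?P<second>[a-z][0-9]))?')

def parse_tag(s):
    """Return (first, second_or_None), or None if s is not a valid tag."""
    m = _TAG.fullmatch(s)
    if m is None:
        return None
    return m.group('first'), m.group('second')

print(parse_tag('#q1_f1'))
print(parse_tag('#q7'))
print(parse_tag('q7'))
```

The optional part comes back as None when it is absent, which keeps "no second part" distinct from an empty string.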

Answer 4

A simple alternative without using regex:

s = '#q7'
print(s[1:].split('_'))
# ['q7']

s = '#q1_a1'
print(s[1:].split('_'))
# ['q1', 'a1']

This assumes all of your strings start with #. If that's not the case, you could easily add some validation:

s = '#q1_a1'
if s.startswith('#'):
    print(s[1:].split('_'))
# ['q1', 'a1']

s = 'q1_a1'
if s.startswith('#'):
    print(s[1:].split('_'))  # Nothing is printed
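
If you always want exactly two slots back, a variation on the same idea is str.partition, which splits on the first underscore only. This is just a sketch; split_tag is a hypothetical helper name, and, like the split version, it does not check that each part is a letter followed by digits.

```python
def split_tag(s):
    """Return (first, second_or_None), or None if s lacks the leading '#'."""
    if not s.startswith('#'):
        return None
    # partition returns (head, separator, tail); the separator is ''
    # when no underscore is present.
    first, sep, second = s[1:].partition('_')
    return first, (second if sep else None)

print(split_tag('#q7'))
print(split_tag('#q1_a1'))
print(split_tag('q1_a1'))
```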

4 answers

solution1
3 2013-01-27 07:47:33

solution2
2 2013-01-27 07:39:41

solution3
1 ACCEPTED

solution4
0 2013-01-27 07:45:35
