Python regular expressions non-greedy match

Question

I have this code:

import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>.*?text.*?</b>', a).group())

and I am trying to match a minimal block between <b> and </b> that contains 'text' anywhere in between. This code is the best I could come up with, but it matches:

<b>1234</b><b>56text78</b>

while I need:

<b>56text78</b>

Answer 1

Instead of .*?, use [^<]*:

print(re.search(r'<b>[^<]*text[^<]*</b>', a).group())

Here [^<]* matches any run of characters other than "<", so the match cannot extend across a tag boundary.

Answer 2

Why do you get <b>1234</b><b>56text78</b> as the output when using the <b>.*?text.*?</b> regex?

The regex engine scans the input from left to right. It first takes the pattern <b> from the regex and tries to match it against the input string; as soon as it finds the first <b> tag, it matches it. Next, the engine takes the second part of the pattern together with the literal that follows it, that is, .*?text. This matches any characters up to the first occurrence of text. I say "first" because if there were more than one text after the <b>, .*?text would match only up to the first one. So <b>1234</b><b>56text is matched. Finally, the engine takes the last part, .*?</b>, and matches up to the first </b>, so <b>1234</b><b>56text78</b> is matched.

With the [^<]*text[^<]* regex, the characters between <b> and text, and between text and </b>, must be anything other than <. This prevents the engine from matching across the tags.
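To make the left-to-right behaviour concrete, here is a small sketch that prints where each pattern's match starts and ends. It only uses the string from the question; the variable names lazy and strict are mine, not from the original post.

```python
import re

a = r'<b>1234</b><b>56text78</b><b>9012</b>'

# Lazy dot: the match starts at the first <b> and grows until "text" appears,
# so it crosses the </b><b> boundary.
lazy = re.search(r'<b>.*?text.*?</b>', a)

# Negated class: [^<]* cannot step over "<", so the attempt at the first <b>
# fails and the engine retries from the next <b>.
strict = re.search(r'<b>[^<]*text[^<]*</b>', a)

print(lazy.span(), lazy.group())      # (0, 26) <b>1234</b><b>56text78</b>
print(strict.span(), strict.group())  # (11, 26) <b>56text78</b>
```

Both matches end at the same place; the only difference is where they start, which is why making the quantifier lazy does not help here.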

Answer 3

Why doesn't .*?text produce the desired output?

This is what the regexp engine does:

1. It takes the first character from the search pattern, which is <, and finds it in the string, then takes the second, then the third, until it has matched <b>.
2. The next step takes the whole .*?text pattern and tries to find it in the string. That's because .*? without the text part would make no sense, as it would match zero characters. It matches the 1234</b><b>56text part and appends it to the <b> found in step 1.

It actually does produce a non-greedy output, it's just non-obvious in this case. If the string was:

`<b>1234</b><b>56text78text</b><b>9012</b>`

then the greedy '.*text' match would be:

<b>1234</b><b>56text78text

and the non-greedy one '.*?text' would produce the one I was getting:

<b>1234</b><b>56text
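The greedy versus non-greedy difference on that two-"text" string can be checked with a quick sketch. I am assuming the patterns are anchored with a leading <b>, since the outputs shown above start with it.

```python
import re

s = '<b>1234</b><b>56text78text</b><b>9012</b>'

# Greedy: .* grabs as much as possible, then backs off to the LAST "text".
greedy = re.search(r'<b>.*text', s).group()

# Non-greedy: .*? grabs as little as possible, stopping at the FIRST "text".
lazy = re.search(r'<b>.*?text', s).group()

print(greedy)  # <b>1234</b><b>56text78text
print(lazy)    # <b>1234</b><b>56text
```

In both cases the match still begins at the first <b>, because the engine reports the leftmost position where the pattern can succeed.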

So to answer the initial question, the correct solution is to exclude the '<' and '>' characters from the search:

import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>[^<>]*text.*?</b>', a).group())
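If more than one block may contain the word, the same idea extends to re.findall. This is only a sketch: blocks_containing is a hypothetical helper name, the sample string has an extra <b>text</b> block appended for illustration, and it assumes the <b> blocks hold plain text with no nested tags.

```python
import re

def blocks_containing(word, html):
    """Return every <b>...</b> block whose text contains `word`.

    Hypothetical helper; assumes no tags are nested inside <b>...</b>.
    """
    # re.escape keeps regex metacharacters in `word` from being interpreted.
    pattern = r'<b>[^<>]*' + re.escape(word) + r'[^<>]*</b>'
    return re.findall(pattern, html)

a = '<b>1234</b><b>56text78</b><b>9012</b><b>text</b>'
print(blocks_containing('text', a))  # ['<b>56text78</b>', '<b>text</b>']
```

For real HTML with nesting or attributes, a proper parser is probably a safer choice than a regex.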
