
Python regular expressions non-greedy match

I have this code:

import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>.*?text.*?</b>', a).group())

and I am trying to match a minimal block between <b> and </b> which contains 'text' anywhere in between. This code is the best I could come up with, but it matches:

<b>1234</b><b>56text78</b>

while I need:

<b>56text78</b>

Instead of .*?, use a negated character class:

print(re.search(r'<b>[^<]*text[^<]*</b>', a).group())

Here [^<]* tells the engine to match any characters except <, so the match cannot run across a tag.

Why do you get <b>1234</b><b>56text78</b> as the output when using the regex <b>.*?text.*?</b>?

The regex engine scans the input from left to right. It first takes the pattern <b> and tries to match it against the input string; as soon as it finds the first <b> tag, it matches it. Next it takes the second part of the pattern together with the literal that follows it, that is .*?text, and matches any characters up to the first occurrence of text. I say "first" because if there is more than one text after <b>, .*?text stops at the first one. So <b>1234</b><b>56text is matched. Finally the engine takes the last part, .*?</b>, and matches up to the first </b>, so <b>1234</b><b>56text78</b> is the overall match.

With the regex <b>[^<]*text[^<]*</b>, the characters between <b> and text, and between text and </b>, may be anything except <. This prevents the engine from matching across the tags.
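The difference between the two patterns can be checked side by side. This is a minimal sketch written for Python 3 (the code in the question uses the Python 2 print statement); the variable names lazy and negated are only illustrative.

```python
import re

a = r'<b>1234</b><b>56text78</b><b>9012</b>'

# Lazy dot: the match is anchored at the first <b>, and .*? is allowed
# to walk over </b><b>, so the result spans two tags.
lazy = re.search(r'<b>.*?text.*?</b>', a).group()

# Negated class: [^<]* cannot consume '<', so the attempt at the first
# <b> fails and the engine retries from the second <b>.
negated = re.search(r'<b>[^<]*text[^<]*</b>', a).group()

print(lazy)     # <b>1234</b><b>56text78</b>
print(negated)  # <b>56text78</b>
```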

Why doesn't <b>.*?text produce the desired output?

This is what the regex engine does:

  1. It takes the first character of the search pattern, which is <, and finds it in the string, then takes the second, then the third, until it has matched <b>.
  2. It then takes the whole .*?text pattern and tries to find it in the string. That is because .*? without the text part would make no sense, as it would match 0 characters. It matches the 1234</b><b>56text part and appends it to the <b> found in step 1.
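The two steps above can be observed by searching with successively longer prefixes of the pattern. This is only a rough illustration in Python 3 of where each prefix ends up matching, not a trace of the engine's internals; the names step1, lazy_only and step2 are made up for the example.

```python
import re

a = r'<b>1234</b><b>56text78</b><b>9012</b>'

# Step 1: the literal <b> matches at the earliest possible position.
step1 = re.search(r'<b>', a)

# A trailing .*? with nothing after it is satisfied by 0 characters.
lazy_only = re.search(r'<b>.*?', a)

# Step 2: with 'text' required afterwards, .*? has to grow until
# 'text' can follow, still starting from the same <b>.
step2 = re.search(r'<b>.*?text', a)

print(step1.start(), step1.group())  # 0 <b>
print(lazy_only.group())             # <b>
print(step2.start(), step2.group())  # 0 <b>1234</b><b>56text
```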

It actually does produce a non-greedy match; it is just not obvious in this case. If the string were:

<b>1234</b><b>56text78text</b><b>9012</b>

then the greedy '<b>.*text' match would be:

<b>1234</b><b>56text78text

and the non-greedy one '<b>.*?text' would produce the one I was getting:

<b>1234</b><b>56text
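The greedy and non-greedy results on that second string can be reproduced with a short Python 3 sketch; the variable names s, greedy and lazy are illustrative.

```python
import re

s = r'<b>1234</b><b>56text78text</b><b>9012</b>'

# Greedy: .* grabs as much as possible, then backs off to the LAST 'text'.
greedy = re.search(r'<b>.*text', s).group()

# Non-greedy: .*? grabs as little as possible, stopping at the FIRST 'text'.
lazy = re.search(r'<b>.*?text', s).group()

print(greedy)  # <b>1234</b><b>56text78text
print(lazy)    # <b>1234</b><b>56text
```

Both matches start at the first <b>; laziness only changes where the match ends, not where it begins.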

So to answer the initial question, the correct solution is to exclude the '<>' characters from the search:

import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>[^<>]*text.*?</b>', a).group())


 