Python non-greedy regular expression is not exactly what I expected

string: XXaaaXXbbbXXcccXXdddOO

I want to match the shortest substring that begins with 'XX' and ends with 'OO'.

So I wrote the non-greedy regex: r'XX.*?OO'

>>> import re
>>> str = 'XXaaaXXbbbXXcccXXdddOO'
>>> re.findall(r'XX.*?OO', str)
['XXaaaXXbbbXXcccXXdddOO']

I thought it would return ['XXdddOO'], but it behaved so 'greedily'.

Then I realized I must be mistaken: the regex above first matches the leftmost 'XX', and only then does the quantifier behave non-greedily.

But I still want to figure out how I can get my result ['XXdddOO'] directly. Any reply is appreciated.

So far, the key point is actually not about non-greedy matching as such, but about what I took non-greedy to mean: that it should match as few characters as possible between the left delimiter (XX) and the right delimiter (OO). The fact, of course, is that the string is processed from left to right.
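To see why the lazy quantifier does not help here, it can be useful to look at where the match starts. The following is a minimal sketch using the string from the question; the variable names and the second test string are my own, chosen for illustration.

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

# The engine tries each start position from left to right and keeps the
# first one that can succeed. Position 0 already works, so the later
# 'XX' occurrences are never tried as starting points.
m = re.search(r'XX.*?OO', text)
print(m.span())  # (0, 22)

# Laziness only affects where the match ends: with two 'OO' endings,
# the lazy form stops at the first one and the greedy form at the last.
two_ends = 'XXaaaOObbbOO'
lazy = re.search(r'XX.*?OO', two_ends).group()
greedy = re.search(r'XX.*OO', two_ends).group()
print(lazy)    # XXaaaOO
print(greedy)  # XXaaaOObbbOO
```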

How about:

.*(XX.*?OO)

The match will be in group 1.
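A short sketch of how this could be used from Python, assuming the string from the question (the variable names are illustrative):

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

# The greedy '.*' first runs to the end of the string, then backtracks
# just far enough for 'XX.*?OO' to match, so group 1 starts at the
# last 'XX' that is followed by 'OO'.
m = re.search(r'.*(XX.*?OO)', text)
last = m.group(1)
print(last)  # XXdddOO

# With exactly one capture group, findall returns that group's contents.
found = re.findall(r'.*(XX.*?OO)', text)
print(found)  # ['XXdddOO']
```

Because the leading `.*` swallows everything in front of the group, this approach yields at most one result per line.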

Indeed, the issue is not with greedy/non-greedy. The solution suggested by @devnull should work, provided you want to avoid even a single X between your XX and OO groups.

Otherwise, you'll have to use a lookahead (i.e. a piece of regex that scouts ahead in the string and checks whether it can match, without actually consuming any characters). Something like this:

re.findall(r'XX(?:.(?!XX))*?OO', str)

With this negative lookahead, you match (non-greedily) any character (.) that is not followed by XX.
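As a quick check, here is a sketch that runs this pattern on a longer input. The second half of the test string is made up for illustration, to show that several results can come back from one call:

```python
import re

# The question's string plus a second, invented occurrence.
text = 'XXaaaXXbbbXXcccXXdddOO and XXeeeXXfffOO'

# The group is non-capturing, so findall returns the whole matches.
pattern = re.compile(r'XX(?:.(?!XX))*?OO')
matches = pattern.findall(text)
print(matches)  # ['XXdddOO', 'XXfffOO']
```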

Regexes work from left to right: non-greedy means that, given XXaaaXXdddOOiiiOO, it will match XXaaaXXdddOO and not the whole XXaaaXXdddOOiiiOO. If your data structure is that fixed, you could do:

XX[a-z]{3}OO

to select all patterns like XXiiiOO. It can be adjusted to fit your needs, for instance with XX[^X]+?OO, which selects everything from the last XX before an OO up to that OO: in XXiiiXXdddFFcccOOlll it would match XXdddFFcccOO.
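Both variants can be tried directly. This is a small sketch using the strings mentioned in this thread (the variable names are mine):

```python
import re

# Fixed-width form: exactly three lowercase letters between the markers.
fixed = re.findall(r'XX[a-z]{3}OO', 'XXaaaXXbbbXXcccXXdddOO')
print(fixed)  # ['XXdddOO']

# Negated class: no 'X' at all may appear between 'XX' and 'OO'.
loose = re.findall(r'XX[^X]+?OO', 'XXiiiXXdddFFcccOOlll')
print(loose)  # ['XXdddFFcccOO']
```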

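The behaviour is due to the fact that the string is processed from left to right. A way to avoid the problem is to use a negated character class, here wrapped in a lookahead and a backreference so that each chunk is consumed without being backtracked into (an emulated atomic group):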

XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO
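A sketch of how this pattern might be applied in Python, assuming the string from the question. Since the pattern contains a capture group, `re.findall` returns that group's contents rather than the whole match, so `finditer` with `group(0)` is used here instead:

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

pattern = re.compile(r'XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO')

# group(0) is the full matched text for each match.
whole = [m.group(0) for m in pattern.finditer(text)]
print(whole)  # ['XXdddOO']

# findall reports only the last value captured by group 1.
inner = pattern.findall(text)
print(inner)  # ['ddd']
```

Note that the `+` requires at least one character between XX and OO.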
