
Python non-greedy regular expression is not exactly what I expected

String: XXaaaXXbbbXXcccXXdddOO

I want to match the shortest string that begins with 'XX' and ends with 'OO'.

So I wrote the non-greedy regex: r'XX.*?OO'

>>> import re
>>> str = 'XXaaaXXbbbXXcccXXdddOO'
>>> re.findall(r'XX.*?OO', str)
['XXaaaXXbbbXXcccXXdddOO']

I thought it would return ['XXdddOO'], but it behaved so 'greedily'.

Then I realised I must be mistaken: the pattern above first matches the leftmost 'XX', and only then does the quantifier behave 'non-greedily'.

But I still want to figure out how I can get my result ['XXdddOO'] directly. Any reply appreciated.

So far, the key point is actually not non-greediness as such. Or, in other words, it is about what non-greedy means in my eyes: the pattern should match as few characters as possible between the left delimiter (XX) and the right delimiter (OO). And of course the fact is that the string is processed from left to right.
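The left-to-right behaviour described above can be checked directly. The following is a minimal sketch (the variable names `text`, `m`, `m2` and `greedy` are illustrative, not from the question): it shows that the match starts at the first position where the pattern can succeed, and that laziness only decides where the match ends.

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

# The engine tries start positions from left to right and keeps the first
# one at which the whole pattern can succeed. Index 0 already works, so the
# later 'XX' occurrences are never considered as starting points.
m = re.search(r'XX.*?OO', text)
print(m.start(), m.group())   # 0 XXaaaXXbbbXXcccXXdddOO

# Laziness does matter for the END of the match: with two 'OO'
# terminators, '.*?' stops at the first one and '.*' runs to the last.
m2 = re.search(r'XX.*?OO', 'XXaaaOObbbOO')
greedy = re.search(r'XX.*OO', 'XXaaaOObbbOO')
print(m2.group())             # XXaaaOO
print(greedy.group())         # XXaaaOObbbOO
```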

How about:

.*(XX.*?OO)

The match will be in group 1.
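As a rough illustration of this suggestion (the variable names are mine, and the second input string is a made-up example), the leading greedy `.*` first consumes the whole string and then backtracks only as far as needed, so group 1 begins at the last `XX` from which an `OO` can still be reached:

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

# Greedy '.*' pushes the start of group 1 as far right as possible.
m = re.search(r'.*(XX.*?OO)', text)
result = m.group(1) if m else None
print(result)   # XXdddOO

# Caveat: because '.*' eats everything before the last candidate, only the
# last XX...OO segment is reported, even when earlier ones exist.
only_last = re.findall(r'.*(XX.*?OO)', 'XXaOOXXbOO')
print(only_last)   # ['XXbOO']
```

Note that `.` does not match newlines by default, so on multi-line input this behaves per line unless `re.DOTALL` is passed.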

Indeed, the issue is not greedy versus non-greedy. The solution suggested by @devnull should work, provided you want to avoid even a single X between your XX and OO groups.

Otherwise, you'll have to use a lookahead (i.e. a piece of regex that goes "scouting" ahead in the string and checks whether it can be satisfied, without actually consuming any characters). Something like this:

re.findall(r'XX(?:.(?!XX))*?OO', str)

With this negative lookahead, you match (non-greedily) any character ( . ) that is not followed by XX.
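Here is a small sketch of that lookahead pattern in use. The second input string is an invented example, added only to show that several segments can be found in one pass:

```python
import re

pattern = r'XX(?:.(?!XX))*?OO'

# A character is consumed only if the two characters after it are not 'XX',
# so a match can never run across a later 'XX' pair.
single = re.findall(pattern, 'XXaaaXXbbbXXcccXXdddOO')
print(single)    # ['XXdddOO']

# Unlike the '.*(...)' approach, every qualifying segment is returned.
several = re.findall(pattern, 'XXaaaXXbbbOOcccXXdddOO')
print(several)   # ['XXbbbOO', 'XXdddOO']
```

Because the pattern has no capturing group (`(?:...)` is non-capturing), `re.findall` returns the full matches.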

Regexes work from left to right: non-greedy means that it will match XXaaaXXdddOO and not XXaaaXXdddOOiiiOO.

If your data structure is that fixed, you could do:

XX[a-z]{3}OO

to select all patterns like XXiiiOO. It can be adjusted to fit your needs: XX[^X]+?OO, for instance, selects everything from the last XX pair before an OO up to that OO. For example, in XXiiiXXdddFFcccOOlll it would match XXdddFFcccOO.
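Both variants can be tried quickly. This is only a sketch; the last input string is a made-up case showing the limitation of `[^X]`, which rejects any single X inside the body:

```python
import re

# Fixed-shape version: exactly three lowercase letters between XX and OO.
fixed = re.findall(r'XX[a-z]{3}OO', 'XXaaaXXbbbXXcccXXdddOO')
print(fixed)       # ['XXdddOO']

# Negated-class version: the body may contain anything except 'X'.
negated = re.findall(r'XX[^X]+?OO', 'XXiiiXXdddFFcccOOlll')
print(negated)     # ['XXdddFFcccOO']

# Limitation: a lone 'X' in the body prevents the match entirely.
lone_x = re.findall(r'XX[^X]+?OO', 'XXaXbOO')
print(lone_x)      # []
```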

The behaviour is due to the fact that the string is processed from left to right. A way to avoid the problem is to use a negated character class:

XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO
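A hedged sketch of how this pattern might be used from Python follows. One thing to watch: the pattern contains a capturing group, so `re.findall` returns the contents of group 1 rather than the whole match. Using `finditer` with `group(0)` avoids that. The variable names and the last input string are illustrative assumptions.

```python
import re

pattern = re.compile(r'XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO')
text = 'XXaaaXXbbbXXcccXXdddOO'

# group(0) is the full match, regardless of capturing groups.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)                 # ['XXdddOO']

# findall reports only the last value captured by group 1.
captured = pattern.findall(text)
print(captured)                # ['ddd']

# A single X or a single O is allowed inside the body; only the pairs
# 'XX' and 'OO' act as delimiters.
singles = [m.group(0) for m in pattern.finditer('XXaXbOcOO')]
print(singles)                 # ['XXaXbOcOO']
```

The `(?=(...))\1` construct captures inside a lookahead and then consumes exactly what was captured, which stops the engine from backtracking into that piece. Because of the `+`, at least one character is required between XX and OO.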
