
Greedy match with negative lookahead in a regular expression

I have a regular expression with which I'm trying to extract every group of letters that is not immediately followed by a "(" symbol. For example, the following regular expression operates on a mathematical formula that includes variable names (x, y, and z) and function names (movav and movsum). Both are composed entirely of letters, but only the function names are followed by a "(".

re.findall(r"[a-zA-Z]+(?!\()", "movav(x/2, 2)*movsum(y, 3)*z")

I would like the expression to return the array

['x', 'y', 'z']

but it instead returns the array

['mova', 'x', 'movsu', 'y', 'z']

I can see in theory why the regular expression returns the second result, but is there a way I can modify it to return just the array ['x', 'y', 'z']?

Add a word-boundary matcher \b:

>>> re.findall(r'[a-zA-Z]+\b(?!\()', "movav(x/2, 2)*movsum(y, 3)*z")
['x', 'y', 'z']

\b matches the empty string at the beginning or end of a word, so now you're looking for letters followed by a word boundary that isn't immediately followed by (. Because there is no word boundary inside movav, the engine can no longer backtrack to mova and succeed. For more details, see the re docs.
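To make the backtracking concrete, here is a small side-by-side run of the question's pattern and the \b version on the same formula. The variable names are only for illustration.

```python
import re

formula = "movav(x/2, 2)*movsum(y, 3)*z"

# Original pattern: when the lookahead fails after "movav", the engine
# backtracks by one letter and retries, so "mova" (followed by "v") matches.
without_boundary = re.findall(r"[a-zA-Z]+(?!\()", formula)

# With \b, the run of letters must end at a word boundary. There is no
# boundary between "a" and "v", so the shortened match is rejected too.
with_boundary = re.findall(r"[a-zA-Z]+\b(?!\()", formula)

print(without_boundary)  # ['mova', 'x', 'movsu', 'y', 'z']
print(with_boundary)     # ['x', 'y', 'z']
```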

Another solution which doesn't rely on word boundaries:

Check that the letters aren't followed by either a ( or another letter.

>>> re.findall(r'[a-zA-Z]+(?![a-zA-Z(])', "movav(x/2, 2)*movsum(y, 3)*z")
['x', 'y', 'z']
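The two approaches agree on the formula in the question, but they can differ once names contain digits. The sketch below assumes a hypothetical function name sum2, which is not in the original question, only to show that difference.

```python
import re

# Hypothetical formula: "sum2" is a function name containing a digit.
formula = "sum2(a)+b"

# \b version: there is no word boundary between "m" and "2",
# so no part of "sum2" is returned.
boundary = re.findall(r"[a-zA-Z]+\b(?!\()", formula)

# Character-class version: "2" is neither a letter nor "(",
# so the lookahead passes and "sum" is returned.
char_class = re.findall(r"[a-zA-Z]+(?![a-zA-Z(])", formula)

print(boundary)    # ['a', 'b']
print(char_class)  # ['sum', 'a', 'b']
```

If names in your formulas are letters only, as in the question, either pattern should do.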

You need to limit matches to whole words, so use \b to match the beginning or end of a word:

re.findall(r"\b[a-zA-Z]+\b(?!\()", "movav(x/2, 2)*movsum(y, 3)*z")

An alternate approach: find strings of letters followed by either end-of-string or a non-letter, non-parenthesis character, then capture the letter portion.

re.findall("([a-zA-Z]+)(?:[^a-zA-Z(]|$)", "movav(x/2, 2)*movsum(y, 3)*z")
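As a quick check, the suggested patterns can be run against the question's formula in one go. This is only a sketch; the dictionary keys are made-up labels. Note that when a pattern has exactly one capturing group, re.findall returns the group's text rather than the whole match, which is why the last pattern yields only the letters.

```python
import re

formula = "movav(x/2, 2)*movsum(y, 3)*z"

# Labels are illustrative; the patterns are the ones suggested in the answers.
patterns = {
    "trailing boundary": r"[a-zA-Z]+\b(?!\()",
    "letter-or-paren lookahead": r"[a-zA-Z]+(?![a-zA-Z(])",
    "both boundaries": r"\b[a-zA-Z]+\b(?!\()",
    "capture group": r"([a-zA-Z]+)(?:[^a-zA-Z(]|$)",
}

results = {label: re.findall(p, formula) for label, p in patterns.items()}

for label, found in results.items():
    print(f"{label}: {found}")
```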
