
Non-greedy regex not matching as expected

Given the following string as input:

[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0

I'm trying to match the value of subj, i.e. in the above case the expected output would be cli.

I don't understand why my regex is not working:

subj = re.match(r"(.*)subj=(.*?)|(.*)", line).group(2)

From what I can tell, the second group here should be cli, but I'm getting an empty result.

The | has a special meaning in regex (it creates alternations), so escape it as:

>>> re.match(r"(.*)subj=(.*?)\|", line).group(2)
'cli'
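
To see why the unescaped pattern returns an empty string, it can help to inspect its groups directly. The following is a minimal sketch that reuses the sample line from the question; the variable names broken and fixed are only illustrative.

```python
import re

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|"
        "srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|"
        "raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

# Unescaped: the pattern is read as "(.*)subj=(.*?)" OR "(.*)".
# The first alternative succeeds as soon as "subj=" is found, and the
# lazy (.*?) at its end is free to match nothing at all.
broken = re.match(r"(.*)subj=(.*?)|(.*)", line)
print(repr(broken.group(2)))   # ''
print(broken.group(3))         # None: the second alternative was not used

# Escaped: the lazy group must now expand until it reaches a literal "|".
fixed = re.match(r"(.*)subj=(.*?)\|", line)
print(fixed.group(2))          # cli
```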

Another Solution

You can use re.search() so that you can get rid of the groups before subj and after the |.

Example

>>> re.search(r"subj=(.*?)\|", line).group(1)
'cli'

Here we use group(1) because only one group is captured, instead of the several groups in the earlier patterns.
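
One thing to keep in mind is that re.search() returns None when nothing matches, so calling .group(1) directly raises an AttributeError on lines without a subj field. A small sketch of a guard follows; the helper name extract_subj is hypothetical and not part of the original answer.

```python
import re

def extract_subj(line):
    """Return the value of the subj field, or None if it is absent."""
    m = re.search(r"subj=(.*?)\|", line)
    return m.group(1) if m else None

print(extract_subj("mod=syn|subj=cli|os=Windows 7 or 8|"))  # cli
print(extract_subj("mod=syn|os=Windows 7 or 8|"))           # None
```

Note that this pattern still expects a | after the value, so it would not match if subj were the last field on the line.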


Complex version

You can even get rid of all the capturing if you use lookarounds:

>>> re.search(r"(?<=subj=).*?(?=\|)", line).group(0)
'cli'
  • (?<=subj=) Checks that the string matched by .*? is preceded by subj=.

  • .*? Matches anything, non-greedily.

  • (?=\|) Checks that this match is followed by a |.
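
The lookarounds only assert what surrounds the match; they do not consume any characters. A short sketch on a shortened version of the sample line (an assumption made for readability) shows where the match starts and ends:

```python
import re

line = "mod=syn|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0"

m = re.search(r"(?<=subj=).*?(?=\|)", line)

# group(0) is the whole match; the lookarounds contribute no characters.
print(m.group(0))                      # cli
print(line[m.start():m.end()])         # cli
# The text just before the match is what the lookbehind asserted.
print(line[m.start() - 5:m.start()])   # subj=
# The character right after the match is what the lookahead asserted.
print(line[m.end()])                   # |
```

Python's re module requires lookbehind patterns to have a fixed width, which the literal subj= satisfies.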

You need to escape |. Use the following:

subj = re.match(r"(.*)subj=(.*?)\|(.*)", line).group(2)
                                ^


I'd recommend using the following regex, because it should perform better thanks to two changes:

  • adding the beginning-of-line anchor ^
  • using the negated character class [^\|]*, which is faster than (.*?)

Code

subj = re.match(r"^.*\|subj=([^\|]*)", line).group(1)

Regex:

^.*\|subj=([^\|]*)
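
If you want to check the performance claim on your own data, a rough timing comparison might look like the sketch below. It compares a lazy quantifier against a negated class using re.search and precompiled patterns, which is an assumption made for the comparison rather than the exact pattern above; the numbers will vary by machine and Python version.

```python
import re
import timeit

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|"
        "srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|"
        "raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

lazy = re.compile(r"subj=(.*?)\|")
negated = re.compile(r"subj=([^|]*)")

# Both patterns extract the same value from the sample line.
print(lazy.search(line).group(1), negated.search(line).group(1))  # cli cli

# Rough timing only; no particular result is implied here.
for name, pattern in (("lazy", lazy), ("negated", negated)):
    seconds = timeit.timeit(lambda: pattern.search(line), number=100000)
    print(name, round(seconds, 3))
```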


The pipe sign | needs to be escaped, like so:

subj = re.match(r"(.*)subj=(.*?)\|(.*)", s).group(2)

I would use a negated class [^|]* with re.search for better performance:

import re
p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
test_str = "[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0"
print(p.search(test_str).group(2))  # print() form works on Python 2 and 3


Note that I am not using both lazy and greedy quantifiers in the same regex (doing so is usually not advisable).

The pipe symbol must be escaped to be treated as a literal | symbol.

REGEX EXPLANATION:

  • ^ - Start of string
  • (.*) - The first capturing group, matching characters from the beginning up to subj=
  • subj= - A literal string subj=
  • ([^|]*) - The second capturing group, matching any characters other than a literal pipe (inside a character class, the pipe does not need escaping)
  • \| - A literal pipe (must be escaped)
  • (.*) - The third capturing group (if you need the rest of the string up to the end)
  • $ - End of string
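
As a quick illustration of the three groups described above, the sketch below unpacks them from a shortened version of the sample line (shortened here only for readability; the names before, value and after are illustrative):

```python
import re

p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
line = "mod=syn|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0"

# groups() returns all capturing groups in order as a tuple.
before, value, after = p.search(line).groups()
print(before)  # mod=syn|srv=192.168.1.100/80|
print(value)   # cli
print(after)   # os=Windows 7 or 8|dist=0
```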

