Non-greedy regex not matching as expected
Given the following string as input:
[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0
I'm trying to match the value of subj, i.e. in the above case the expected output would be cli.
I don't understand why my regex is not working:
subj = re.match(r"(.*)subj=(.*?)|(.*)", line).group(2)
From what I can tell, the second group here should be cli, but I'm getting an empty result.
The | has a special meaning in regex (it creates alternation), so escape it:
>>> re.match(r"(.*)subj=(.*?)\|", line).group(2)
'cli'
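To see why the original pattern gave an empty result, it can help to print every group. The sketch below assumes `line` holds the sample input from the question. With the unescaped |, the regex is an alternation between (.*)subj=(.*?) and (.*), and the lazy (.*?) at the end of the first branch is free to match nothing.

```python
import re

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|"
        "srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|"
        "raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

# Unescaped |: two alternatives, "(.*)subj=(.*?)" or "(.*)".
unescaped = re.match(r"(.*)subj=(.*?)|(.*)", line)
# The first alternative succeeds, and the lazy group stops immediately.
print(repr(unescaped.group(2)))   # ''
print(unescaped.group(3))         # None (the second alternative is never used)

# Escaped \|: the lazy group must now extend up to a literal pipe.
escaped = re.match(r"(.*)subj=(.*?)\|", line)
print(repr(escaped.group(2)))     # 'cli'
```

So the non-greedy quantifier is working as designed: nothing after it in that branch forces it to consume any characters.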
Another solution
You can use re.search() so that you can get rid of the groups before subj and after the |.
Example
>>> re.search(r"subj=(.*?)\|", line).group(1)
'cli'
Here we use group(1), since only one group is captured, rather than three as in the original pattern.
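The reason re.search() lets you drop the leading (.*) is that re.match() only tries the pattern at the start of the string. A small sketch, using a shortened sample line that is made up for illustration:

```python
import re

# Shortened sample in the same key=value|key=value layout (illustrative only).
line = "mod=syn|cli=192.168.1.99/49244|subj=cli|os=Windows 7 or 8"

# re.match only tries at position 0, where the text is "mod=...", not "subj=".
print(re.match(r"subj=(.*?)\|", line))   # None

# re.search scans forward until the pattern fits somewhere in the string.
found = re.search(r"subj=(.*?)\|", line)
print(found.group(1))                     # cli
print(found.group(0))                     # subj=cli|
```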
Complex version
You can even get rid of all the capturing if you use lookarounds:
>>> re.search(r"(?<=subj=).*?(?=\|)", line).group(0)
'cli'
(?<=subj=) checks that the string matched by .*? is preceded by subj=.
.*? matches anything, non-greedily.
(?=\|) checks that this match is followed by a |.
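Because the lookarounds are zero-width, the whole match is already the value, which is convenient with findall. A minimal sketch, assuming two made-up log lines in the same layout as the question's input:

```python
import re

# Two shortened, invented log lines (illustrative only).
log = ("mod=syn|cli=10.0.0.1/1111|subj=cli|os=Linux\n"
       "mod=syn+ack|cli=10.0.0.2/2222|subj=srv|os=Windows\n")

pattern = re.compile(r"(?<=subj=).*?(?=\|)")

# With no capturing groups, findall returns the whole match for each hit.
values = pattern.findall(log)
print(values)            # ['cli', 'srv']

# "subj=" and "|" are only checked, not consumed, so group(0) is the bare value.
first = pattern.search(log)
print(first.group(0))    # cli
```

Note that Python's re module requires a lookbehind to have a fixed width, which the literal subj= satisfies.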
You need to escape |. Use the following:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", line).group(2)
                                ^
I'd recommend using the following regex, because it should perform better thanks to two additions/substitutions:
^ at the beginning
[^\|]* is faster than (.*?)
Code
subj = re.match(r"^.*\|subj=([^\|]*)", line).group(1)
Regex:
^.*\|subj=([^\|]*)
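If you want to check the performance claim on your own data, something like the following could be used. It is only a rough sketch: the iteration count is arbitrary, and the timings depend on the machine, the Python version and the input, so no numbers are shown here.

```python
import re
import timeit

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|"
        "srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|"
        "raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

lazy = re.compile(r"(.*)subj=(.*?)\|(.*)")
negated = re.compile(r"^.*\|subj=([^\|]*)")

# Both patterns extract the same value from the sample line.
print(lazy.match(line).group(2))      # cli
print(negated.match(line).group(1))   # cli

# Rough timing of each pattern; compare the two numbers on your own machine.
for name, func in [("lazy", lambda: lazy.match(line)),
                   ("negated", lambda: negated.match(line))]:
    print(name, timeit.timeit(func, number=10000))
```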
The pipe sign | needs to be escaped, like so:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", s).group(2)
I would use a negated character class [^|]* with re.search for better performance:
import re
p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
test_str = "[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0"
print(p.search(test_str).group(2))  # prints: cli
See IDEONE demo
Note that I am not mixing lazy and greedy quantifiers in the regex (doing so is usually not advisable).
The pipe symbol must be escaped to be treated as a literal | symbol.
REGEX EXPLANATION:
^ - Start of string
(.*) - The first capturing group, which matches characters from the beginning up to subj=
subj= - A literal string subj=
([^|]*) - The second capturing group, matching any characters other than a literal pipe (inside a character class, it does not need escaping)
\| - A literal pipe (must be escaped)
(.*) - The third capturing group (if you need to get the rest of the string, up to the end)
$ - End of string
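The same breakdown can be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is just a sketch of that idea, run against a shortened sample line that is made up for illustration:

```python
import re

# The pattern from this answer, spelled out so each part carries a comment.
pattern = re.compile(r"""
    ^           # start of string
    (.*)        # group 1: everything before the key
    subj=       # the literal key
    ([^|]*)     # group 2: the value, any run of non-pipe characters
    \|          # a literal pipe
    (.*)        # group 3: the rest of the line
    $           # end of string
""", re.VERBOSE)

# Shortened sample line (illustrative only).
line = "mod=syn|cli=192.168.1.99/49244|subj=cli|os=Windows 7 or 8|dist=0"

before, value, after = pattern.match(line).groups()
print(before)   # mod=syn|cli=192.168.1.99/49244|
print(value)    # cli
print(after)    # os=Windows 7 or 8|dist=0
```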