
Non-greedy regex not matching as expected

Given the following string as input:

[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0

I'm trying to extract the value of subj; in the above case the expected output would be cli.

I don't understand why my regex is not working:

subj = re.match(r"(.*)subj=(.*?)|(.*)", line).group(2)

From what I can tell, the second group in here should be cli but I'm getting an empty result.

The | character has a special meaning in regex (it creates an alternation), so you need to escape it:

>>> re.match(r"(.*)subj=(.*?)\|", line).group(2)
'cli'
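To see why the unescaped version returns an empty string, it can help to run both patterns side by side. The following is a minimal sketch, assuming line holds the log entry from the question:

```python
import re

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80"
        "|subj=cli|os=Windows 7 or 8|dist=0|params=none"
        "|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

# Unescaped: the pattern reads as "(.*)subj=(.*?)" OR "(.*)".
# The left alternative succeeds, and the lazy (.*?) at its very end
# is satisfied by matching nothing at all.
unescaped = re.match(r"(.*)subj=(.*?)|(.*)", line)
print(repr(unescaped.group(2)))  # ''
print(unescaped.group(3))        # None (the right alternative was never used)

# Escaped: the lazy group now has to expand until it reaches a literal pipe.
escaped = re.match(r"(.*)subj=(.*?)\|(.*)", line)
print(escaped.group(2))          # cli
```

A lazy quantifier matches as little as it can, so with nothing required after it, the smallest possible match is the empty string.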

Another Solution

You can use re.search() instead, which lets you drop the groups before subj= and after the |, because re.search() scans the whole string rather than matching only at its start.

Example

>>> re.search(r"subj=(.*?)\|", line).group(1)
'cli'

Here we use group(1) because only one group is captured, instead of the three in the original pattern.
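The difference between the two functions can be sketched as follows. The sample string is a shortened, hypothetical version of the log line, used only to keep the example compact:

```python
import re

# Shortened sample (an assumption for this sketch, not the full log line).
line = "mod=syn|cli=192.168.1.99/49244|subj=cli|os=Windows 7 or 8"

# re.match only tries to match at the start of the string, where "subj=" is absent.
anchored = re.match(r"subj=(.*?)\|", line)
print(anchored)        # None

# re.search scans forward until the pattern matches somewhere in the string.
found = re.search(r"subj=(.*?)\|", line)
print(found.group(1))  # cli
```

This is why the re.match() versions need a leading (.*) or .* to consume everything before subj=.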


Complex version

You can even drop the capturing groups altogether by using lookarounds:

>>> re.search(r"(?<=subj=).*?(?=\|)", line).group(0)
'cli'
  • (?<=subj=) Checks that the text matched by .*? is preceded by subj=.

  • .*? Matches anything, non-greedily.

  • (?=\|) Checks that the matched text is followed by a |.
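The three pieces above can be checked with a short sketch. Both sample strings are hypothetical, shortened inputs chosen for illustration:

```python
import re

pattern = re.compile(r"(?<=subj=).*?(?=\|)")

# Shortened sample (an assumption for this sketch).
line = "mod=syn|subj=cli|os=Windows 7 or 8"
m = pattern.search(line)
print(m.group(0))  # cli
print(m.groups())  # () because lookarounds capture nothing

# If subj is the last field, no pipe follows it and the lookahead fails.
last_field = "mod=syn|subj=cli"
print(pattern.search(last_field))  # None
```

Because lookarounds are zero-width, the whole match is the value itself. The second case shows a limitation worth keeping in mind: this pattern relies on a | following the value.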

You need to escape the |. Use the following:

subj = re.match(r"(.*)subj=(.*?)\|(.*)", line).group(2)
                                ^

Regex101

I'd recommend the following regex, which should perform better thanks to two changes:

  • adding the start-of-line anchor ^ (redundant with re.match, which already anchors at the start of the string, but useful with re.search)
  • using the negated character class [^\|]* instead of the lazy (.*?), which is faster

Code

subj = re.match(r"^.*\|subj=([^\|]*)", line).group(1)

regex:

^.*\|subj=([^\|]*)

Regular expression visualization

Debuggex Demo
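The performance claim can be checked locally rather than taken on faith. The following is a rough timing sketch using the standard timeit module; the repeat count is arbitrary, and the results will vary by machine and Python version, so no particular outcome is assumed here:

```python
import re
import timeit

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80"
        "|subj=cli|os=Windows 7 or 8|dist=0|params=none"
        "|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

lazy = re.compile(r"(.*)subj=(.*?)\|")        # lazy quantifier version
negated = re.compile(r"^.*\|subj=([^\|]*)")   # negated character class version

# Both patterns should extract the same value before we compare their speed.
lazy_value = lazy.match(line).group(2)
negated_value = negated.match(line).group(1)

runs = 20000  # arbitrary repeat count, chosen only for this sketch
t_lazy = timeit.timeit(lambda: lazy.match(line), number=runs)
t_negated = timeit.timeit(lambda: negated.match(line), number=runs)

print(f"lazy:    {t_lazy:.4f}s for {runs} runs")
print(f"negated: {t_negated:.4f}s for {runs} runs")
```

On a single short line the difference is likely to be small; it would matter more when processing many lines or much longer fields.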

The pipe sign | needs to be escaped, like so:

subj = re.match(r"(.*)subj=(.*?)\|(.*)", s).group(2)

I would use a negated class [^|]* with re.search for better performance:

import re

# The negated class [^|]* stops at the first pipe, so no lazy quantifier is needed.
p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
test_str = "[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0"
print(p.search(test_str).group(2))  # cli

See IDEONE demo

Note that I am not mixing lazy and greedy quantifiers in the same regex, which is usually not advisable.

The pipe symbol must be escaped to be treated as a literal | symbol.

REGEX EXPLANATION :

  • ^ - Start of string
  • (.*) - The first capturing group, which matches everything from the beginning of the string up to subj=
  • subj= - A literal string subj=
  • ([^|]*) - The second capturing group, matching any characters other than a literal pipe (inside a character class, the pipe does not need escaping)
  • \| - A literal pipe (must be escaped)
  • (.*) - The third capturing group (if you need the rest of the string up to the end)
  • $ - End of string
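The breakdown above can be verified by unpacking all three groups at once. This is a small sketch using the same pattern and the log line from the question; the variable names before, subj, and after are chosen here for illustration:

```python
import re

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80"
        "|subj=cli|os=Windows 7 or 8|dist=0|params=none"
        "|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')

# groups() returns the three captures in order:
# text before "subj=", the value itself, and everything after the pipe.
before, subj, after = p.search(line).groups()

print(subj)        # cli
print(before[-7:])  # last few characters before "subj="
print(after[:17])   # first few characters after the pipe
```

If only the value is needed, the first and third groups can be dropped or made non-capturing.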

