
Previous group match in Python regex

I am trying to capture fragments of a string that look like %a , %b , etc. and replace them with some values. Additionally, I want to be able to escape the % character by typing %% .

In the example string %d%%f%x%%%g I want to match %d , %x and %g , but not %%f or %% , where the % is escaped.

My regular expression looks like this:

(?:[^%]|^)(?:%%)*(%[a-z])
  • (?:[^%]|^) - matches the beginning of the line or any character other than %
  • (?:%%)* - matches 0 or more occurrences of %% (an escaped % )
  • (%[a-z]) - the actual match for the %a , %b , etc. patterns

The first two elements are there to support escaping of the % character.

However, when I run the regexp on the example string, the last fragment ( %g ) is not found:

>>> import re
>>> pat = re.compile("(?:[^%]|^)(?:%%)*(%[a-z])")
>>> pat.findall("%d%%f%x%%%g")
['%d', '%x']

but after adding a character before %%%g , it works fine:

>>> pat.findall("%d%%f%x %%%g")
['%d', '%x', '%g']

It looks like x is not matched against [^%] again after it has been matched by the group (%[a-z]) . How can I change the regexp to force it to check the last character of the previous match again? I read about \G , but it didn't help.

Why didn't it pick up the %g ?

To pick up %g , the pattern needs the %% before it, and before that either a non-% character or the beginning of the string. So x%%%g would give you a match. But this x was already consumed by the previous match (the one that returned %x ).
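The overlap can be made visible by printing the span of each full match next to the captured group. This is only a small sketch using re.finditer with the pattern from the question; the names text and spans are chosen here for illustration.

```python
import re

pat = re.compile(r"(?:[^%]|^)(?:%%)*(%[a-z])")
text = "%d%%f%x%%%g"

# Pair the span of the whole match with the captured group.
spans = [(m.span(), m.group(1)) for m in pat.finditer(text)]
for span, group in spans:
    print(span, group)
# (0, 2) %d
# (4, 7) %x
```

The second match covers indices 4 to 6 ( f%x ), so the x at index 6 is already used up. The scan resumes at index 7, where only % characters come before %g , and neither [^%] nor ^ can match there.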

In short, your matches overlap. You can overcome this by placing your regex inside a lookahead, (?= ... ) , which matches without consuming any characters:

pat = re.compile("(?=(?:[^%]|^)(?:%%)*(%[a-z]))")
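As a quick check of the lookahead version, assuming the same example string as in the question: every match is an empty string, so nothing is consumed, but the capture group inside the lookahead is still filled and findall returns it.

```python
import re

# The whole original pattern sits inside a zero-width lookahead.
pat = re.compile(r"(?=(?:[^%]|^)(?:%%)*(%[a-z]))")

found = pat.findall("%d%%f%x%%%g")
print(found)  # ['%d', '%x', '%g']
```

Because the x is never consumed, it is still available as the [^%] character in front of %%%g .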

You need to construct your regex a little differently:

>>> import re
>>> regex = re.compile(r"(?:[^%]|%%)*(%[a-z])")
>>> regex.findall("%d%%f%x%%%g")
['%d', '%x', '%g']

Explanation:

(?:      # Start of a non-capturing group:
 [^%]    # Either match any character except %
|        # or
 %%      # match an "escaped" %.
)*       # Do this any number of times.
(        # Match and capture in group 1:
 %[a-z]  # % followed by a lowercase ASCII letter
)        # End of capturing group
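The same pattern can be written with re.VERBOSE so the comments live inside the regex. The sketch below also tries a second input, "%d%%f" , which is not from the original post: because the pattern is not anchored, the search can start in the middle of an escaped %% , so this is worth testing against your own inputs before relying on it.

```python
import re

regex = re.compile(r"""
    (?:         # non-capturing group, repeated any number of times:
        [^%]    #   any character except a percent sign
      | %%      #   or an escaped percent sign
    )*
    (%[a-z])    # group 1: percent sign followed by a lowercase ASCII letter
""", re.VERBOSE)

ok = regex.findall("%d%%f%x%%%g")
edge = regex.findall("%d%%f")   # illustrative extra input, not from the post
print(ok)    # ['%d', '%x', '%g']
print(edge)  # ['%d', '%f']
```

In the second call the attempt starting at the first % of %%f fails, the scan moves one character forward, and the second % plus f is then reported as %f .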

It seems to me that you want to catch only those %x fragments that are preceded by an even number of % characters.

If so, the pattern is "(?<!%)(?:%%)*(%[a-z])"
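A short sketch of the lookbehind pattern in use, with the character class written as [a-z] . The first input is the example from the question; the other two are extra inputs added here for illustration.

```python
import re

# (?<!%) only allows a match to start where the previous character is not %.
pat = re.compile(r"(?<!%)(?:%%)*(%[a-z])")

main = pat.findall("%d%%f%x%%%g")
escaped_tail = pat.findall("%d%%f")  # %%f is an escaped % followed by f
leading = pat.findall("%%%g")        # escaped % followed by %g
print(main)          # ['%d', '%x', '%g']
print(escaped_tail)  # ['%d']
print(leading)       # ['%g']
```

The lookbehind inspects the string itself rather than the end of the previous match, so a character consumed by an earlier match can still serve as the non-% boundary.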
