
Is regex always greedy even when I give it look ahead and look behind requirements?

I have an re.sub program which substitutes certain values in between commas in text_string :

re.sub('(?:(?<=\,)|(?<=^))[^\w\d\r\n\t]*(HUN)[^\w\d\r\n\t]*(?=(?:\,|$))','',text_string,flags=re.IGNORECASE)

which replaces HUN with nothing.

I run this on many files. Some are huge, some are small. Occasionally I get a MemoryError from the re.py library. What is the best way to split up this execution so that I do not get a MemoryError?

I'm afraid that the regex is looking at the ENTIRE string first (e.g. if text_string is t,w,g,g,hun,t,w), before looking between the commas, instead of just looking between the commas (i.e. in a non-greedy way). Does anyone know how this actually gets evaluated?

If the string is super long, does the regex know to evaluate just between the commas in a non-greedy way? Thanks.

Your pattern is really weird.

  • (?:(?<=\,)|(?<=^)) - the comma does not need escaping, and the two lookbehinds can be turned into a regular non-capturing group (?:,|^). That group consumes the comma, so capture it and put it back in the replacement.
  • [^\w\d] - since \w already matches everything \d matches, \d is redundant
  • [^\w\r\n\t]* - matches punctuation(!), and therefore commas too. That makes it hard for the regex engine to analyze strings that have many comma-separated values before your hun.
  • (?=(?:,|$)) - the lookahead makes sense if you plan to match overlapping strings (for example two hun fields in a row, which share a comma); otherwise, you can replace it with (?:,|$).
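The two claims about the character class in the list above can be checked directly. This is only a small sketch; the name NON_WORD and the sample string are made up for illustration.

```python
import re

# \w already covers digits, so adding \d to the class changes nothing.
assert re.fullmatch(r"\w", "7") is not None

# The negated class from the question, without the redundant \d.
NON_WORD = re.compile(r"[^\w\r\n\t]*")

# It accepts commas, so it can run straight across field boundaries.
print(repr(NON_WORD.match(",,; ,x").group()))  # → ',,; ,'
```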

I suggest:

r"(?i)(?:,|^)[^\w\r\n\t]*(HUN)[^\w\r\n\t]*(?=(?:,|$))"


Python demo:

import re

s = ",WWWWWW,hun,hun,WWWWW,"
# Groups 1 and 2 keep the leading comma and any padding; only HUN itself is dropped.
print(re.sub(r"(?i)((?:,|^)[^\w\r\n\t]*)HUN([^\w\r\n\t]*)(?=(?:,|$))", r"\1\2", s))
# => ,WWWWWW,,,WWWWW,
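If the padding around HUN never contains a comma (an assumption about the data, not something the question states), the comma can be added to the negated class so the padding can no longer cross a field boundary. The following is a sketch of that variation, not the pattern suggested above; FIELD_HUN and blank_hun are hypothetical names.

```python
import re

# Same idea as the pattern above, but "," is excluded from the padding class,
# so a match can never extend past the field it started in.
FIELD_HUN = re.compile(
    r"((?:,|^)[^\w\r\n\t,]*)HUN([^\w\r\n\t,]*)(?=,|$)",
    re.IGNORECASE,
)

def blank_hun(text):
    """Drop HUN from every field that holds only HUN plus non-word padding."""
    return FIELD_HUN.sub(r"\1\2", text)

print(blank_hun(",WWWWWW,hun,hun,WWWWW,"))  # → ,WWWWWW,,,WWWWW,
print(blank_hun("a,hunter,b"))              # → a,hunter,b
```

Fields that merely start with hun, such as hunter, are left alone because the lookahead requires a comma or the end of the string right after the padding.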

You can do it in a faster way without regex like this:

s = 't,w,g,g,hun,t,w'
res = ','.join(['' if x.lower()=='hun' else x for x in s.split(',')])
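For the MemoryError on huge files, the same split/join idea can be applied one line at a time so the whole file is never held in memory. This is a minimal sketch that assumes the data is line-oriented and that no field spans a line break; blank_field and process are hypothetical helper names, and io.StringIO stands in for real file objects.

```python
import io

def blank_field(line, target="hun"):
    """Return the line with every field equal to target (ignoring case) emptied."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]          # preserve the original line ending
    fields = body.split(",")
    cleaned = ["" if f.lower() == target else f for f in fields]
    return ",".join(cleaned) + ending

def process(src, dst):
    """Stream src to dst line by line, so only one line is in memory at a time."""
    for line in src:
        dst.write(blank_field(line))

# In-memory stand-ins for open(...) file objects.
src = io.StringIO("t,w,g,g,hun,t,w\nHUN,a\n")
dst = io.StringIO()
process(src, dst)
print(repr(dst.getvalue()))  # → 't,w,g,g,,t,w\n,a\n'
```

Unlike the regex, this comparison is exact, so a field with padding around hun (for example " hun ") is left unchanged.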
