I have an re.sub call that substitutes certain values between commas in text_string:
re.sub('(?:(?<=\,)|(?<=^))[^\w\d\r\n\t]*(HUN)[^\w\d\r\n\t]*(?=(?:\,|$))','',text_string,flags=re.IGNORECASE)
It replaces HUN with nothing.
I run this on many files. Sometimes the files are huge, sometimes they are small. Occasionally, I get a MemoryError from the re.py library. What is the best way to split up this execution so that I do not get a MemoryError?
I'm afraid that the regex is looking at the ENTIRE string first (e.g. if text_string is t,w,g,g,hun,t,w) before looking between the commas, instead of just looking between the commas (i.e. in a non-greedy way). Does anyone know how this actually gets evaluated?
If the string is very long, does the regex know to evaluate just between the commas in a non-greedy way? Thanks.
Your pattern is really weird.
(?:(?<=\,)|(?<=^)) - This can be turned into a regular non-capturing group, (?:,|^). That group consumes the comma, so the comma has to be captured and put back in the replacement.
[^\w\d] - since \w already matches \d, \d is redundant.
[^\w\r\n\t]* - matches punctuation(!) and thus , too. It makes it hard for the regex engine to analyze strings that have many comma-separated values before your hun.
(?=(?:,|$)) - the lookahead makes sense if you plan to match overlapping strings; otherwise, you can replace it with (?:,|$).
I suggest:
r"(?i)(?:,|^)[^\w\r\n\t]*(HUN)[^\w\r\n\t]*(?=(?:,|$))"
See regex demo
import re
s = ",WWWWWW,hun,hun,WWWWW,"
# Capture the text around HUN and put it back, so only HUN itself is removed.
print(re.sub(r"(?i)((?:,|^)[^\w\r\n\t]*)HUN([^\w\r\n\t]*)(?=(?:,|$))", r"\1\2", s))
# => ,WWWWWW,,,WWWWW,
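The point above about [^\w\r\n\t]* also matching commas can be addressed by excluding the comma from that class, so the padding can never run across field boundaries. The following is only a sketch under the assumption that a field never contains a comma; PADDING, HUN_FIELD and strip_hun are illustrative names, not part of the original code, and whether this is what triggers the MemoryError has not been established.

```python
import re

# Assumption: a field never contains a comma, so the padding class may exclude it.
# With "," excluded, the padding cannot swallow separators and then backtrack.
PADDING = r"[^\w\r\n\t,]*"
HUN_FIELD = re.compile(
    r"((?:,|^)" + PADDING + r")HUN(" + PADDING + r")(?=,|$)",
    re.IGNORECASE,
)

def strip_hun(text):
    """Remove HUN from fields holding only HUN plus optional punctuation."""
    # \1 and \2 restore the leading comma and any padding around HUN.
    return HUN_FIELD.sub(r"\1\2", text)

print(strip_hun(",WWWWWW,hun,hun,WWWWW,"))
# => ,WWWWWW,,,WWWWW,
```

Compiling the pattern once also avoids re-parsing it for every file.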
You can do it in a faster way without regex like this:
s = 't,w,g,g,hun,t,w'
res = ','.join(['' if x.lower()=='hun' else x for x in s.split(',')])
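As for splitting up the work on huge files, one option is to process the input one line at a time instead of loading it into a single string. This is a minimal sketch that assumes the data is line-oriented (it does not help if a file is one giant line); remove_hun_fields and clean_stream are hypothetical helper names. Note that, unlike the regex, the split-based check only blanks fields that are exactly hun, with no surrounding punctuation or spaces.

```python
import io

def remove_hun_fields(line):
    """Blank out every comma-separated field that is exactly 'hun' (any case)."""
    # Keep the line ending out of the split so the last field compares cleanly.
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    fields = ("" if field.lower() == "hun" else field for field in body.split(","))
    return ",".join(fields) + ending

def clean_stream(src, dst):
    """Copy src to dst line by line, so only one line is held in memory."""
    for line in src:
        dst.write(remove_hun_fields(line))

# Usage with in-memory streams; file objects returned by open() work the same way.
src = io.StringIO("t,w,g,g,hun,t,w\nHUN,a,hun\n")
dst = io.StringIO()
clean_stream(src, dst)
print(dst.getvalue(), end="")
# => t,w,g,g,,t,w
# => ,a,
```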