Consider the regular expression
^(?:\s*(?:[\%\#].*)?\n)*\s*function\s
It is intended to match Octave/MATLAB script files that start with a function definition.
However, this regular expression is extremely slow on some inputs, and I'm not entirely sure why. For example, if I try evaluating it in Python:
>>> import re, time
>>> r = re.compile(r"^(?:\s*(?:[\%\#].*)?\n)*\s*function\s")
>>> t0=time.time(); r.match("\n"*15); print(time.time()-t0)
0.0178489685059
>>> t0=time.time(); r.match("\n"*20); print(time.time()-t0)
0.532235860825
>>> t0=time.time(); r.match("\n"*25); print(time.time()-t0)
17.1298530102
In English, that last line is saying that my regular expression takes 17 seconds to evaluate on a simple string containing 25 newline characters!
What is it about my regex that is making it so slow, and what could I do to fix it?
EDIT: To clarify, I would like my regex to match the following string containing comments:
# Hello world
function abc
including any amount of whitespace, but not
x = 10
function abc
because then the string does not start with "function". Note that comments can start with either "%" or with "#".
Replace each \s with [\t\f ] so that it no longer matches newlines. Consuming a newline should be left to the non-capturing group as a whole: (?:[\t\f ]*(?:[\%\#].*)?\n).
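As a minimal sketch of that substitution applied to the whole pattern from the question (the names FIXED and starts_with_function are made up for this example, and [\t\f ] is just one reasonable choice of "horizontal whitespace"; add \r to the class if you need to handle Windows line endings on blank lines):

```python
import re

# Same structure as the original pattern, but every \s before a newline or
# before "function" is replaced by [\t\f ], so only the explicit \n inside
# the group can consume a line break.
FIXED = re.compile(r"^(?:[\t\f ]*(?:[%#].*)?\n)*[\t\f ]*function\s")

def starts_with_function(source):
    """Return True if only blank/comment lines precede the first 'function'."""
    return FIXED.match(source) is not None

print(starts_with_function("# Hello world\nfunction abc\n"))  # comment first
print(starts_with_function("x = 10\nfunction abc\n"))         # code first
print(starts_with_function("\n" * 25))                        # the slow input
```

With this version each newline can be matched in only one way, so the 25-newline input from the question should fail quickly instead of backtracking.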
The problem is that you have three greedy consumers that all match '\n' (\s*, (...\n)* and again \s*).
In your last timing example, the engine will try out all strings a, b and c (one for each consumer) that together make up 25*'\n' or any prefix d of it. If e is the part that is left unconsumed, then d+e == 25*'\n'.
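Now count all combinations of a, b, c and e such that a+b+c+e == d+e == 25*'\n', allowing one or more of the variables to be the empty string. It's too late for me to do the maths right now, but I bet the number is huge :D Your own timings point the same way: each 5 extra newlines costs roughly 30 times more, which is about a doubling per newline.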
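For a rough idea of the size of that number, here is a small counting sketch. It is only a simplified model of the search, not a trace of what Python's re engine actually does: it assumes each pass through the group swallows one or more newlines, and that the trailing \s* then takes some of whatever is left before "function" fails. The helper names are made up for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def compositions(m):
    """Ways to split m newlines into an ordered run of non-empty chunks
    (one chunk per pass through the (?:\\s*...\\n)* group)."""
    if m == 0:
        return 1
    return sum(compositions(m - k) for k in range(1, m + 1))

def candidate_splits(n):
    """Ways the group can take m of n newlines, times the n - m + 1
    choices for how many of the rest the trailing \\s* grabs."""
    return sum(compositions(m) * (n - m + 1) for m in range(n + 1))

for n in (15, 20, 25):
    print(n, candidate_splits(n))
```

Under these assumptions the count works out to 2**(n+1) - 1, so about 67 million candidate splits for 25 newlines, doubling with every extra newline.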
By the way, regex101 is a great site for trying out regular expressions. It automatically breaks expressions up and explains their parts, and it even provides a debugger.
To speed this up, you can use this regex:
p = re.compile(r"^\s*function\s", re.MULTILINE)
Since you're not actually capturing the lines starting with # or % anyway, you can use MULTILINE mode and start matching from the line where the function keyword is found.
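One thing worth checking before adopting this: in MULTILINE mode ^ matches at the start of every line, so with search() the pattern finds a function line anywhere in the file, and with match() it only looks at the very beginning of the string. A small sketch using the two sample inputs from the question (the variable names are made up for this example):

```python
import re

p = re.compile(r"^\s*function\s", re.MULTILINE)

commented = "# Hello world\nfunction abc\n"
with_code = "x = 10\nfunction abc\n"

# search() scans forward and accepts a "function" line anywhere.
found_commented = p.search(commented) is not None
found_with_code = p.search(with_code) is not None

# match() only tries position 0, so a leading comment line blocks it.
match_commented = p.match(commented) is not None

print(found_commented, found_with_code, match_commented)
```

So this pattern is fast, but as far as I can tell it also accepts the "x = 10" file when used with search(), which the question wanted to reject. If that distinction matters, the comment-skipping prefix is still needed.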