How to improve the performance of this regular expression?

Question

Consider the regular expression

^(?:\s*(?:[\%\#].*)?\n)*\s*function\s

It is intended to match Octave/MATLAB script files that start with a function definition.

However, this regular expression is incredibly slow, and I'm not entirely sure why. For example, if I try evaluating it in Python,

>>> import re, time
>>> r = re.compile(r"^(?:\s*(?:[\%\#].*)?\n)*\s*function\s")
>>> t0=time.time(); r.match("\n"*15); print(time.time()-t0)
0.0178489685059
>>> t0=time.time(); r.match("\n"*20); print(time.time()-t0)
0.532235860825
>>> t0=time.time(); r.match("\n"*25); print(time.time()-t0)
17.1298530102

In English, that last line is saying that my regular expression takes 17 seconds to evaluate on a simple string containing 25 newline characters!

What is it about my regex that is making it so slow, and what could I do to fix it?

EDIT: To clarify, I would like my regex to match the following string containing comments:

# Hello world
function abc

including any amount of whitespace, but not

x = 10
function abc

because then the string does not start with "function". Note that comments can start with either "%" or with "#".

Answer 1

Replace your \s with [\t\f ] so that it no longer matches newlines. Newlines should be consumed only by the explicit \n at the end of the non-capturing group, (?:[\t\f ]*(?:[\%\#].*)?\n) .
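Applied to the whole pattern, that suggestion might look like the sketch below. The names FIXED and starts_with_function are only illustrative, and dropping the backslashes inside the character class is a cosmetic choice, since % and # need no escaping there.

```python
import re

# Horizontal whitespace only: newlines are consumed solely by the explicit \n,
# so each line can be matched in exactly one way.
FIXED = re.compile(r"^(?:[\t\f ]*(?:[%#].*)?\n)*[\t\f ]*function\s")


def starts_with_function(text):
    """Return True if only blank/comment lines precede the first 'function'."""
    return FIXED.match(text) is not None


print(starts_with_function("# Hello world\nfunction abc"))  # True
print(starts_with_function("x = 10\nfunction abc"))         # False
print(starts_with_function("\n" * 25))                      # False
```

One thing to watch: unlike \s, the class [\t\f ] does not match \r, so a blank line ending in \r\n would stop the match. If that matters for your files, adding \r to the class should be safe, because it still cannot match \n.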
The problem is that you have three greedy consumers that can all match '\n' (the inner \s* , the (...\n)* group and the final \s* ).
In your last timing example, the engine tries out all strings a , b and c (one for each consumer) that make up 25*'\n' or any prefix d of it. If e is the part that is ignored, then d+e == 25*'\n' .
Now find all combinations of a , b , c and e so that a+b+c+e == d+e == 25*'\n' , allowing one or more of them to be the empty string. It's too late for me to do the maths right now, but I bet the number is huge :D
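For what it's worth, the maths can be roughed out in a few lines. This is only a simplified model of the backtracking, not an exact account of what the engine does: it assumes each pass through the repeated group eats k >= 1 newlines (k - 1 via \s* and one via the explicit \n), and that the group may stop after any prefix of the input. The function names are made up for the illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways_to_split(n):
    # Number of ways the repeated group can consume exactly n newlines,
    # where every iteration takes at least one of them.
    if n == 0:
        return 1
    return sum(ways_to_split(n - k) for k in range(1, n + 1))


def total_paths(n):
    # The group may stop after 0..n newlines; the final \s* absorbs the rest
    # before "function" fails to match.
    return sum(ways_to_split(k) for k in range(n + 1))


for n in (15, 20, 25):
    print(n, total_paths(n))
# 15 32768
# 20 1048576
# 25 33554432
```

Under this model the count doubles with every extra newline, so five more newlines cost about 32 times as much. That is in line with the timings in the question, which grow by a factor of roughly 30 per step.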

By the way, regex101 is a great site for trying out regular expressions. It automatically breaks up expressions and explains their parts, and it even provides a debugger.

Answer 2

To speed this up, you can use this regex:

p = re.compile(r"^\s*function\s", re.MULTILINE)

Since you're not actually capturing the lines starting with # or % anyway, you can use MULTILINE mode and start matching from the line where the function keyword is found.
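A quick check of how this pattern behaves on the two examples from the question is sketched below. It assumes the pattern is used with search(), since match() only ever tries position 0 of the string, even in MULTILINE mode. The variable names are just for illustration.

```python
import re

p = re.compile(r"^\s*function\s", re.MULTILINE)

with_comment = "# Hello world\nfunction abc"
with_code = "x = 10\nfunction abc"

# In MULTILINE mode, ^ matches at the start of every line, so search()
# finds the "function" line wherever it appears in the string.
print(p.search(with_comment) is not None)  # True
print(p.search(with_code) is not None)     # True

# match() is anchored at position 0, so it does not skip the comment line.
print(p.match(with_comment) is not None)   # False
```

So this version is fast, but as far as I can tell it does not by itself reject input where code precedes the function line. If that distinction matters, the check on preceding lines still has to happen somewhere.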

2 answers

solution1
2 ACCEPTED 2015-09-02 19:12:23

solution2
0 2015-09-02 06:28:30
