I have List collection with around 35,000 strings
Typical string looks like this:
"<i>füüs</i>ampri tähis;lüh ld-st<i>anno</i>, aastal;<i>maj</i> lüh pr-st<i>argent</i>, raha (kursisedelitel)"
Basically this string contains bunch of words in Estonian :)
I need to allow user to perform RegExp search on 35,000 strings
If I perform search using /ab.*/ expression, then search takes less than a second
If I perform search using /.*ab/ expression, then search takes around 10 seconds
My question is: How can I make second search faster (less then 1.5 seconds)?
Thank You very much
It's how regular expressions are processed that makes them perform so different. To explain that based on your examples:
/.*ab/
This expression consists on two sub-expressions, the .*
and the literal ab
. This would be processed as follows: In the normal greedy mode, where every quantor and thus the match is expanded to its maximum, the .*
will first match the whole string. Then it will try to match the following ab
. But as it is not possible (we're already at the end of the string) backtracking will be used to find the last point where both sub-expressions match. So the match of .*
is reduced by one character and again the ab
is tested. If the whole expression cannot be matched, the match of .*
again will be reduced by one character until the whole expression is matched. In the worst case there is no ab
in the string and the algorithm will do n +1 backtracks and additional tests for ab
until it can determine that a match is impossible.
/ab.*/
This expression consists of two sub-expressions too. But here the order is changed, first the literal ab
and than the .*
. This is processed as follows: The algorithm first tries to find the literal ab
by comparing one character by another. If there is a match, it tries to find a match for .*
that is obvious easy.
The main difference between those two regular expressions is, that the second has the static part at the beginning and the variable part at the end. This makes no backtracking necessary.
Try some regular expression analysis tool such as RegexBuddy to see the difference visually.
Use compiled regular expressions for better performance
http://en.csharp-online.net/CSharp_Regular_Expression_Recipes —Compiling_Regular_Expressions
copy paste the full url, looks like there is rendering problem with this link.
There are two possible modifications I can suggest for the slow .*ab
expression.
I performed my tests with this test string "1234567890 ab 1234567890" using the benchmarking feature in Regex Hero.
A . 5X faster than original
^.*ab
RegexOptions.None
or
B . 8X faster than original
.*ab
RegexOptions.RightToLeft
Sometimes experimentation pays off. The RightToLeft
option was my "ah ha!" moment. That essentially returns the same performance as your other ab.*
expression by preventing the massive backtracking from ever occurring.
I got this crazy idea that you could also store the strings in reverse order and search those strings with /ba. *
/ if the user enter /. *
ab/.
Your second expression will match 'ab' and all characters before it (except the new line). You can try searching only /ab/, get the index of the match and as a result concat the part of the string preceeding the match with match.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.