
Why is my regex so much slower compiled than interpreted?

I have a large and complex C# regex that runs OK when interpreted, but is a bit slow. I'm trying to speed it up by setting RegexOptions.Compiled, but then the first use takes about 30 seconds, and every use after that is instant. I'm trying to avoid that delay by compiling the regex to an assembly in advance, so my app can be as fast as possible.

My problem is when the compiling delay takes place, whether it's compiled in the app:

Regex myComplexRegex = new Regex(regexText, RegexOptions.Compiled);
MatchCollection matches = myComplexRegex.Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{

} 

or using Regex.CompileToAssembly in advance:

MatchCollection matches = new CompiledAssembly.ComplexRegex().Matches(searchText);
foreach (Match match in matches) // <--- when the one-time long delay kicks in
{

} 

This makes compiling to an assembly basically useless, as I still get the delay on the first foreach call. What I want is for all of the compilation delay to happen at build time (at the Regex.CompileToAssembly call), not at runtime. Where am I going wrong?

(The code I'm using to compile to an assembly is similar to http://www.dijksterhuis.org/regular-expressions-advanced/, if that's relevant.)

Edit:

Should I be using new when calling the compiled assembly, as in new CompiledAssembly.ComplexRegex().Matches(searchText);? It gives an "object reference required" error without it, though.

Update 2

Thanks for the answers/comments. The regex that I'm using is pretty long but basically straightforward, a list of thousands of words each separated by |. I can't see it'd be a backtracking problem really. The subject string can be just one letter long, and it can still cause the compilation delay. For a RegexOptions.Compiled regex, it'll take over 10 seconds to execute when the regex contains 5000 words. For comparison, the non-compiled version of the regex can take 30,000+ words and still execute just about instantly.

After doing a lot of testing on this, what I think I've found out is:

  • Don't use RegexOptions.Compiled when your regex has many alternatives - it can be extremely slow to compile.
  • .NET will use lazy evaluation for regexes when possible, and as far as I can see this extends (at least to some extent) to regex compilation too. A regex will be fully compiled only when it has to be, and there seems to be no way of forcing compilation ahead of time.
  • Regex.CompileToAssembly would be much more useful if the regexes could be forced to be fully compiled. As it is, it seems to be verging on pointless.
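The observations above can be checked with a rough timing harness. This is only a sketch under stated assumptions: the word list is synthetic ("word0", "word1", ...), the names AlternationTiming, BuildPattern and TimeFirstUse are made up for illustration, and the numbers will vary a lot by machine and .NET version. It times construction plus the first full enumeration, since Matches() returns a lazily populated MatchCollection and the work only happens once the collection is enumerated or counted.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationTiming
{
    // Builds a pattern such as \b(?:word0|word1|...)\b from synthetic words.
    public static string BuildPattern(int wordCount)
    {
        var words = Enumerable.Range(0, wordCount).Select(i => "word" + i);
        return @"\b(?:" + string.Join("|", words) + @")\b";
    }

    // Times construction plus the first complete enumeration of the matches.
    public static (int Count, TimeSpan Elapsed) TimeFirstUse(
        string pattern, RegexOptions options, string input)
    {
        var sw = Stopwatch.StartNew();
        var regex = new Regex(pattern, options);
        int count = regex.Matches(input).Count; // Count forces the lazy collection to run
        sw.Stop();
        return (count, sw.Elapsed);
    }

    public static void Main()
    {
        string pattern = BuildPattern(2000);
        string input = "a word7 b word1999 c";

        foreach (var options in new[] { RegexOptions.None, RegexOptions.Compiled })
        {
            var (count, elapsed) = TimeFirstUse(pattern, options, input);
            Console.WriteLine(
                $"{options}: {count} matches, first use took {elapsed.TotalMilliseconds:F0} ms");
        }
    }
}
```

Raising the argument to BuildPattern should make the gap between the two options more visible.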

Please correct me if I'm wrong or missing something!

When using RegexOptions.Compiled, you should make sure to re-use the Regex object. It doesn't look like you are doing this.

RegexOptions.Compiled is a trade-off. The initial construction of the Regex will be slower, because code is compiled on-the-fly, but each match should be faster. If your regular expression changes at run-time, there will probably be no benefit from using RegexOptions.Compiled, although it might depend on the actual expression involved.
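A minimal sketch of what re-using the object can look like: the Regex lives in a static readonly field, so it is constructed once per process and every call shares it. The class name WordMatcher and the three-word pattern are placeholders, not the asker's real regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Constructed once; every call below reuses this same compiled instance.
    // The pattern is a stand-in for the real word list.
    private static readonly Regex Words =
        new Regex(@"\b(?:alpha|beta|gamma)\b", RegexOptions.Compiled);

    public static int CountMatches(string text)
    {
        return Words.Matches(text).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountMatches("alpha and beta")); // prints 2
        Console.WriteLine(CountMatches("nothing here"));   // prints 0
    }
}
```

Regex instances are immutable and safe to share across threads, so a static field is a reasonable home for one.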

Update, per the comments

If your actual code looks like what you have posted, you are not taking any advantage of CompileToAssembly, as you are creating new, on-the-fly compiled instances of Regex each time that piece of code runs. To take advantage of CompileToAssembly, you need to compile the Regex first, then take the generated assembly and reference it in your project. You should then instantiate the strongly-typed Regex types it contains.

In the example you link to, he has a regular expression named FindTCPIP, which gets compiled into a type named FindTCPIP. When this needs to be used, one should create a new instance of this specific type, such as:

TheRegularExpressions.FindTCPIP MatchTCP = new TheRegularExpressions.FindTCPIP();

To force initialization you can call Match against an empty string. On top of that, you can use ngen to create a native image of the expression to speed up the process even further. But probably most importantly, it's essentially just as fast to throw 30,000 string.IndexOf, string.Contains or Regex.Match calls against a given text as it is to compile one ginormous expression and match it against a single text. The small searches require a lot less compilation, jitting etc., because the state machine is a lot simpler.
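As an illustration of the many-small-searches idea, here is a sketch that loops over the word list with string.IndexOf. The names SubstringScan and FindWords are hypothetical. Note that this is plain substring search, so it is not equivalent to a regex with word boundaries: "cat" would be reported inside "concatenate".

```csharp
using System;
using System.Collections.Generic;

public static class SubstringScan
{
    // Returns every word from the list that occurs somewhere in the text,
    // in the order the words appear in the list.
    public static List<string> FindWords(string text, IEnumerable<string> words)
    {
        var found = new List<string>();
        foreach (string word in words)
        {
            // Ordinal comparison avoids culture-sensitive matching overhead.
            if (text.IndexOf(word, StringComparison.Ordinal) >= 0)
            {
                found.Add(word);
            }
        }
        return found;
    }

    public static void Main()
    {
        var hits = FindWords("the quick brown fox", new[] { "quick", "slow", "fox" });
        Console.WriteLine(string.Join(", ", hits)); // prints "quick, fox"
    }
}
```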

Another thing you could consider is to tokenize the text and intersect it with the list of words you're after.
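A sketch of the tokenize-and-intersect approach, assuming the list contains single words only and that splitting on whitespace and common punctuation is good enough for the text. TokenIntersect, FindWords and the separator set are illustrative choices, not anything from the question.

```csharp
using System;
using System.Collections.Generic;

public static class TokenIntersect
{
    // Assumed separators; adjust to whatever counts as a word break in your text.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Returns the tokens of the text that also appear in the word list,
    // compared case-insensitively.
    public static HashSet<string> FindWords(string text, IEnumerable<string> wordList)
    {
        var tokens = new HashSet<string>(
            text.Split(Separators, StringSplitOptions.RemoveEmptyEntries),
            StringComparer.OrdinalIgnoreCase);

        tokens.IntersectWith(wordList); // keeps only tokens present in the list
        return tokens;
    }

    public static void Main()
    {
        var hits = FindWords("The fox, the dog.", new[] { "fox", "cat", "DOG" });
        Console.WriteLine(hits.Count); // prints 2
    }
}
```

The cost here depends on the length of the text and the list rather than on any compiled state machine, which is why it sidesteps the compilation delay entirely.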

Try using Regex.CompileToAssembly, then link to the assembly so that you can construct the Regex objects. RegexOptions.Compiled is a runtime option, so the regex would still get re-compiled every time you run the application.

A very probable cause of a slow regex is that it backtracks too much. This is solved by rewriting the regex so that backtracking is eliminated or kept to a minimum.

Can you post the regex and a sample input where it is slow?

Personally, I have never needed to compile a regex, but it would be interesting to see some actual performance numbers if you have taken this path.

After extensive testing of my own, I can confirm the suspicions of mikel are essentially correct. Even when using Regex.CompileToAssembly() and statically linking the resultant DLL into the application, there is a substantial initial delay on the first practical matching call (at least for patterns involving many ORed alternatives). Moreover, the initial delay on the first matching call depends on what text you match against. For example, matching against an empty string or some other arbitrary text will cause less of an initial delay, but you will still get additional delays later on when actual positive matches are first encountered in new text. The only way to fully guarantee future matches will all be lightning fast is to initially force a positive match at runtime with text that does indeed match. Of course this gives the maximum initial delay possible (in exchange for all future matches being lightning fast).
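A sketch of that warm-up, assuming you can supply a sample string that is known to match. RegexWarmUp and WarmUp are hypothetical names and the three-word pattern is a placeholder; with a pattern this small the first-use cost is tiny, so the timings only become interesting with a large alternation.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexWarmUp
{
    // Runs one real match so any first-use cost is paid here, e.g. at startup.
    // Returns true only if the sample actually matched, which is the case
    // the answer above says is needed for later matches to be fast.
    public static bool WarmUp(Regex regex, string sampleThatMatches)
    {
        return regex.Match(sampleThatMatches).Success;
    }

    public static void Main()
    {
        var regex = new Regex(@"\b(?:alpha|beta|gamma)\b", RegexOptions.Compiled);

        var sw = Stopwatch.StartNew();
        bool warmed = WarmUp(regex, "beta");
        Console.WriteLine($"warm-up matched: {warmed}, took {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        bool later = regex.IsMatch("some gamma text");
        Console.WriteLine($"later match: {later}, took {sw.ElapsedMilliseconds} ms");
    }
}
```

Checking the return value is worthwhile: if the sample stops matching after the pattern changes, the warm-up silently degrades to the weaker non-matching case.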

I dug deeper in order to understand this better. For each regex compiled into the assembly, a triplet of classes is written with the following naming template: { RegexName, RegexNameFactoryN, RegexNameRunnerN }. An instance of the RegexNameFactoryN class is created in the RegexName ctor, but an instance of the RegexNameRunnerN class is not. See the private factory and runnerref fields in the base Regex class. runnerref is a cached weak reference to a RegexNameRunnerN object. After various experiments with reflection, I can confirm that the ctors of all 3 of these compiled classes are fast and the RegexNameFactoryN.CreateInstance() function (which returns the initial RegexNameRunnerN reference) is also fast. The initial delay occurs somewhere within RegexRunner.Scan(), or its call tree, and is thus likely outside the reach of the compiled MSIL generated by Regex.CompileToAssembly(), since this call tree involves numerous non-abstract functions. This is very unfortunate and means the performance benefits of C# regex compilation only extend so far: at runtime there will always be some substantial delay the first time a positive match is encountered (at least for this class of many-ORed patterns).

I theorize that this has to do with how the Nondeterministic Finite Automaton (NFA) engine performs some of its own internal caching/instantiations at runtime as the pattern is processed.

jessehouwing's suggestion of ngen is interesting and could possibly improve performance. I have not tested it.
