Large string regex replace performance in C#

I'm facing a problem with Regex performance in C#.

I need to run a replace on a very large string (270k characters, don't ask why...). The regex matches about 3k times.

private static Regex emptyCSSRulesetRegex = new Regex(@"[^\};\{]+\{\s*\}", RegexOptions.Compiled | RegexOptions.Singleline);

public string ReplaceEmptyCSSRulesets(string css) {
  return emptyCSSRulesetRegex.Replace(css, string.Empty);
}

The string I pass to the method looks something like this:

.selector-with-statements{border:none;}.selector-without-statements{}.etc{}

Currently the replace takes about 1500 ms in C#, but when I do exactly the same in JavaScript it only takes 100 ms.

The JavaScript code I used for timing:

console.time('reg replace');
myLargeString.replace(/[^\};\{]+\{\s*\}/g,'');
console.timeEnd('reg replace');

I also tried looping over the matches in reverse order and replacing each one in a StringBuilder. That did not help.

I'm surprised by the performance difference between C# and JavaScript in this case, and I think I'm doing something wrong, but I cannot think of what.

I can't really explain the time difference between JavaScript and C# (*). But you can try to improve the performance of your pattern, which produces a lot of backtracking:

private static Regex emptyCSSRulesetRegex = new Regex(@"(?<keep>[^};{]+)(?:{\s*}(?<keep>))?", RegexOptions.Compiled);

public string ReplaceEmptyCSSRulesets(string css) {
    return emptyCSSRulesetRegex.Replace(css, @"${keep}");
}

One of the problems with your original pattern is that when the curly brackets are not empty (or contain something other than whitespace), the regex engine goes on to test every position before the opening curly bracket, always with the same result. Example: with the string abcd{1234}, your pattern is tested starting at a, then at b, then at c, then at d, and each attempt runs up to the opening bracket before failing.

The pattern I suggest consumes abcd even if it is not followed by empty curly brackets, so the positions of b, c and d are never tested as starting points.

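abcd is captured in the group named keep, but when empty curly brackets follow, that capture is overwritten by an empty capture from the second group of the same name. Replacing each match with ${keep} therefore puts back the text that was not followed by empty brackets and drops the text that was.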
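To make the overwritten capture visible, here is a minimal sketch that runs the suggested pattern over the sample string from the question and prints what each match consumed and what keep holds afterwards. It assumes C# 9 or later (top-level statements); the variable names emptyRuleset, sample and cleaned are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied from the answer; both groups share the name "keep".
var emptyRuleset = new Regex(@"(?<keep>[^};{]+)(?:{\s*}(?<keep>))?", RegexOptions.Compiled);

// Sample input taken from the question.
string sample = ".selector-with-statements{border:none;}.selector-without-statements{}.etc{}";

// Show what each match consumed and what "keep" holds afterwards.
foreach (Match m in emptyRuleset.Matches(sample))
{
    // Group.Value is the last capture made under that name.
    string keep = m.Groups["keep"].Value;
    Console.WriteLine($"matched '{m.Value}', keep = '{keep}'");
}

// Text outside the matches ("{", ";", "}") is copied through unchanged.
string cleaned = emptyRuleset.Replace(sample, "${keep}");
Console.WriteLine(cleaned); // .selector-with-statements{border:none;}
```

For the two empty rulesets the match includes the brackets and keep is empty, so the whole match disappears; for everything else keep equals the matched text and the replacement is a no-op.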

You can get an idea of the number of steps needed for the two patterns (check the debugger):

original pattern

new pattern
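If you would rather measure locally than count steps in a debugger, a rough sketch like the following times both patterns with Stopwatch and checks that they produce the same output. The input is synthetic (an assumption, not your real stylesheet), BuildCss and Time are hypothetical helpers, and the timings will vary with machine, .NET version and warm-up, so treat the numbers only as a relative indication. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

var original = new Regex(@"[^\};\{]+\{\s*\}", RegexOptions.Compiled);
var rewritten = new Regex(@"(?<keep>[^};{]+)(?:{\s*}(?<keep>))?", RegexOptions.Compiled);

// Hypothetical helper: builds a synthetic stylesheet of alternating
// non-empty and empty rulesets (about 70 characters per pair).
string BuildCss(int rulesetPairs)
{
    var sb = new StringBuilder();
    for (int i = 0; i < rulesetPairs; i++)
    {
        sb.Append(".selector-with-statements{border:none;}.selector-without-statements{}");
    }
    return sb.ToString();
}

// Hypothetical helper: runs a function once and reports elapsed milliseconds.
// A single run includes warm-up cost, so this is only a coarse measurement.
(string Result, long Ms) Time(Func<string> run)
{
    var sw = Stopwatch.StartNew();
    string result = run();
    sw.Stop();
    return (result, sw.ElapsedMilliseconds);
}

string css = BuildCss(3000);

var (originalOutput, originalMs) = Time(() => original.Replace(css, string.Empty));
var (rewrittenOutput, rewrittenMs) = Time(() => rewritten.Replace(css, "${keep}"));

Console.WriteLine($"original:  {originalMs} ms");
Console.WriteLine($"rewritten: {rewrittenMs} ms");
Console.WriteLine($"same output: {originalOutput == rewrittenOutput}");
```

Comparing the outputs before swapping patterns is the important part: a faster regex is only useful if it removes exactly the same rulesets.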

Note: your original pattern can be improved by enclosing [^}{;]+ in an atomic group. This change roughly halves the number of steps needed compared to the original, but even then the number of steps stays high, because every position before the opening curly bracket is still tested.

(*) It's possible that the JavaScript regex engine is smart enough not to retry all these positions, but that is only an assumption.
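For reference, the atomic-group variant could look like the sketch below. The (?>...) syntax is the standard .NET atomic group; the rest of the pattern is your original one, and the variable names are only illustrative. It assumes C# 9 or later (top-level statements) and simply checks that both versions clean the sample string identically.

```csharp
using System;
using System.Text.RegularExpressions;

// Original pattern from the question.
var original = new Regex(@"[^\};\{]+\{\s*\}", RegexOptions.Compiled);

// Same pattern with the character class wrapped in an atomic group:
// once the run of characters is matched, the engine does not give
// characters back and retry shorter runs from the same start position.
var atomic = new Regex(@"(?>[^};{]+)\{\s*\}", RegexOptions.Compiled);

// Sample input taken from the question.
string sample = ".selector-with-statements{border:none;}.selector-without-statements{}.etc{}";

string viaOriginal = original.Replace(sample, string.Empty);
string viaAtomic = atomic.Replace(sample, string.Empty);

Console.WriteLine(viaAtomic);                // .selector-with-statements{border:none;}
Console.WriteLine(viaOriginal == viaAtomic); // True
```

The atomic group only removes the backtracking inside one attempt; the engine still starts a new attempt at each following character, which is why the consume-and-keep pattern above saves more work.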
