Improving Performance of a Regular Expression on a Large String

I am currently using a regular expression in my code in order to grab a large string from a rich text document. The regular expression finds any embedded images and parses them into a byte array that I can convert into a LinkedResource. I need to convert an RTF from a RichTextBox in my application into a valid HTML document, and then into a MIME-encoded message that can be automatically sent.

The problem with the regular expression is that the string section of the image is very large, so I suspect the regular expression is trying many possible matches within the entire string when, in reality, I only need to look at the beginning and end of the section. The regular expression below is contained within a larger regular expression as an optional clause, such as someRegexStringA + "|" + imageRegexString + "|" + "someRegexStringB".

What can I do to ensure that there is less checking within large strings so that my application doesn't appear to freeze when parsing large amounts of image data?

// The Regex itself
private static string imageRegexString = @"(?<imageCheck>\\pict)"                  // Look for the opening image tag
                                       + @"(?:\\picwgoal(?<widthNumber>[0-9]+))"   // Read the size of the image's width
                                       + @"(?:\\pichgoal(?<heightNumber>[0-9]+))"  // Read the size of the image's height
                                       + @"(?:\\pngblip(\r|\n))"                   // The image is the newline after this portion of the opening tag and information
                                       + @"(?<imageData>(.|\r|\n)+?)"              // Read the bitmap
                                       + @"(?:}+)";                                // Look for closing braces

// The expression is compiled so it doesn't take as much time during runtime
private static Regex myRegularExpression = new Regex(imageRegexString, RegexOptions.Compiled);

// Iterate through each image in the document
foreach(Match image in myRegularExpression.Matches(myDocument))
{
    // Read the image height and width
    int imageWidth = int.Parse(image.Groups["widthNumber"].Value);
    int imageHeight = int.Parse(image.Groups["heightNumber"].Value);

    // Process the image
    ProcessImageData(image.Groups["imageData"].Value);
}

First, I vaguely remember having an InfoPath form with a Rich Text Editor that could be exported to HTML, so you may want to look at that (though we still had to attach the images separately).

As for your pattern: it is pretty straightforward, there is only one suspicious line:

(?<imageData>(.|\r|\n)+?)

This has several potential problems:

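  • +? is lazy: after every character it consumes, the engine tries the rest of the pattern, fails, and comes back for one more character. For long strings that is a lot of backtracking, which may be inefficient.
  • .|\r|\n also seems pretty inefficient. You can use the RegexOptions.Singleline modifier (or inline (?s:...) ) so that . matches every character.
    By the way, . already matches \r (without Singleline it only excludes \n).
  • (.|\r|\n) - This is a capturing group, unlike the (?:...) groups you use elsewhere. I suspect this is killing you: in .NET, every repetition of the group is saved on a stack as a Capture, so you get one Capture per character of image data. You don't want that.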
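To make the last point concrete, here is a small sketch (not the asker's real code) that counts the Capture objects the inner group accumulates on a tiny input, and shows an inline (?s:...) variant that returns the same text without the inner capturing group. The class and method names are made up for illustration, and the trailing } in each pattern stands in for the closing-brace part of the original expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // The unnamed group (.|\r|\n) is group 1; .NET numbers unnamed groups
    // before named ones. Each repetition of the group adds one Capture.
    public static int CountInnerCaptures(string input)
    {
        Match m = Regex.Match(input, @"(?<imageData>(.|\r|\n)+?)}");
        return m.Groups[1].Captures.Count;
    }

    // Same match, but (?s:...) lets . match \n, so no inner group is needed.
    public static string ExtractWithSingleline(string input)
    {
        Match m = Regex.Match(input, @"(?<imageData>(?s:.+?))}");
        return m.Groups["imageData"].Value;
    }

    public static void Main()
    {
        string sample = "89504e47\n0d0a1a0a}";
        Console.WriteLine(CountInnerCaptures(sample));            // one Capture per character
        Console.WriteLine(ExtractWithSingleline(sample).Length);  // same text, no per-character captures
    }
}
```

On a real image the count grows with the number of characters of image data, which is where the memory and time go.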

I'd suggest this instead, with an atomic (non-backtracking) group (?>...), just to be safe:

(?<imageData>(?>[^}]+))
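Dropped into the expression from the question, that might look like the sketch below. It assumes the image data never contains a } character, which should hold for hex-encoded \pngblip data. It also makes the newline group after \pngblip non-capturing. The class name and the sample RTF fragment are invented for the demo.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageRegexSketch
{
    public static readonly Regex ImageRegex = new Regex(
          @"(?<imageCheck>\\pict)"                    // opening image tag
        + @"(?:\\picwgoal(?<widthNumber>[0-9]+))"     // image width
        + @"(?:\\pichgoal(?<heightNumber>[0-9]+))"    // image height
        + @"(?:\\pngblip(?:\r|\n))"                   // newline before the data, not captured
        + @"(?<imageData>(?>[^}]+))"                  // everything up to the first }, no backtracking
        + @"(?:}+)",                                  // closing braces
        RegexOptions.Compiled);

    public static void Main()
    {
        // Invented fragment: a tiny stand-in for a real embedded image.
        string sample = "{\\pict\\picwgoal100\\pichgoal50\\pngblip\n89504e47\n0d0a1a0a}";

        Match image = ImageRegex.Match(sample);
        Console.WriteLine(image.Success);
        Console.WriteLine(image.Groups["widthNumber"].Value);
        Console.WriteLine(image.Groups["heightNumber"].Value);
        Console.WriteLine(image.Groups["imageData"].Value.Length);
    }
}
```

Because [^}]+ is greedy and can only stop at a }, the engine scans the image data once; the atomic group just stops it from giving characters back if a later part of the larger pattern fails.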

Of course, it is also possible the pattern is slow because of the other alternatives: someRegexStringA or someRegexStringB.
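One way to check is to time each alternative on its own against the same document. The helper below is a rough sketch; the Measure name, the placeholder patterns, and the sample input are all assumptions, since the real someRegexStringA and someRegexStringB are not shown in the question.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class AlternationTiming
{
    // Runs one pattern over the input and reports match count and elapsed time.
    public static (int Count, TimeSpan Elapsed) Measure(string pattern, string input)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);
        var watch = Stopwatch.StartNew();
        int count = regex.Matches(input).Count; // Count forces every match to be evaluated
        watch.Stop();
        return (count, watch.Elapsed);
    }

    public static void Main()
    {
        // Placeholder input and patterns; substitute the real document and alternatives.
        string document = "\\par " + new string('a', 100000) + "\\tab ";
        string[] alternatives = { @"\\par", @"\\tab" };

        foreach (string pattern in alternatives)
        {
            var result = Measure(pattern, document);
            Console.WriteLine($"{pattern}: {result.Count} matches in {result.Elapsed.TotalMilliseconds} ms");
        }
    }
}
```

Timing the alternatives separately, then the combined pattern, should show whether the image clause or one of the others dominates.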
