What regex should be used to match a multiline log message?

I'm writing a batch file that processes a log file of my application.

The log file may contain messages whose first line matches the regex ^.{24}\[ERROR, followed by some consecutive lines that I need to find. The end of a log message is marked by the next match of the regex ^.{24}\[[A-Z].

Currently I'm using the regex (?m)^.{24}\[ERROR(.*\r?\n?.)*?^.{24}\[[A-Z] to find such messages, but the performance is very poor: it already runs for several minutes on a log file of a few MB.

The complete batch file I'm using is:

@Echo off

powershell -Command "& {[System.Text.RegularExpressions.RegEx]::Matches([System.IO.File]::ReadAllText('application.log'), '(?m)^.{24}\[ERROR(.*\r?\n?.)*?^.{24}\[[A-Z]') | Set-Content result.txt}"

What regex should I use to match the log messages as described above?

The problem is that your regex contains the section (.*\r?\n?.)*?, a quantified group whose contents are themselves optional or variable-length (.*, \r? and \n? can each match an empty string). With such nested quantifiers, the regex engine has to try a huge number of combinations before it can conclude that there is no match, which leads to catastrophic backtracking or timeouts.
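To get a feel for the effect, here is a minimal sketch in Python (not the PowerShell/.NET setup from the question; the sample input, the 24-character prefix and the helper name time_failed_search are made up for illustration). It runs the original pattern against a single ERROR line that has no terminating header, so the search has to fail, and times that failure for a few line lengths. Absolute numbers depend on the machine and the engine, so none are quoted here.

```python
import re
import time

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(?m)^.{24}\[ERROR(.*\r?\n?.)*?^.{24}\[[A-Z]")

# Hypothetical 24-character timestamp prefix, only for this demo.
TS = "2016-05-04 10:15:30,123 "


def time_failed_search(n):
    """Search an ERROR line followed by n filler characters and no
    terminating header; return (match_found, elapsed_seconds)."""
    text = TS + "[ERROR" + "x" * n
    start = time.perf_counter()
    found = ORIGINAL.search(text) is not None
    return found, time.perf_counter() - start


# Lengths are kept small on purpose: the work can grow very quickly.
results = {n: time_failed_search(n) for n in (8, 12, 16)}
for n, (found, seconds) in results.items():
    print(f"n={n:2d} found={found} seconds={seconds:.6f}")
```

If the time rises steeply as n grows, that is the backtracking described above: the engine keeps re-splitting the same text between .* and the trailing . inside the repeated group before giving up.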

One solution is simply to use a lazy dot-matching pattern with the DOTALL modifier (the s flag, which lets . match newlines too):

(?ms)^.{24}\[ERROR(.*?)^.{24}\[[A-Z]

The .NET regex engine handles this subpattern much better than PCRE, Python re, or JavaScript.
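The lazy-dot pattern can be tried on a small sample. This is a sketch in Python rather than .NET; the log lines, the 24-character timestamp prefix and the helper find_error_bodies are invented for illustration, not taken from the real application.log.

```python
import re

# Hypothetical prefix: exactly 24 characters before the "[LEVEL]" tag.
TS = "2016-05-04 10:15:30,123 "

# Made-up sample log: one multiline ERROR message between two INFO lines.
SAMPLE_LOG = (
    TS + "[INFO] starting\n"
    + TS + "[ERROR] something failed\n"
    + "  at Foo.bar()\n"
    + "  at Baz.qux()\n"
    + TS + "[INFO] recovered\n"
)

# (?m): ^ matches at every line start; (?s): . also matches newlines.
LAZY = re.compile(r"(?ms)^.{24}\[ERROR(.*?)^.{24}\[[A-Z]")


def find_error_bodies(text, pattern=LAZY):
    """Return group 1 (the text after '[ERROR') of every match."""
    return [m.group(1) for m in pattern.finditer(text)]


lazy_bodies = find_error_bodies(SAMPLE_LOG)
for body in lazy_bodies:
    print(repr(body))
```

Group 1 here runs from just after [ERROR up to the start of the next header line, including the continuation lines in between.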

However, lazy matching still costs performance, and it is best practice to unroll it. I suggest:

(?m)^.{24}\[ERROR(.*(?:\n(?!.{24}\[[A-Z]).*)*)\n.{24}\[[A-Z]

Note that these two patterns are equivalent in what they match but differ in how they match. The first tries the trailing part of the pattern at each position and, on failure, expands the lazy match by one character. The unrolled pattern instead grabs the rest of each line in one step, then consumes each newline that is not followed by 24 non-newline characters, a [ and an uppercase ASCII letter, which is faster.
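As a quick check of that claim, the sketch below runs the two patterns above side by side on a small made-up log. It is written in Python rather than .NET, and the sample lines and the 24-character prefix are assumptions for illustration only.

```python
import re

# Hypothetical prefix: exactly 24 characters before the "[LEVEL]" tag.
TS = "2016-05-04 10:15:30,123 "

# Made-up sample log with two multiline ERROR messages.
SAMPLE_LOG = (
    TS + "[INFO] starting\n"
    + TS + "[ERROR] something failed\n"
    + "  at Foo.bar()\n"
    + "  at Baz.qux()\n"
    + TS + "[INFO] recovered\n"
    + TS + "[ERROR] second failure\n"
    + "  caused by: timeout\n"
    + TS + "[WARN] shutting down\n"
)

# Lazy-dot version: expands one character at a time.
LAZY = re.compile(r"(?ms)^.{24}\[ERROR(.*?)^.{24}\[[A-Z]")

# Unrolled version: takes whole lines, checking each newline with a lookahead.
UNROLLED = re.compile(
    r"(?m)^.{24}\[ERROR(.*(?:\n(?!.{24}\[[A-Z]).*)*)\n.{24}\[[A-Z]"
)

lazy_matches = [m.group(0) for m in LAZY.finditer(SAMPLE_LOG)]
unrolled_matches = [m.group(0) for m in UNROLLED.finditer(SAMPLE_LOG)]

lazy_bodies = [m.group(1) for m in LAZY.finditer(SAMPLE_LOG)]
unrolled_bodies = [m.group(1) for m in UNROLLED.finditer(SAMPLE_LOG)]

print(lazy_matches == unrolled_matches)
print([repr(b) for b in unrolled_bodies])
```

On this sample the overall matches are identical. The only difference is in group 1: the lazy version includes the final newline before the next header, while the unrolled version stops just before it.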

[Image: RegexHero.net test]
