
Regex to analyse log file (multiline)

I need to analyse the result files created by a third-party tool, so I tried to write a small C# tool that loads the file content and runs a regular expression against it. The content looks like this:

[1] <Download> 13.01.2016 15:16:47
[ 

Name: foobar.tif

Status:              0 (ok)
]

[2] <Download> 13.01.2016 15:17:50
[
Name: foobar2.tif

Error: 7100: No file found!

]

[3] <Upload> 13.01.2016 15:17:53
[

Name: Company.tif

Size: 3476
Error: 7200: Unauthorized!

]

...

I tried to create a regex pattern that matches this type of content. In this example there should be 3 matches, each with the 4 groups I need to check: the index (1, 2 or 3), the task (Download/Upload), the name of the file, and the value of the status or error. All other information, such as the time stamp or the optional "Size" attribute, can be ignored.

This is what I have come up with:

(?<Index>\[[0-9]+\]) (?<TaskName><[\w]+>)

But right now this only matches the index and the task name, and I am not sure how to go on to get the "Name" and "Status" or "Error" values, as they are on other lines.

EDIT:

Okay, I tried to work through your responses and this is what I have come up with so far:

\[(?<Index>[0-9]+?)\]\s<(?<Task>\w+?)>.+\n+\[[\s.]+Name\:\s(?<Name>.+)(?<Content>[\s\S]+?)\]

Now I am getting the index, the task name and the file name. Is the approach OK so far? Next I will try to get the error/status into a group as well.

Regex patterns match across multiple lines by default. That behavior can be overridden, but if matching across lines is what you want, there is nothing special you need to do. However, some constructs treat new-line characters specially. Most notably, the . wildcard matches every character except a new-line (in .NET, unless RegexOptions.Singleline is set). Therefore, if you want to match any character including new-lines, you can't just use .* because that will only match up to the end of the current line.
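To make the difference concrete, here is a minimal sketch that runs the same pattern with and without RegexOptions.Singleline. The class name DotDemo, the helper MatchFrom, and the two-line sample string are made up for illustration; only Regex.Match and RegexOptions come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotDemo
{
    // Returns the text matched by "Name:" followed by ".*" under the given options.
    public static string MatchFrom(string input, RegexOptions options)
    {
        return Regex.Match(input, @"Name:.*", options).Value;
    }

    public static void Main()
    {
        // Hypothetical two-line excerpt in the style of the log file.
        string block = "Name: foobar.tif\nStatus: 0 (ok)";

        // Default options: '.' stops at the line break, so only the first line is matched.
        Console.WriteLine(MatchFrom(block, RegexOptions.None));

        // Singleline: '.' also matches '\n', so the match runs to the end of the input.
        Console.WriteLine(MatchFrom(block, RegexOptions.Singleline));
    }
}
```

The first call prints only "Name: foobar.tif"; the second prints both lines.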

You could use (.|\n)* but it is preferable, when possible, to use a negated character class. For instance, if you need to get the values inside the brackets in the following example:

 [Value One] some
 random

 data
 [Value Two]

You could use (\[(?<value>[^]]*)\][^[]*)*. Notice that [^]]* is used as the pattern for the value inside the brackets and [^[]* is used as the pattern for everything outside of the brackets. A negated character class matches any single character that isn't in the list. For instance, [^abc] will match any character that is not a, b, or c. So [^[] just means any character that is not an open square bracket. Since new-line characters are not square brackets, it will match a new-line character as well as any other kind of character.
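As a rough illustration of the negated-class idea, the sketch below pulls the bracketed values out of the sample text. It is a simplified variant, not the exact pattern above: it iterates over individual matches with Regex.Matches instead of repeating one outer group, and it writes the class as [^\]]* with an explicit escape. The names BracketDemo and BracketValues are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // Collects the text between each pair of square brackets.
    public static List<string> BracketValues(string text)
    {
        var values = new List<string>();
        // [^\]]* matches any run of characters except ']', line breaks included,
        // so no lazy quantifier and no Singleline option are needed.
        foreach (Match m in Regex.Matches(text, @"\[(?<value>[^\]]*)\]"))
        {
            values.Add(m.Groups["value"].Value);
        }
        return values;
    }

    public static void Main()
    {
        string text = "[Value One] some\nrandom\n\ndata\n[Value Two]";
        foreach (string value in BracketValues(text))
        {
            Console.WriteLine(value); // prints "Value One", then "Value Two"
        }
    }
}
```

Because the class excludes only the closing bracket, a value that itself spans several lines would still be captured whole.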

The reason I said that a negated character class is preferable to something like (.|\n)* is that, in order to use (.|\n)*, you'd have to make the * repetition lazy (e.g. (.|\n)*?\[ ). Lazy (i.e. non-greedy) repetitions can cause a lot of backtracking, which harms performance. For that reason, it's best to use negated character classes in place of lazy repetitions whenever possible.

You can do all the work in a single regex, but I think it would be very hard to write and maintain. May I suggest splitting it into two regexes? You can use this one to get the index, the Download/Upload field and the bracketed body in separate groups:

\[([0-9]+)\]\s<(\w+)>.+\n\[([\s\S]+?)\]

Then take the group containing the body and apply this regex to it:

Name:\s(.+?)\n[\s\S]*?(Error:|Status:)\s+?(.+?)$

Before you use the regex above, be sure to call Trim() on the string that contains the body. The pattern ends with $, so trailing blank lines before the closing bracket would otherwise prevent a match.

Here is some C# code to use the regex:

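// Requires: using System; using System.Text.RegularExpressions;
// logMessage holds the complete content of the result file.
// Group 1 = index, group 2 = task (Download/Upload), group 3 = bracketed body.
Regex entryRegex = new Regex(@"\[([0-9]+)\]\s<(\w+)>.+\n\[([\s\S]+?)\]");
// Group 1 = file name, group 3 = value of the Status or Error line.
Regex detailRegex = new Regex(@"Name:\s(.+?)\n[\s\S]*?(Error:|Status:)\s+?(.+?)$");

foreach (Match match in entryRegex.Matches(logMessage))
{
    string indexField = match.Groups[1].Value;
    string taskField = match.Groups[2].Value;
    string message = match.Groups[3].Value.Trim();

    // Skip bodies that are empty or lack a Name line followed by Status/Error.
    Match messageMatch = detailRegex.Match(message);
    if (!messageMatch.Success)
    {
        continue;
    }

    string name = messageMatch.Groups[1].Value.Trim();
    string statusError = messageMatch.Groups[3].Value.Trim();
    Console.WriteLine($"{indexField} {taskField} {name} {statusError}");
}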

You could come up with something like the following regex in free-spacing mode:

~
\[(?<index>\d+)\]\s*
<(?<task>\w+)>(?s).*?
\[(?s).*?
Name:\s*(?<filename>[^\n]+)(?s).*?
(?:Status|Error):\s*(?<status>\d+)(?s).*?
\]
~
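The pattern above could be used from C# roughly as follows. This is a sketch under a few assumptions: the ~ lines are treated as pattern delimiters and dropped, free-spacing mode is enabled with RegexOptions.IgnorePatternWhitespace, and the repeated (?s) switches are replaced by RegexOptions.Singleline. The names LogParser, EntryPattern and Parse, and the pipe-separated output format, are invented for the example. Like the pattern above, it captures only the leading digits of the status or error value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogParser
{
    // Free-spacing pattern: whitespace is ignored and '#' starts a comment.
    private static readonly Regex EntryPattern = new Regex(@"
        \[(?<index>\d+)\]\s*                     # entry index
        <(?<task>\w+)>.*?                        # task name, then skip the time stamp
        \[.*?                                    # opening bracket of the body
        Name:\s*(?<filename>[^\n]+).*?           # file name up to the end of its line
        (?:Status|Error):\s*(?<status>\d+).*?    # numeric status or error code
        \]                                       # closing bracket of the body
        ", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    // Returns one "index|task|filename|status" row per log entry.
    public static List<string> Parse(string log)
    {
        var rows = new List<string>();
        foreach (Match m in EntryPattern.Matches(log))
        {
            rows.Add(string.Join("|",
                m.Groups["index"].Value,
                m.Groups["task"].Value,
                m.Groups["filename"].Value.Trim(), // drops a trailing '\r' from Windows line endings
                m.Groups["status"].Value));
        }
        return rows;
    }

    public static void Main()
    {
        string log =
            "[1] <Download> 13.01.2016 15:16:47\n[ \n\nName: foobar.tif\n\nStatus:              0 (ok)\n]\n\n" +
            "[2] <Download> 13.01.2016 15:17:50\n[\nName: foobar2.tif\n\nError: 7100: No file found!\n\n]\n";

        foreach (string row in Parse(log))
        {
            Console.WriteLine(row);
        }
        // prints:
        // 1|Download|foobar.tif|0
        // 2|Download|foobar2.tif|7100
    }
}
```

One caveat of the lazy .*? segments: if an entry has neither a Status nor an Error line, the match can run on into the following entry, so the input format should be checked before relying on this.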
