Regex to analyse log file (multiline)

I need to analyse the result files created by a third-party tool. To do that, I am trying to write a small C# tool that loads the file content and runs a regular expression against it. The content looks like this:

[1] <Download> 13.01.2016 15:16:47
[ 

Name: foobar.tif

Status:              0 (ok)
]

[2] <Download> 13.01.2016 15:17:50
[
Name: foobar2.tif

Error: 7100: No file found!

]

[3] <Upload> 13.01.2016 15:17:53
[

Name: Company.tif

Size: 3476
Error: 7200: Unauthorized!

]

...

I tried to create a regex pattern that matches this type of content. In this example there should be 3 matches, each with the 4 groups I need to check: the index (1, 2 or 3), the task (Download/Upload), the name of the file, and the value of the status or error. All other information, such as the time stamp or the optional "Size" attribute, can be ignored.

This is what I have come up with:

(?<Index>\[[0-9]+\]) (?<TaskName><[\w]+>)

But right now this only matches the index and the task name, and I am not sure how to go on to get the "Name" and "Status" or "Error" values, as they are on other lines.

EDIT:

Okay, I tried to work through your responses and this is what I have come up with so far:

\[(?<Index>[0-9]+?)\]\s<(?<Task>\w+?)>.+\n+\[[\s.]+Name\:\s(?<Name>.+)(?<Content>[\s\S]+?)\]

Now I am getting the index, the task name and the name. Is the approach OK so far? Next I will try to get the error/status into a group as well.

A regex match can span multiple lines by default, so there is nothing special you need to do to make a pattern capture across line breaks. However, some character classes treat new-line characters specially. Most notably, the . character class matches every character except a new-line (unless that behavior is overridden with an option). Therefore, if you want to capture any character including new-lines, you can't just use .* because that will only match up to the end of the current line.
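To make the point about . concrete, here is a minimal sketch, assuming .NET 5 or later (top-level statements) and a made-up two-line input string. It compares the default behavior of .* with RegexOptions.Singleline, which is the .NET option that lets . match new-lines, and with the option-free class [\s\S].

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-line input, used only for illustration.
string text = "first line\nsecond line";

// By default '.' stops at the line break, so only the first line is matched.
string defaultMatch = Regex.Match(text, ".*").Value;

// RegexOptions.Singleline makes '.' match '\n' as well.
string singlelineMatch = Regex.Match(text, ".*", RegexOptions.Singleline).Value;

// [\s\S] matches any character, including '\n', without any option.
string classMatch = Regex.Match(text, @"[\s\S]*").Value;

Console.WriteLine(defaultMatch);             // first line
Console.WriteLine(singlelineMatch == text);  // True
Console.WriteLine(classMatch == text);       // True
```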

You could use (.|\n)* but it is preferable, when possible, to use a negated character class. For instance, if you need to get the values inside the brackets in the following example:

 [Value One] some
 random

 data
 [Value Two]

You could use (\[(?<value>[^]]*)\][^[]*)* . Notice that [^]]* is used as the pattern for the value inside the brackets and [^[]* is used as the pattern for everything outside of the brackets. A negated character class matches any character that isn't in its list. For instance, [^abc] will match any character that is not a, b, or c. So [^[] just means any character that is not an opening square bracket. Since new-line characters are not square brackets, it will match a new-line character as well as any other kind of character.
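As a rough illustration of the negated-class pattern above, the following sketch (assuming .NET 5 or later with top-level statements) runs it against the sample text and reads every value from the Captures collection of the named group. The input string is written so that it starts directly with the first bracket. Because the whole pattern is wrapped in a * repetition, it can also match the empty string, so Regex.Match would return an empty match at position 0 if the input began with any other character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The sample text from above, starting directly at the first '['.
string input = "[Value One] some\nrandom\n\ndata\n[Value Two]";

// [^]]* : everything up to the closing bracket (new-lines included).
// [^[]* : everything up to the next opening bracket (new-lines included).
Match m = Regex.Match(input, @"(\[(?<value>[^]]*)\][^[]*)*");

// The group repeats, so each iteration is stored as a separate capture.
var values = new List<string>();
foreach (Capture c in m.Groups["value"].Captures)
{
    values.Add(c.Value);
}

Console.WriteLine(string.Join(" | ", values)); // Value One | Value Two
```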

The reason I said that a negated character class is preferable to something like (.|\n)* is that, in order to use (.|\n)* , you'd have to make the * repetition lazy (e.g. (.|\n)*?\[ ). Lazy (i.e. non-greedy) repetitions can cause a lot of backtracking, which can hurt performance. For that reason, it's best to use negated character classes in place of lazy repetitions whenever possible.

You can do all the work in one single regex, but I think it would be very hard to write and maintain. May I suggest splitting it into two regexes? You can use this one to get the index and the bracketed description in separate groups (the Download/Upload field is matched but not captured):

\[([0-9]+?)\]\s<\w+?>.+\n\[([\s\S]+?)\]

Then you can take the group containing the message and apply this regex to it:

Name:\s(.+?)\n[\s\S]*?(Error:|Status:)\s+?(.+?)$

Before you use the regex above, be sure to call Trim() on the string that contains the message; otherwise the regex may not work properly.

Here is some C# code that uses both regexes:

// Requires: using System.Text.RegularExpressions;
// First pass: group 1 = index, group 2 = body of the bracketed block.
// [0-9] rather than [1-9], so that indexes such as 10 or 20 also match.
Regex regex1 = new Regex("\\[([0-9]+?)\\]\\s<\\w+?>.+\\n\\[([\\s\\S]+?)\\]");
// Second pass: group 1 = file name, group 3 = status or error text.
Regex regex2 = new Regex("Name:\\s(.+?)\\n[\\s\\S]*?(Error:|Status:)\\s+?(.+?)$");

foreach (Match match in regex1.Matches(logMessage))
{
    string indexField = match.Groups[1].Value;
    // Trim so that Name: is at the start and $ can anchor at the end of the block.
    string message = match.Groups[2].Value.Trim();
    if (message.Length == 0)
    {
        continue;
    }

    Match messageMatch = regex2.Match(message);
    if (messageMatch.Success)
    {
        string name = messageMatch.Groups[1].Value.Trim();
        string statusError = messageMatch.Groups[3].Value.Trim();
    }
}

You could come up with something like the following regex in free-spacing mode (the ~ characters are pattern delimiters, not part of the regex):

~
\[(?<index>\d+)\]\s*
<(?<task>\w+)>(?s).*?
\[(?s).*?
Name:\s*(?<filename>[^\n]+)(?s).*?
(?:Status|Error):\s*(?<status>\d+)(?s).*?
\]
~
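In C# there are no pattern delimiters, and free-spacing mode is enabled with RegexOptions.IgnorePatternWhitespace. The following is a sketch of how a single-regex approach along these lines might look, assuming .NET 5 or later (top-level statements). The pattern is an adaptation rather than a copy of the one above: the repeated (?s) is replaced by RegexOptions.Singleline, and the # comments are additions. The log string reproduces the sample from the question with \n line endings, which is an assumption about the real files.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sample log from the question, assumed to use '\n' line endings.
string log =
    "[1] <Download> 13.01.2016 15:16:47\n[\nName: foobar.tif\nStatus:              0 (ok)\n]\n\n" +
    "[2] <Download> 13.01.2016 15:17:50\n[\nName: foobar2.tif\nError: 7100: No file found!\n]\n\n" +
    "[3] <Upload> 13.01.2016 15:17:53\n[\nName: Company.tif\nSize: 3476\nError: 7200: Unauthorized!\n]\n";

// Free-spacing pattern: whitespace and '#' comments are ignored by the engine.
string pattern = @"
    \[(?<index>\d+)\]\s*                    # index
    <(?<task>\w+)>.*?                       # task, then the rest of the header line
    \[.*?                                   # opening bracket of the block
    Name:\s*(?<filename>[^\n]+).*?          # file name up to the end of its line
    (?:Status|Error):\s*(?<status>\d+).*?   # numeric status or error code
    \]                                      # closing bracket of the block
";

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline;

var results = new List<string>();
foreach (Match m in Regex.Matches(log, pattern, options))
{
    results.Add($"{m.Groups["index"].Value} {m.Groups["task"].Value} " +
                $"{m.Groups["filename"].Value.Trim()} {m.Groups["status"].Value}");
}

foreach (string line in results)
{
    Console.WriteLine(line);
}
// 1 Download foobar.tif 0
// 2 Download foobar2.tif 7100
// 3 Upload Company.tif 7200
```

Note that the status group only captures the leading number, so the text after an error code (for example "No file found!") is not part of the result.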
