Regex to match pattern until next occurrence of it
I have the following data:
2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters) {
'x':1,
'y':2,
'z':3,
'w':4
}
2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters){
'a':5,
'b':6,
'c':7,
'd':8
}
I have to extract all DEBUG statements, and for that I am using this regex:
(\d{4}\-\d{2}\-\d{2}\ \d{2}\:\d{2}\:\d{2}\ DEBUG(.|\n|\r)*?)(?=\d{4}\-\d{2}\-\d{2}\ \d{2}\:\d{2}\:\d{2})
but it omits the last DEBUG statement. What should the regex be to obtain the following output?
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters) {
'x':1,
'y':2,
'z':3,
'w':4
}
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters){
'a':5,
'b':6,
'c':7,
'd':8
}
I suggest:
- anchoring the match at the start of a line to make it safer (using ^ with the (?m) modifier)
- fixing the current issue by adding an alternative for the end of the string, \Z (same as Ken suggests in the comments)
- replacing the very inefficient (.|\r|\n)*? pattern with .*? and adding the DOTALL modifier (?s)
The whole fix will look like
(?sm)^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} DEBUG\s*(.*?)(?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)
See the regex demo.
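A minimal Python sketch of applying the pattern above, assuming the whole log is already in memory as one string. The sample text and the names LOG, TIMESTAMP, and DEBUG_ENTRY are illustrative and not part of the original question.

```python
import re

# Shortened sample log (illustrative); note there is no trailing newline.
LOG = """\
2018-03-20 23:28:47 INFO first info line
2018-03-20 23:28:47 DEBUG first debug {
'x':1,
'y':2
}
2018-03-20 23:28:47 INFO second info line
2018-03-20 23:28:47 DEBUG last debug {
'a':5
}"""

TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# Same pattern as above, assembled from the shared timestamp fragment.
DEBUG_ENTRY = re.compile(
    r"(?sm)^" + TIMESTAMP + r" DEBUG\s*(.*?)(?=[\r\n]+" + TIMESTAMP + r"|\Z)"
)

# group(0) is the whole entry; group(1) would be the message body only.
entries = [m.group(0) for m in DEBUG_ENTRY.finditer(LOG)]

for entry in entries:
    print(entry)
    print("---")
```

Because the lookahead also accepts \Z, the final DEBUG entry is returned even though no timestamp follows it. If the input ends with a newline, that newline becomes part of the last match, so calling rstrip() on each entry may be worthwhile.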
Details
(?sm) - DOTALL and MULTILINE options on
^ - start of a line
\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - a timestamp-like pattern
DEBUG - a literal substring
\s* - 0+ whitespaces
(.*?) - Group 1: any 0+ chars, as few as possible, up to but excluding
(?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z) - a positive lookahead that requires either
    [\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - one or more CR or LF symbol(s) followed by a timestamp-like pattern
    | - or
    \Z - the very end of the string
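The breakdown above can also be written as a commented pattern. This is only a sketch using Python's re.VERBOSE flag, where unescaped whitespace in the pattern is ignored, so each literal space is written as [ ]. The sample log and the names LOG and DEBUG_ENTRY are assumptions made for illustration.

```python
import re

DEBUG_ENTRY = re.compile(
    r"""
    ^                                         # start of a line (MULTILINE)
    \d{4}-\d{2}-\d{2}[ ]\d{2}:\d{2}:\d{2}     # timestamp-like prefix
    [ ]DEBUG                                  # literal substring after a space
    \s*                                       # 0+ whitespace characters
    (.*?)                                     # Group 1: message body, as short as possible
    (?=                                       # lookahead: stop right before either
        [\r\n]+\d{4}-\d{2}-\d{2}[ ]\d{2}:\d{2}:\d{2}   # line break(s) and the next timestamp
      | \Z                                    # or the very end of the string
    )
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)

# Illustrative input: the last DEBUG entry has no braces and no trailing newline.
LOG = (
    "2018-03-20 23:28:47 INFO started\n"
    "2018-03-20 23:28:48 DEBUG payload {\n"
    "'x':1\n"
    "}\n"
    "2018-03-20 23:28:49 DEBUG done"
)

# Group 1 holds only the text after the DEBUG marker.
bodies = [m.group(1) for m in DEBUG_ENTRY.finditer(LOG)]
print(bodies)
```

Here the flags are passed to re.compile instead of the inline (?sm) prefix; the two spellings are equivalent in Python.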
If you are sure that all the paragraphs with DEBUG will end with }, you can use:
r"(.*DEBUG[\s\S]*?\})"
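A short sketch of how this pattern could be used with re.findall, assuming (as stated above) that every DEBUG entry ends with } and that no } appears earlier in the message. The sample log and the variable names are illustrative.

```python
import re

# Illustrative input in which every DEBUG entry ends with a closing brace.
LOG = (
    "2018-03-20 23:28:47 INFO info line\n"
    "2018-03-20 23:28:47 DEBUG first {\n"
    "'x':1\n"
    "}\n"
    "2018-03-20 23:28:47 INFO another\n"
    "2018-03-20 23:28:47 DEBUG second{\n"
    "'a':5\n"
    "}\n"
)

# With a single capturing group, findall returns a list of strings.
matches = re.findall(r"(.*DEBUG[\s\S]*?\})", LOG)

for entry in matches:
    print(entry)
    print("---")
```

The lazy [\s\S]*? stops at the first }, so a DEBUG entry without braces would run on into a later entry until a } is found.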
If DEBUG may or may not have {}, the following regex should do the trick:
r"(.*DEBUG.*(?!=\{|\n))(\{[\s\S]*?\})?"
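One thing to watch when trying this in Python: (?!=\{|\n) parses as a negative lookahead for either ={ or a newline, so on a brace-less DEBUG line that is followed by a newline the greedy .* appears to give back its final character. The sketch below is a hypothetical variant, not the answerer's pattern. It assumes the same intent (group 1 is the DEBUG line, group 2 an optional {...} block) and uses a negated character class in place of the lookahead. The names LOG and DEBUG_LINE are illustrative.

```python
import re

# Illustrative input: one DEBUG line without braces, one with a {...} block.
LOG = (
    "2018-03-20 23:28:47 DEBUG plain message\n"
    "2018-03-20 23:28:48 DEBUG with payload {\n"
    "'a':5\n"
    "}\n"
)

# Hypothetical variant: group 1 runs up to "{" or the end of the line,
# group 2 is the optional {...} block that follows.
DEBUG_LINE = re.compile(r"(.*DEBUG[^{\n]*)(\{[\s\S]*?\})?")

# group(0) joins both groups into the complete entry.
entries = [m.group(0) for m in DEBUG_LINE.finditer(LOG)]

for entry in entries:
    print(entry)
    print("---")
```

This variant still captures only the first line of a multi-line DEBUG message that has no braces; for that case the lookahead-based pattern with \Z is the safer choice.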