Regex to match pattern until next occurrence of it

I have the following data:

2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters) {
  'x':1,
  'y':2,
  'z':3,
  'w':4
}
2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters){
  'a':5,
  'b':6,
  'c':7,
  'd':8
}

I have to extract all DEBUG statements, and for that I am using this regex: (\d{4}\-\d{2}\-\d{2}\ \d{2}\:\d{2}\:\d{2}\ DEBUG(.|\n|\r)*?)(?=\d{4}\-\d{2}\-\d{2}\ \d{2}\:\d{2}\:\d{2}) but it omits the last DEBUG statement. What should the regex be to obtain the following output?

2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters) {
  'x':1,
  'y':2,
  'z':3,
  'w':4
}
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters){
  'a':5,
  'b':6,
  'c':7,
  'd':8
}

I suggest:

  • Anchor the matches at the start of a line to make the pattern safer (by using (?m))
  • Fix the current issue by adding an alternative that matches the very end of the string, \Z (the same as Ken suggests in the comments)
  • Replace the very inefficient (.|\r|\n)*? pattern with .*? and add the DOTALL modifier (?s)
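To see why the last entry goes missing: the lookahead in the question only succeeds when another timestamp follows, and nothing follows the final DEBUG block. Here is a minimal Python sketch of that. The log text and the names LOG, TS, original and with_end are made up for illustration, and the pattern is a simplified form of the one in the question.

```python
import re

# Shortened, made-up log in the same shape as the sample data.
LOG = (
    "2018-03-20 23:28:47 INFO first info line\n"
    "2018-03-20 23:28:47 DEBUG first debug {\n"
    "  'x':1\n"
    "}\n"
    "2018-03-20 23:28:47 INFO second info line\n"
    "2018-03-20 23:28:47 DEBUG last debug{\n"
    "  'a':5\n"
    "}"
)

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# Shape of the pattern from the question: the lookahead demands another timestamp.
original = re.compile("(" + TS + r" DEBUG(.|\n|\r)*?)(?=" + TS + ")")
# Same pattern, with the end of the string accepted as a second stopping point.
with_end = re.compile("(" + TS + r" DEBUG(.|\n|\r)*?)(?=" + TS + r"|\Z)")

original_matches = [m.group(1) for m in original.finditer(LOG)]
fixed_matches = [m.group(1) for m in with_end.finditer(LOG)]

print(len(original_matches))  # 1: the last DEBUG block is skipped
print(len(fixed_matches))     # 2: the last DEBUG block is found
```

That covers only the second bullet; the anchoring and the DOTALL change are separate improvements.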

The whole fix will look like this:

(?sm)^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} DEBUG\s*(.*?)(?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)

See the regex demo.

Details

  • (?sm) - DOTALL and MULTILINE options on
  • ^ - start of a line
  • \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - a timestamp-like pattern
  • DEBUG - a literal substring
  • \s* - 0+ whitespace characters
  • (.*?) - Group 1: any 0+ chars, as few as possible, up to (but excluding) the position where the following lookahead succeeds
  • (?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z) - a positive lookahead that requires either
    • [\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - one or more CR or LF symbols followed by a timestamp-like pattern
    • | - or
    • \Z - the very end of the string
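The pattern can be tried in Python roughly as follows. This is a sketch on a shortened, made-up log, and the names LOG, PATTERN, entries and messages are only for illustration. It assumes the log does not end with a trailing newline; if it does, that newline ends up inside the last match, because \Z in Python matches only at the absolute end of the string.

```python
import re

# Shortened, made-up log in the same shape as the sample data.
LOG = (
    "2018-03-20 23:28:47 INFO first info line\n"
    "2018-03-20 23:28:47 DEBUG first debug {\n"
    "  'x':1,\n"
    "  'y':2\n"
    "}\n"
    "2018-03-20 23:28:47 INFO second info line\n"
    "2018-03-20 23:28:47 DEBUG second debug{\n"
    "  'a':5,\n"
    "  'b':6\n"
    "}"
)

# (?sm) must stay at the very start of the pattern string.
PATTERN = re.compile(
    r"(?sm)^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} DEBUG\s*(.*?)"
    r"(?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)"
)

# group(0) is the whole entry, group(1) is only the text after "DEBUG".
entries = [m.group(0) for m in PATTERN.finditer(LOG)]
messages = [m.group(1) for m in PATTERN.finditer(LOG)]

for entry in entries:
    print(entry)
    print("-----")
```

Use group(0) if you need the timestamp and the DEBUG marker in the output, as in the question, and group(1) if you only need the message body.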

If you are sure that all the paragraphs with DEBUG will end with }, you can use:

r"(.*DEBUG[\s\S]*?\})"
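A short sketch of how this behaves with re.findall (the log text and the names LOG and blocks are made up for illustration). Note that the lazy [\s\S]*? stops at the first } it meets, so nested braces would cut a block short, and a DEBUG line without any brace would run on into later entries until some } appears.

```python
import re

# Shortened, made-up log where every DEBUG entry ends with "}".
LOG = (
    "2018-03-20 23:28:47 INFO first info line\n"
    "2018-03-20 23:28:47 DEBUG first debug {\n"
    "  'x':1\n"
    "}\n"
    "2018-03-20 23:28:47 INFO second info line\n"
    "2018-03-20 23:28:47 DEBUG second debug{\n"
    "  'a':5\n"
    "}"
)

# With a single capturing group, findall returns the captured strings.
blocks = re.findall(r"(.*DEBUG[\s\S]*?\})", LOG)

for block in blocks:
    print(block)
    print("-----")
```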

If DEBUG may or may not have {}, the following regex should do the trick:

r"(.*DEBUG[^{\n]*)(\{[\s\S]*?\})?"
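Here is a sketch of the optional-brace idea in Python, written with a negated character class [^{\n]* so that the first group stops at an opening brace or at the end of the line. The log text and the names LOG, PATTERN and found are made up for illustration. It assumes the opening { sits on the same line as DEBUG; a multi-line DEBUG message without braces would be captured only up to the end of its first line.

```python
import re

# Made-up log: one DEBUG entry without braces, one with a brace block.
LOG = (
    "2018-03-20 23:28:47 DEBUG plain debug line\n"
    "2018-03-20 23:28:47 INFO info line\n"
    "2018-03-20 23:28:47 DEBUG debug with braces {\n"
    "  'x':1\n"
    "}"
)

# Group 1: the DEBUG line up to "{" or the end of the line.
# Group 2 (optional): the brace block, up to the first "}".
PATTERN = re.compile(r"(.*DEBUG[^{\n]*)(\{[\s\S]*?\})?")

# group(0) joins both groups, whether or not the brace block is present.
found = [m.group(0) for m in PATTERN.finditer(LOG)]

for item in found:
    print(item)
    print("-----")
```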

