Remove double blank lines in regex group matching
Situation: I have some text and can only use one regex group to reach the goal. I need to cut the text at the first line of 5 or more "=" and remove double blank lines.
This is the regex for matching the text. The programming language is Java. It matches everything before a new line with 5 or more "=":
([\s|\S]+?)\n[=]{5,}
Now I need to replace all double empty lines in the matching group. I cannot change the Java code; the only things I can change are which matching group is taken from the result and the regex itself.
Sample text:
Hello World
this is text.
Cheers
================
Unimportant text
should result in:
Hello World
this is text.
Cheers
The Java code is the following, but it can't be changed:
String regex = "([\\s|\\S]+?)\n[=]{5,}";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    for (int i = 0; i < matcher.groupCount(); i++) {
        System.out.println("Group " + i + ":\n" + matcher.group(i));
    }
}
Only the regex can be changed.
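When experimenting with candidate patterns, it may help to reproduce the fixed loop locally. The sketch below is only an illustration: the class name FixedLoopDemo, the helper printedGroups, and the sample string are assumptions, and the helper collects what the loop would print instead of printing it. Because Matcher.groupCount() does not count group 0, a pattern with a single capturing group makes the loop emit only group 0 (the whole match).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedLoopDemo {
    // Mirrors the fixed loop from the question, but collects the values
    // it would print so they can be inspected (hypothetical helper).
    static List<String> printedGroups(String regex, String text) {
        List<String> out = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // groupCount() excludes group 0, so with one capturing group
            // this loop runs for i = 0 only.
            for (int i = 0; i < matcher.groupCount(); i++) {
                out.add(matcher.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample input, not taken verbatim from the question.
        String text = "Hello World\n\n\nCheers\n================\nUnimportant text";
        for (String s : printedGroups("([\\s|\\S]+?)\n[=]{5,}", text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

With this in place, different regexes can be tried against the same text to see exactly which groups the unchangeable code would surface.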
I don't believe that regular expressions are capable of intelligently doing this in a single pass (two passes would be a piece of cake).
However, I've devised something a bit ugly. A standard repeat quantifier won't do, because a repeated capturing group keeps only its last iteration, and you want to modify the captured contents without access to the underlying Java.
(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*
It is meant to capture everything before the first run of four newlines as $1 and the first two of those newlines as $2, for use in the replacement later.
The next group is the same, except that it is followed by a ? quantifier meaning 0 or 1 times, and is thus optional. This group captures the content as $3 and the newlines as $4.
Finally, the last group, $5, is the content at the end.
You can repeat this group as many times as you like.
Here's a version with four repetitions following the same pattern: groups $1, $3, $5, $7, $9 contain the contents between the excessive newlines, $2, $4, $6, $8, $10 contain the two newlines to keep, and $11 contains the remaining contents.
(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*
When using the regex immediately above, your replacement would look something like
$1$2$3$4$5$6$7$8$9$10$11
It's not pretty, for sure, but it works with what you have.
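To try the pattern and replacement outside the fixed application, a minimal sketch like the following could be used. The class name CollapseDemo, the helper collapse, the sample string (with runs of four newlines), and the use of Matcher.replaceAll are all assumptions made for illustration; the pattern is the four-repetition regex above, assembled from its three parts.

```java
import java.util.regex.Pattern;

class CollapseDemo {
    // The four-repetition pattern, split into its parts for readability.
    static final String HEAD = "(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)";
    static final String REPEAT = "(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?";
    static final String TAIL = "([\\s\\S]*?)={5,}[\\s\\S]*";
    static final Pattern PATTERN =
            Pattern.compile(HEAD + REPEAT + REPEAT + REPEAT + REPEAT + TAIL);

    // Hypothetical helper: rebuilds the text from the captured groups.
    static String collapse(String text) {
        // Groups that did not take part in the match contribute nothing
        // to the replacement.
        return PATTERN.matcher(text).replaceAll("$1$2$3$4$5$6$7$8$9$10$11");
    }

    public static void main(String[] args) {
        // Assumed sample: two runs of four newlines, then the separator.
        String text = "Hello World\n\n\n\nthis is text.\n\n\n\nCheers\n"
                + "================\nUnimportant text";
        System.out.println(collapse(text));
    }
}
```

Only as many runs of four newlines as there are repeat sections get collapsed; any further runs stay untouched inside the last group.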
Finally, an explanation of the first regex (the second is the same, with more repetitions):
(?: # Opens NCG1
( # Opens CG1
[\s\S]*? # Character class (any of the characters within)
# \s (whitespace) and \S (non-whitespace) together: a common idiom meaning any character, newlines included.
# * repeats zero or more times
# ? as few times as possible
) # Closes CG1
(?: # Opens NCG2
( # Opens CG2
\n # Token: \n (newline)
\n # Token: \n (newline)
) # Closes CG2
\n # Token: \n (newline)
\n # Token: \n (newline)
)? # Closes NCG2
# ? repeats zero or one times
) # Closes NCG1
# begin repeat section
(?: # Opens NCG3
( # Opens CG3
[\s\S]*? # Character class (any of the characters within)
# \s (whitespace) and \S (non-whitespace) together: a common idiom meaning any character, newlines included.
) # Closes CG3
( # Opens CG4
\n # Token: \n (newline)
\n # Token: \n (newline)
) # Closes CG4
\n # Token: \n (newline)
\n # Token: \n (newline)
)? # Closes NCG3
# end repeat section
( # Opens CG5
[\s\S]*? # Character class (any of the characters within)
) # Closes CG5
={5,} # Literal =
# Repeats 5 or more times
[\s\S]* # Character class (any of the characters within)
# * repeats zero or more times
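An annotated layout like the one above can also be kept in code, since Java's Pattern.COMMENTS flag ignores whitespace and #-comments inside a pattern. The following is a sketch under that assumption; the class name VerbosePatternDemo, the helper collapse, and the sample input are hypothetical, and the pattern is the first (single-repetition) regex.

```java
import java.util.regex.Pattern;

class VerbosePatternDemo {
    // The first regex in free-spacing form. Each "\n" in the Java source
    // ends a comment line; "\\n" is the regex escape for a newline.
    static final Pattern VERBOSE = Pattern.compile(
            "(?: ([\\s\\S]*?) (?: (\\n\\n) \\n\\n )? )  # CG1 content, CG2 kept newlines\n"
          + "(?: ([\\s\\S]*?) (\\n\\n) \\n\\n )?         # repeat section: CG3, CG4\n"
          + "([\\s\\S]*?)                                # CG5 trailing content\n"
          + "={5,} [\\s\\S]*                             # separator and the rest\n",
            Pattern.COMMENTS);

    // Hypothetical helper: rebuilds the text from groups 1 to 5.
    static String collapse(String text) {
        return VERBOSE.matcher(text).replaceAll("$1$2$3$4$5");
    }

    public static void main(String[] args) {
        // Assumed sample with a single run of four newlines.
        System.out.println(collapse("Hello World\n\n\n\nCheers\n=====\nrest"));
    }
}
```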
String resultString = YOURSTRING;
try {
    // Pass 1: drop the first run of 5 or more '=' and everything after it.
    resultString = resultString.replaceAll("(?ism)[=]{5,}.*", "");
    // Pass 2: remove whitespace-only stretches between lines.
    resultString = resultString.replaceAll("(?ism)^\\s+$", "");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}
The first regex removes [=]{5,} (5 or more =) and all text after it.
The second cleans up blank lines.
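As a rough check of the two passes, they can be wrapped in a small runnable class. This is a sketch only: the class name TwoPassCleanup, the helper clean, and the sample string (with two blank lines between paragraphs) are assumptions, while the two regexes are the ones from the snippet above.

```java
class TwoPassCleanup {
    // Hypothetical helper applying the two replacements in sequence.
    static String clean(String text) {
        // Pass 1: with (?s), '.' also matches newlines, so everything from
        // the first run of 5 or more '=' to the end of the text is removed.
        String cut = text.replaceAll("(?ism)[=]{5,}.*", "");
        // Pass 2: with (?m), ^ and $ match at line boundaries, so
        // whitespace that fills whole lines is removed.
        return cut.replaceAll("(?ism)^\\s+$", "");
    }

    public static void main(String[] args) {
        // Assumed sample input with two blank lines between paragraphs.
        String text = "Hello World\n\n\nthis is text.\n\n\nCheers\n"
                + "================\nUnimportant text";
        System.out.println(clean(text));
    }
}
```

Note that this approach needs two replaceAll calls in Java, so it applies only where the code itself can be changed.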