Removing double blank lines in a regex group match

Question

Situation: I have some text and can only use one regex group to reach the goal. I need to cut the text at the first line of 5 or more "=" characters and remove double blank lines.

This is the regex for matching the text. The programming language is Java. It matches everything before a new line that starts with 5 or more "=":

([^]+?)\n[=]{5,}

Now I need to replace all double empty lines in the matching group. I cannot change the Java code; the only things I can change are the regex itself and which matching group is taken from the result.

Sample Text:

Hello World

this is text.

Cheers

================

Unimportant text

should result in:

Hello World

this is text.

Cheers

The Java code is the following, but can't be changed:

String regex = "([\\s|\\S]+?)\n[=]{5,}";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    for (int i = 0; i < matcher.groupCount(); i++) {
         System.out.println("Group " + i + ":\n" + matcher.group(i));
    }
}

Only the regex can be changed.
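To make the constraint concrete, here is a minimal sketch of what the loop above does when run as written. The class name `GroupLoopDemo`, the `run` helper, and the shortened sample string are hypothetical and only there to make the snippet self-contained; the regex and the loop are copied from the code above. Because `groupCount()` does not count group 0, the loop body runs only for `i = 0` when the pattern has a single capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupLoopDemo {
    // Mirrors the fixed loop from the question, but collects the output
    // in a String instead of printing it (hypothetical helper).
    static String run(String text) {
        String regex = "([\\s|\\S]+?)\n[=]{5,}";
        Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            // groupCount() excludes group 0, so with one capturing group
            // only i = 0 (the whole match) is visited.
            for (int i = 0; i < matcher.groupCount(); i++) {
                out.append("Group ").append(i).append(":\n")
                   .append(matcher.group(i)).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened sample text, assumed for illustration.
        System.out.print(run("Hello World\n\nCheers\n=====\nUnimportant text"));
    }
}
```

With this input the only group reported is group 0, which is the whole match and therefore still includes the line of "=" characters.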

Answer 1

I don't believe that regular expressions are capable of intelligently doing this in a single pass (two passes would be a piece of cake).

However, I've devised something a bit ugly. A standard repeat quantifier won't do, because you want to modify the subcontents and you don't have access to the underlying Java.

(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*

It captures everything before the first run of four consecutive newlines as $1, and it captures the first two of those newlines as $2, for use in the replacement later.

The next group is the same, except that it is followed by a ? quantifier meaning 0 or 1 times, and is thus optional. This group captures the content as $3 and the newlines as $4.

Finally, the last group is the content at the end, $5.

You can repeat this group as many times as you like.

Here's a version with four repetitions following the same pattern: groups $1, $3, $5, $7, $9 contain the contents between the excessive newlines, $2, $4, $6, $8, $10 contain the two newlines, and $11 contains the remaining contents.

(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*

When using the regex immediately above, your replacement would look something like $1$2$3$4$5$6$7$8$9$10$11.

It's not pretty, for sure, but it's working with what you have.

Finally, an explanation of the first regex (the second is the same with more repetitions):

 (?:                      # Opens NCG1
     (                    # Opens CG1
         [\s\S]*?         # Character class (any of the characters within)
                            # A character class and negated character class, common expression meaning any character.
                            # * repeats zero or more times
                            # ? as few times as possible
     )                    # Closes CG1
     (?:                  # Opens NCG2
         (                # Opens CG2
             \n           # Token: \n (newline)
             \n           # Token: \n (newline)
         )                # Closes CG2
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )?                   # Closes NCG2
                            # ? repeats zero or one times
 )                        # Closes NCG1
 # begin repeat section
 (?:                      # Opens NCG3
     (                    # Opens CG3
         [\s\S]*?         # Character class (any of the characters within)
                            # A character class and negated character class, common expression meaning any character.
     )                    # Closes CG3
     (                    # Opens CG4
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )                    # Closes CG4
     \n                   # Token: \n (newline)
     \n                   # Token: \n (newline)
 )?                       # Closes NCG3
 # end repeat section
 (                        # Opens CG5
     [\s\S]*?             # Character class (any of the characters within)
 )                        # Closes CG5
 ={5,}                    # Literal =
                            # Repeats 5 or more times
 [\s\S]*                  # Character class (any of the characters within)
                            # * repeats zero or more times
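As a rough illustration of how the first regex and a `$1$2$3$4$5` replacement fit together, here is a minimal sketch. It assumes the replacement is applied with `String`/`Matcher` `replaceAll`, which the answer does not spell out; the class name `CollapseDemo`, the `collapse` helper, and the sample string are hypothetical.

```java
import java.util.regex.Pattern;

class CollapseDemo {
    // First regex from the answer, written as a Java string literal.
    static final Pattern TWO_SECTIONS = Pattern.compile(
        "(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)"
        + "(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?"
        + "([\\s\\S]*?)={5,}[\\s\\S]*");

    // Hypothetical helper: rebuilds the text from the captured pieces.
    // Groups that did not take part in the match contribute nothing.
    static String collapse(String text) {
        return TWO_SECTIONS.matcher(text).replaceAll("$1$2$3$4$5");
    }

    public static void main(String[] args) {
        // Assumed sample: one run of four newlines, then the "=" line.
        String text = "Hello World\n\n\n\nthis is text.\n\nCheers\n=====\nUnimportant text";
        System.out.print(collapse(text));
    }
}
```

On this sample the single run of four newlines is reduced to two, and everything from the "=" line onward is dropped. Text containing more such runs than the pattern has repeated sections would keep the extra runs, which is why the longer variant repeats the section.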

Answer 2

try {
    String resultString = YOURSTRING.replaceAll("(?ism)[=]{5,}.*", "");
    resultString = resultString.replaceAll("(?ism)^\\s+$", "");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}

The first regex removes [=]{5,} (5 or more =) and all text after it.
The second cleans up blank lines.
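For reference, the two passes can be wrapped into a small self-contained sketch and tried on a sample. The class name `TwoPassDemo`, the `clean` helper, and the sample string are hypothetical; the two patterns are the ones shown above. Note that this approach calls `replaceAll` from Java, so it only applies if the Java code may be changed after all.

```java
class TwoPassDemo {
    // Hypothetical helper combining the two replaceAll calls from this answer.
    static String clean(String text) {
        // Pass 1: drop the first run of 5+ '=' and everything after it
        // ((?s) lets '.' match newlines as well).
        String cut = text.replaceAll("(?ism)[=]{5,}.*", "");
        // Pass 2: remove whitespace-only stretches that start at a line start
        // and end at a line end ((?m) makes ^ and $ work per line).
        return cut.replaceAll("(?ism)^\\s+$", "");
    }

    public static void main(String[] args) {
        // Assumed sample: one run of four newlines, then the "=" line.
        String text = "Hello World\n\n\n\nthis is text.\n\nCheers\n=====\nUnimportant text";
        System.out.print(clean(text));
    }
}
```

On this sample the four newlines after "Hello World" shrink to two, so a single blank line remains between the paragraphs.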

Answer 1: score 1, accepted, 2015-05-04 15:05:27

Answer 2: score 0, 2015-05-04 12:25:10
