
Remove double blank lines in regex group matching

Situation: I have some text and can only use one regex group to reach the goal. I need to cut off everything from a line of five or more "=" onward and remove double blank lines.

This is the regex for matching the text. The programming language is Java. It matches everything before a new line that starts with five or more "=".

([\s\S]+?)\n[=]{5,}

Now I need to replace all double empty lines in the matching group. I cannot change the Java code; the only things I can change are the regex itself and which matching group is taken from the result.

Sample Text:

Hello World

this is text.


Cheers

================

Unimportant text

should result in:

Hello World

this is text.

Cheers

The Java code is the following, but it can't be changed:

String regex = "([\\s|\\S]+?)\n[=]{5,}";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    for (int i = 0; i < matcher.groupCount(); i++) {
         System.out.println("Group " + i + ":\n" + matcher.group(i));
    }
}

Only the regex can be changed.
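To make the problem concrete, here is a minimal sketch of what the first capturing group holds with the regex from the question. The class name `CaptureDemo` and the helper `firstGroup` are made up for this illustration, and the sample string assumes the double blank line is three consecutive `\n` characters. The sketch reads group 1 directly, because `Matcher.groupCount()` does not count group 0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String text) {
        // Same regex and flag as in the question's Java code.
        Pattern pattern = Pattern.compile("([\\s|\\S]+?)\n[=]{5,}", Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed sample: two blank lines (three newlines) before "Cheers".
        String text = "Hello World\n\nthis is text.\n\n\nCheers\n\n"
                + "================\n\nUnimportant text";
        String captured = firstGroup(text);
        // The text after the "=" line is gone, but the double blank line is still there.
        System.out.println("Still has a double blank line: " + captured.contains("\n\n\n"));
    }
}
```

The capture stops just before the newline that precedes the "=" line, so cutting the text already works; only the blank-line cleanup is missing.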

I don't believe that regular expressions are capable of intelligently doing this in a single pass (two passes would be a piece of cake).

However, I've devised something a bit ugly. A standard repeat quantifier won't do, because you want to modify the sub-contents and you don't have access to the underlying Java code.

(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*

It captures everything before the first run of four newlines as $1, and it captures the first two of those newlines as $2, for use in the replacement later.

The next group is the same, except that it is followed by a ? quantifier, meaning 0 or 1 times, and is thus optional. This group captures the content as $3 and the newlines as $4.

Finally, the last group, $5, is the content at the end.

You can repeat this group as many times as you like.
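As a rough check of how the five groups combine, the sketch below applies the pattern with `String.replaceAll` and the replacement `$1$2$3$4$5`. The class name `FiveGroupDemo` and the method `collapse` are made up for illustration, and the input assumes the oversized gap is exactly four consecutive `\n` characters, which is what the pattern looks for. It also assumes a place where a replacement string can be supplied at all.

```java
public class FiveGroupDemo {
    // The five-group pattern from the answer, written as a Java string literal.
    static final String REGEX =
            "(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)"   // $1 content, $2 first two newlines (optional)
          + "(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?"       // $3 content, $4 first two newlines (optional)
          + "([\\s\\S]*?)={5,}[\\s\\S]*";           // $5 trailing content, then the cut-off part

    // Hypothetical helper: rebuilds the text from the captured groups only.
    static String collapse(String text) {
        // Groups that did not take part in the match are substituted as empty strings.
        return text.replaceAll(REGEX, "$1$2$3$4$5");
    }

    public static void main(String[] args) {
        String text = "Hello World\n\nthis is text.\n\n\n\nCheers\n\n"
                + "================\n\nUnimportant text";
        System.out.println(collapse(text));
    }
}
```

With this input the four-newline run shrinks to two newlines and everything from the "=" line onward is dropped.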

Here's a version with four repetitions following the same pattern: groups $1, $3, $5, $7, $9 contain the contents between the excessive newlines, $2, $4, $6, $8, $10 contain the two newlines, and $11 contains the remaining content.

(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*

When using the regex immediately above, your replacement would look something like $1$2$3$4$5$6$7$8$9$10$11.

It's not pretty, for sure, but it works with what you have.

Finally, an explanation of the first regex (the second is the same with more repetitions).
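Since the longer pattern is just the same optional block pasted several times, it can be generated instead of typed by hand. The following is only a sketch: `RepeatedPatternBuilder`, `buildRegex`, `buildReplacement` and `collapse` are hypothetical names, and the test input again assumes each oversized gap is exactly four consecutive `\n` characters.

```java
public class RepeatedPatternBuilder {
    // Builds the pattern with the optional "content + newlines" block repeated `repeats` times.
    static String buildRegex(int repeats) {
        StringBuilder sb = new StringBuilder("(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)");
        for (int i = 0; i < repeats; i++) {
            sb.append("(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?");
        }
        return sb.append("([\\s\\S]*?)={5,}[\\s\\S]*").toString();
    }

    // Builds "$1$2...$n", where n = 2 (first block) + 2 per repeat + 1 (trailing content).
    static String buildReplacement(int repeats) {
        int groups = 2 + 2 * repeats + 1;
        StringBuilder sb = new StringBuilder();
        for (int g = 1; g <= groups; g++) {
            sb.append('$').append(g);
        }
        return sb.toString();
    }

    // Hypothetical helper: applies the generated pattern and replacement.
    static String collapse(String text, int repeats) {
        return text.replaceAll(buildRegex(repeats), buildReplacement(repeats));
    }

    public static void main(String[] args) {
        String text = "a\n\n\n\nb\n\n\n\nc\n\n\n\nd\n=====\nrest";
        System.out.println(collapse(text, 4));
    }
}
```

Calling `buildRegex(4)` yields the eleven-group pattern shown above. The number of repetitions still caps how many oversized gaps can be handled, which is the limitation the answer already concedes.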

 (?:                      # Opens NCG1
     (                    # Opens CG1
         [\s\S]*?         # Character class (any of the characters within)
                            # A character class and negated character class, common expression meaning any character.
                            # * repeats zero or more times
                            # ? as few times as possible
     )                    # Closes CG1
     (?:                  # Opens NCG2
         (                # Opens CG2
             \n           # Token: \n (newline)
             \n           # Token: \n (newline)
         )                # Closes CG2
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )?                   # Closes NCG2
                            # ? repeats zero or one times
 )                        # Closes NCG1
 # begin repeat section
 (?:                      # Opens NCG3
     (                    # Opens CG3
         [\s\S]*?         # Character class (any of the characters within)
                            # A character class and negated character class, common expression meaning any character.
     )                    # Closes CG3
     (                    # Opens CG4
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )                    # Closes CG4
     \n                   # Token: \n (newline)
     \n                   # Token: \n (newline)
 )?                       # Closes NCG3
 # end repeat section
 (                        # Opens CG5
     [\s\S]*?             # Character class (any of the characters within)
 )                        # Closes CG5
 ={5,}                    # Literal =
                            # Repeats 5 or more times
 [\s\S]*                  # Character class (any of the characters within)
                            # * repeats zero or more times

try {
    // Pass 1: remove the run of five or more "=" and everything after it
    // (the s flag lets . match line breaks).
    String resultString = YOURSTRING.replaceAll("(?ism)[=]{5,}.*", "");
    // Pass 2: remove the surplus blank lines (the m flag makes ^ and $ match per line).
    resultString = resultString.replaceAll("(?ism)^\\s+$", "");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}

The first regex removes [=]{5,} (five or more "=") and all text after it.
The second cleans up the blank lines.
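The two passes can be tried on the sample text with a small self-contained sketch. The class name `TwoPassCleanup` and the method `clean` are made up for illustration, and the input assumes the double blank line is three consecutive `\n` characters. Unlike the regex-only approach, this requires being able to change the Java code.

```java
public class TwoPassCleanup {
    // Hypothetical helper wrapping the two replaceAll calls from the snippet above.
    static String clean(String text) {
        // Pass 1: drop the "=" run and everything after it.
        String result = text.replaceAll("(?ism)[=]{5,}.*", "");
        // Pass 2: drop whitespace that forms a surplus empty line;
        // a single blank line between paragraphs is left alone.
        return result.replaceAll("(?ism)^\\s+$", "");
    }

    public static void main(String[] args) {
        String text = "Hello World\n\nthis is text.\n\n\nCheers\n\n"
                + "================\n\nUnimportant text";
        System.out.print(clean(text));
    }
}
```

The second pass also trims the blank line left at the end of the text once the "=" line has been removed.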
