
Remove double blank lines in regex group matching

Situation: I have some text and can only use one regex group to reach the goal. I need to cut the text off at a run of 5 or more "=" and remove double blank lines.

The regex I use for matching the text is the one in the Java code below. The programming language is Java. It matches everything before a newline that is followed by 5 or more "=".

Now I need to remove all double empty lines inside the matching group. I cannot change the Java code; the only things I can change are the regex itself and which matching group is taken from the result.

Sample Text:

Hello World

this is text.



Unimportant text

should result in:

Hello World

this is text.


The Java code is the following, but can't be changed:

String regex = "([\\s|\\S]+?)\n[=]{5,}";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    for (int i = 0; i < matcher.groupCount(); i++) {
        System.out.println("Group " + i + ":\n" + matcher.group(i));
    }
}

Only the regex can be changed.

I don't believe that regular expressions are capable of intelligently doing this in a single pass (2 passes is cake).

However, I've devised something a bit ugly. A standard repeat quantifier won't do, because you want to modify the subcontents and you don't have access to the underlying Java.


The regex captures everything before the first run of four newlines as $1, and the first two of those newlines as $2, for use in the replacement later.

The next group is the same except that it is followed by a ? quantifier meaning 0 or 1 times, and thus optional. This group captures the content as $3 and the newlines as $4.

Finally the last group is content at the end, $5.

You can repeat this group as many times as you like.
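
For what it's worth, here is how the groups behave when the pattern is written on one line. This is only a sketch: the regex string is reassembled from the commented breakdown in this answer, and the class name, method name, and sample input (including its "=====" line) are made up for illustration. It calls replaceAll just to show what ends up in each group.

```java
import java.util.regex.Pattern;

final class BlankLineGroups {
    // Assumed one-line form of the pattern, rebuilt from the commented breakdown.
    static final Pattern ONE_REPEAT = Pattern.compile(
        "(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)"   // $1 content, $2 the two newlines to keep (optional)
      + "(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?"       // $3 content, $4 the two newlines to keep (optional)
      + "([\\s\\S]*?)"                          // $5 content right before the separator
      + "={5,}[\\s\\S]*");                      // 5 or more '=' and everything after

    // Rebuilds the text from the groups; the two uncaptured newlines are dropped.
    // In Java, a group that did not take part in the match adds nothing here.
    static String collapse(String text) {
        return ONE_REPEAT.matcher(text).replaceAll("$1$2$3$4$5");
    }

    public static void main(String[] args) {
        // Hypothetical input: four newlines before the separator line.
        String sample = "Hello World\n\nthis is text.\n\n\n\n=====\nUnimportant text";
        System.out.println("[" + collapse(sample) + "]");
    }
}
```

With only one optional section, this handles at most one run of four newlines in front of the separator (plus one at the very start of the text).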

Here's a version with four repetitions following the same pattern. Groups $1, $3, $5, $7, $9 contain the contents between the excessive newlines, groups $2, $4, $6, $8, $10 contain the two newlines to keep, and $11 contains the trailing content.


With the regex immediately above, your replacement would look something like $1$2$3$4$5$6$7$8$9$10$11 .
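
Since the repetition is mechanical, the pattern and its replacement string could also be generated instead of typed out. The sketch below assumes the repeated unit is the optional "content plus four newlines" section from the breakdown in this answer; the class and method names are hypothetical, and it is only usable where you control the Java side.

```java
import java.util.regex.Pattern;

final class RepeatedBlankLinePattern {
    // Leading section, then `repeats` optional sections, then the tail up to the '=' run.
    static String buildRegex(int repeats) {
        StringBuilder sb = new StringBuilder("(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)");
        for (int i = 0; i < repeats; i++) {
            sb.append("(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?");
        }
        return sb.append("([\\s\\S]*?)={5,}[\\s\\S]*").toString();
    }

    // "$1$2...$n" for all capturing groups: two per section plus the final content group.
    static String buildReplacement(int repeats) {
        int groups = 2 * (repeats + 1) + 1;
        StringBuilder sb = new StringBuilder();
        for (int g = 1; g <= groups; g++) {
            sb.append('$').append(g);
        }
        return sb.toString();
    }

    // Applies the generated pattern and replacement to the text.
    static String collapse(String text, int repeats) {
        return Pattern.compile(buildRegex(repeats))
                      .matcher(text)
                      .replaceAll(buildReplacement(repeats));
    }

    public static void main(String[] args) {
        System.out.println(buildReplacement(4));
        System.out.println("[" + collapse("a\n\n\n\nb\n\n\n\nc\n=====\nx", 4) + "]");
    }
}
```

Calling buildReplacement(4) gives the eleven-group replacement string mentioned above.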

It's not pretty, for sure, but it's working with what you have.

Finally, an explanation of the first regex (since the second is the same with more repetitions). CG stands for capturing group and NCG for non-capturing group.

 (?:                      # Opens NCG1
     (                    # Opens CG1
         [\s\S]*?         # Character class (any of the characters within)
                            # A character class and negated character class, common expression meaning any character.
                            # * repeats zero or more times
                            # ? as few times as possible
     )                    # Closes CG1
     (?:                  # Opens NCG2
         (                # Opens CG2
             \n           # Token: \n (newline)
             \n           # Token: \n (newline)
         )                # Closes CG2
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )?                   # Closes NCG2
                            # ? repeats zero or one times
 )                        # Closes NCG1
 # begin repeat section
 (?:                      # Opens NCG3
     (                    # Opens CG3
         [\s\S]*?         # Character class (any of the characters within)
                            # A character class and negated character class, common expression meaning any character.
     )                    # Closes CG3
     (                    # Opens CG4
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )                    # Closes CG4
     \n                   # Token: \n (newline)
     \n                   # Token: \n (newline)
 )?                       # Closes NCG3
 # end repeat section
 (                        # Opens CG5
     [\s\S]*?             # Character class (any of the characters within)
 )                        # Closes CG5
 ={5,}                    # Literal =
                            # Repeats 5 or more times
 [\s\S]*                  # Character class (any of the characters within)
                            # * repeats zero or more times

try {
    String resultString = YOURSTRING.replaceAll("(?ism)[=]{5,}.*", "");
    resultString = resultString.replaceAll("(?ism)^\\s+$", "");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}

The first regex matches [=]{5,} (5 or more =) and all text after it, and replaces that with nothing; the s flag lets . match newlines too.
The second cleans up lines that contain only whitespace.
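
As a rough, self-contained version of the two replacements above: the regex strings are taken unchanged from the snippet, while the class name, method name, and sample input (including its "=====" line) are made up for illustration. Note that this needs two replaceAll calls, so it only applies if the Java side can be edited after all.

```java
final class TwoPassCleanup {
    static String clean(String text) {
        // Pass 1: remove the run of 5 or more '=' and everything after it.
        String truncated = text.replaceAll("(?ism)[=]{5,}.*", "");
        // Pass 2: remove whitespace-only runs that begin at a line start
        // and end at a line end.
        return truncated.replaceAll("(?ism)^\\s+$", "");
    }

    public static void main(String[] args) {
        // Hypothetical input modelled on the sample text in the question.
        String sample = "Hello World\n\nthis is text.\n\n\n\n=====\nUnimportant text";
        System.out.println("[" + clean(sample) + "]");
    }
}
```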
