
Remove double blank lines in regex group matching

Situation: I have some text and can only use one regex group to reach the goal. I need to cut the text off at a line of 5 or more "=" and remove double blank lines.

This is the regex for matching the text (the programming language is Java). It matches everything before a newline that is followed by 5 or more "=":

([\s|\S]+?)\n[=]{5,}

Now I need to remove all double empty lines inside the matched group. I cannot change the Java code; the only things I can change are the regex itself and which matching group is taken from the result.

Sample Text:

Hello World

this is text.


Cheers

================

Unimportant text

should result in:

Hello World

this is text.

Cheers

The Java code is the following, but can't be changed:

String regex = "([\\s|\\S]+?)\n[=]{5,}";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    for (int i = 0; i < matcher.groupCount(); i++) {
         System.out.println("Group " + i + ":\n" + matcher.group(i));
    }
}

Only the regex can be changed.
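
To see what the question's regex captures before any de-duplication, here is a minimal sketch that runs the same pattern on the sample text. The class name `CaptureDemo` and the helper `firstGroup` are made up for illustration; only the pattern string and the `MULTILINE` flag come from the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureDemo {
    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String text) {
        // Same regex and flag as in the question's Java code.
        Pattern pattern = Pattern.compile("([\\s|\\S]+?)\n[=]{5,}", Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "Hello World\n\nthis is text.\n\n\nCheers\n\n"
                + "================\n\nUnimportant text";
        // Prints the text before the "=" line; the doubled blank line is still present.
        System.out.println(firstGroup(sample));
    }
}
```

Group 1 still contains the doubled blank line, which is the part the regex alone has to deal with. As a side note, `Matcher.groupCount()` does not count group 0, so a loop bounded by `i < matcher.groupCount()` stops before the last capturing group.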

I don't believe that regular expressions can do this intelligently in a single pass (two passes would be a piece of cake).

However, I've devised something a bit ugly. A standard repeat quantifier won't do, because a repeated capturing group only keeps its last iteration, and you want to keep and modify the contents of every repetition without access to the underlying Java code.

(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*

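The first group is meant to capture everything before the first run of four consecutive newlines as $1, and the first two of those newlines as $2, for use in the replacement later. (Because [\s\S]*? is lazy and the newline part after it is optional, a backtracking engine leaves $1 empty and lets the following optional group capture that content instead, so the first group only matters when the text starts with four newlines.)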

The next group is the same, except that the whole group is followed by a ? quantifier (0 or 1 times) and is therefore optional. This group captures the content as $3 and the newlines as $4.

Finally, the last group, $5, captures the remaining content up to the run of "=".

You can repeat this optional group as many times as you like.

Here's a version with four repetitions following the same pattern. Groups $1, $3, $5, $7 and $9 contain the contents between the excessive newlines, $2, $4, $6, $8 and $10 contain the two newlines to keep, and $11 contains the remaining content before the run of "=".

(?:([\s\S]*?)(?:(\n\n)\n\n)?)(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?(?:([\s\S]*?)(\n\n)\n\n)?([\s\S]*?)={5,}[\s\S]*

If you use the regex immediately above, your replacement would look something like $1$2$3$4$5$6$7$8$9$10$11.

It's not pretty, for sure, but it's working with what you have.
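
The repetition scheme above can be sketched in Java as follows. This is only an illustration under assumptions: the class `RepeatedGroupRegex`, its methods, and the `slots` parameter (the number of optional sections) are invented names, and the sketch uses `replaceAll`, which the question's fixed code does not call. The three pattern fragments are copied from the regexes above.

```java
import java.util.regex.Pattern;

class RepeatedGroupRegex {
    // Fragments of the regex shown above (names are illustrative only).
    static final String HEAD = "(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)";
    static final String SLOT = "(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?";
    static final String TAIL = "([\\s\\S]*?)={5,}[\\s\\S]*";

    // Builds the regex with the optional section repeated `slots` times.
    static String buildRegex(int slots) {
        StringBuilder sb = new StringBuilder(HEAD);
        for (int i = 0; i < slots; i++) {
            sb.append(SLOT);
        }
        return sb.append(TAIL).toString();
    }

    // Builds "$1$2...$N", where N is the number of capturing groups in buildRegex(slots).
    static String buildReplacement(int slots) {
        int groups = 2 + 2 * slots + 1;
        StringBuilder sb = new StringBuilder();
        for (int g = 1; g <= groups; g++) {
            sb.append('$').append(g);
        }
        return sb.toString();
    }

    // Applies the regex and glues the captured groups back together.
    static String collapse(String text, int slots) {
        return Pattern.compile(buildRegex(slots))
                .matcher(text)
                .replaceAll(buildReplacement(slots));
    }

    public static void main(String[] args) {
        String text = "A\n\n\n\nB\n\n\n\nC\n=====\nrest";
        System.out.println(collapse(text, 4));
    }
}
```

In Java, a group that did not take part in the match contributes nothing to the replacement, so unused sections are harmless. As far as I can tell, each optional section handles one run of four newlines, so runs beyond the number of sections are left as they are.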

Finally, an explanation of the first regex (the second is the same, with more repetitions). NCG means non-capturing group and CG means capturing group.

 (?:                      # Opens NCG1
     (                    # Opens CG1
         [\s\S]*?         # Character class (any of the characters within)
                            # \s (whitespace) plus \S (non-whitespace): a common idiom for any character, including newlines.
                            # * repeats zero or more times
                            # ? as few times as possible
     )                    # Closes CG1
     (?:                  # Opens NCG2
         (                # Opens CG2
             \n           # Token: \n (newline)
             \n           # Token: \n (newline)
         )                # Closes CG2
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )?                   # Closes NCG2
                            # ? repeats zero or one times
 )                        # Closes NCG1
 # begin repeat section
 (?:                      # Opens NCG3
     (                    # Opens CG3
         [\s\S]*?         # Character class (any of the characters within)
                            # \s (whitespace) plus \S (non-whitespace): a common idiom for any character, including newlines.
     )                    # Closes CG3
     (                    # Opens CG4
         \n               # Token: \n (newline)
         \n               # Token: \n (newline)
     )                    # Closes CG4
     \n                   # Token: \n (newline)
     \n                   # Token: \n (newline)
 )?                       # Closes NCG3
 # end repeat section
 (                        # Opens CG5
     [\s\S]*?             # Character class (any of the characters within)
 )                        # Closes CG5
 ={5,}                    # Literal =
                            # Repeats 5 or more times
 [\s\S]*                  # Character class (any of the characters within)
                            # * repeats zero or more times
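
The annotated layout above can also be kept in the source itself, since Java's `Pattern.COMMENTS` flag ignores whitespace and `#` comments inside a pattern. The following is a minimal sketch of that idea; the class name `VerboseRegexDemo` and the comment wording are my own, and the pattern is the first regex from this answer.

```java
import java.util.regex.Pattern;

class VerboseRegexDemo {
    // Same pattern as the compact first regex, spread over several commented lines.
    // Each line ends with a real newline ("\n" in Java), which terminates the # comment.
    static final Pattern VERBOSE = Pattern.compile(
            "(?:([\\s\\S]*?)(?:(\\n\\n)\\n\\n)?)   # CG1 content, CG2 two newlines to keep\n"
          + "(?:([\\s\\S]*?)(\\n\\n)\\n\\n)?       # optional section: CG3 content, CG4 newlines\n"
          + "([\\s\\S]*?)                          # CG5 remaining content\n"
          + "={5,}                                 # five or more equals signs\n"
          + "[\\s\\S]*                             # everything after them\n",
            Pattern.COMMENTS);

    public static void main(String[] args) {
        String text = "A\n\n\n\nB\n=====\nrest";
        System.out.println(VERBOSE.matcher(text).replaceAll("$1$2$3$4$5"));
    }
}
```

Note that `\\n` in the Java source is the regex escape for a newline and still matches in comments mode, whereas a literal newline character in the pattern would be ignored.
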
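
// Assumes YOURSTRING holds the input text and java.util.regex.PatternSyntaxException is imported.
try {
    // Remove the first run of 5 or more '=' and everything after it ((?s) lets . match newlines).
    String resultString = YOURSTRING.replaceAll("(?ism)[=]{5,}.*", "");
    // Remove whitespace-only runs between line boundaries ((?m) makes ^ and $ match per line).
    resultString = resultString.replaceAll("(?ism)^\\s+$", "");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
} catch (IllegalArgumentException ex) {
    // Syntax error in the replacement text (unescaped $ signs?)
} catch (IndexOutOfBoundsException ex) {
    // Non-existent backreference used in the replacement text
}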

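The first replacement removes the first run of 5 or more "=" and all text after it.
The second removes whitespace-only runs that end at a line boundary, which collapses consecutive blank lines into a single blank line.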
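
As a quick check of the two replacements, here is a small self-contained sketch that applies them to the sample text from the question. The class `TwoPassCleanup` and the method `clean` are illustrative names, and this only applies if the replacements can actually be run in Java, which the question rules out for its own code.

```java
class TwoPassCleanup {
    // Hypothetical wrapper around the two replaceAll calls shown above.
    static String clean(String text) {
        // Pass 1: drop the "=" line and everything after it.
        String cut = text.replaceAll("(?ism)[=]{5,}.*", "");
        // Pass 2: drop whitespace-only runs that end at a line boundary.
        return cut.replaceAll("(?ism)^\\s+$", "");
    }

    public static void main(String[] args) {
        String sample = "Hello World\n\nthis is text.\n\n\nCheers\n\n"
                + "================\n\nUnimportant text";
        System.out.println(clean(sample));
    }
}
```

On the sample, the single blank line after "Hello World" is kept, while the double blank line before "Cheers" is reduced to one.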
