
Regex to split String on pattern but with a minimum number of characters

I want to split a long text stored in a String variable according to these rules:

  1. Split on a dot (.)
  2. Each substring should have a minimum length (30, for example).

Take this example:

"The boy ate the apple. The sun is shining high in the sky. The answer to life the universe and everything is forty two, said the big computer."

Let's say the minimum length I want is 30.

The resulting splits would be:

  • "The boy ate the apple. The sun is shining high in the sky."
  • "The answer to life the universe and everything is forty two, said the big computer."

I don't want "The boy ate the apple." as a split of its own because it is shorter than 30 characters.

I thought of two ways:

  1. Loop through all the characters and append them to a StringBuilder. Whenever I reach a dot (.), I check whether the StringBuilder has reached the minimum length; if so I split there, otherwise I continue.
  2. Split on all dots (.), then loop through the pieces. If a piece is shorter than the minimum, I concatenate it with the one after it.
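For reference, the second approach could be sketched roughly as follows. The class and method names are only illustrative, and keeping a too-short trailing piece as its own chunk is an assumption, since the rules above do not say what to do with it:

```java
import java.util.ArrayList;
import java.util.List;

class MinLengthSplitter {

    // Splits after each dot, merging a piece with the following ones
    // until the accumulated chunk reaches minLength characters.
    static List<String> splitWithMinLength(String text, int minLength) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        // The lookbehind keeps each dot attached to the sentence it ends.
        for (String piece : text.split("(?<=\\.)")) {
            current.append(piece);
            String candidate = current.toString().trim();
            if (candidate.length() >= minLength) {
                result.add(candidate);
                current.setLength(0);
            }
        }

        // Assumption: a leftover shorter than minLength is kept, not dropped.
        String tail = current.toString().trim();
        if (!tail.isEmpty()) {
            result.add(tail);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, said the big computer.";
        for (String chunk : splitWithMinLength(text, 30)) {
            System.out.println(chunk);
        }
    }
}
```

With the example text and a minimum of 30, this yields the two chunks listed above.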

But I would like to know whether this can be done directly with a regex that splits and checks for a minimum number of characters before a match.

Thanks

Instead of using split, you could also match your values using a capturing group. To make the dot also match a newline you could use Pattern.DOTALL

\s*(.{30}[^.]*\.|.+$)

In Java:

String regex = "\\s*(.{30}[^.]*\\.|.+$)";

Explanation

  • \\s* Match 0+ times a whitespace character
  • ( Capturing group
    • .{30} Match any character 30 times
    • [^.]* Match 0+ times not a dot using a negated character class
    • \\. Match a literal dot
    • | Or
    • .+$ Match 1+ times any character until the end of the string.
  • ) Close capturing group
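A minimal sketch of how this pattern might be applied with Matcher.find(), collecting group 1 of each match. The class and method names are hypothetical, and the input is the example text from the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturingGroupSplit {

    // DOTALL lets '.' also match line terminators, as suggested above.
    private static final Pattern CHUNK =
            Pattern.compile("\\s*(.{30}[^.]*\\.|.+$)", Pattern.DOTALL);

    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = CHUNK.matcher(text);
        while (matcher.find()) {
            // Group 1 excludes the leading whitespace consumed by \s*.
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, said the big computer.";
        chunks(text).forEach(System.out::println);
    }
}
```

Because of the `.+$` alternative, any remaining text shorter than 30 characters at the end of the input is still returned as a final chunk.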


This should do the job:

"\W*+(.{30,}?)\W*\."

Test: https://regex101.com/r/aavcme/3

  • \\W*+ possessively matches as many non-word characters as possible, which trims the spaces between sentences
  • . matches any character (I guess you want to match any kind of character in your sentences)
  • {30,} asserts the minimum length of the match (30)
  • ? means "as few as possible"
  • \\. matches the dot separating the sentences (assuming that you always have a dot at the end of a sentence, even the last one)
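As a rough illustration, the pattern could be used from Java like this. Note that inside a Java string literal the backslashes have to be doubled; the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuantifierSplit {

    private static final Pattern SENTENCES =
            Pattern.compile("\\W*+(.{30,}?)\\W*\\.");

    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCES.matcher(text);
        while (matcher.find()) {
            // Group 1 holds the text without the closing dot;
            // matcher.group() would include it.
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, said the big computer.";
        chunks(text).forEach(System.out::println);
    }
}
```

With this pattern the captured chunks do not contain their final dot, so it would need to be appended again if the original punctuation matters.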

Instead of using the split method, try matching with the following regexp: \\S.{29,}?[.]
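A short sketch of that matching approach, with hypothetical class and method names. Here the whole match is the chunk, so no capturing group is needed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeMatchSplit {

    // One non-whitespace character, then at least 29 more characters
    // (as few as possible), then a literal dot: 30+ characters before the dot.
    private static final Pattern CHUNK = Pattern.compile("\\S.{29,}?[.]");

    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = CHUNK.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, said the big computer.";
        chunks(text).forEach(System.out::println);
    }
}
```

One thing to be aware of: text after the last dot, or a final stretch shorter than 30 characters, is not matched at all and therefore does not appear in the result.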

