I want to split a long text stored in a String variable according to the following rules:
Take this example:
"The boy ate the apple. The sun is shining high in the sky. The answer to life the universe and everything is forty two, said the big computer."
Let's say the minimum length I want is 30.
The resulting splits would be:
I don't want to take "The boy ate the apple." as a split because it's less than 30 characters.
2 ways I thought of:
But I am wondering whether this can be done directly with a regex that splits and checks for a minimum number of characters before a match.
Thanks
Instead of using split, you could match your values using a capturing group. To make the dot also match a newline, you could use Pattern.DOTALL.
\s*(.{30}[^.]*\.|.+$)
In Java:
String regex = "\\s*(.{30}[^.]*\\.|.+$)";
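As a minimal sketch of how this pattern could be applied, the class below collects capturing group 1 from each match. The class name MinLengthSplitter and the method name split are hypothetical, chosen only for illustration; Pattern.DOTALL is included on the assumption that the text may contain line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MinLengthSplitter {
    // DOTALL lets '.' also match line terminators, so a chunk may span lines.
    private static final Pattern CHUNK =
            Pattern.compile("\\s*(.{30}[^.]*\\.|.+$)", Pattern.DOTALL);

    static List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        while (m.find()) {
            // Group 1 holds the chunk without the leading whitespace.
            chunks.add(m.group(1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, "
                + "said the big computer.";
        for (String chunk : split(text)) {
            System.out.println(chunk);
        }
    }
}
```

On the example text this yields two chunks: the first two sentences joined together, then the third sentence. A trailing remainder shorter than 30 characters is picked up by the second alternative, .+$.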
Explanation
\\s* : Match 0+ times a whitespace character
( : Capturing group
.{30} : Match any character 30 times
[^.]* : Match 0+ times not a dot, using a negated character class
\\. : Match a dot literally
| : Or
.+$ : Match 1+ times any character until the end of the string
) : Close capturing group

This should do the job:
"\\W*+(.{30,}?)\\W*\\."
Test: https://regex101.com/r/aavcme/3
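A rough sketch of using this pattern from Java follows. The class name LazyMinLengthMatcher and the method name split are hypothetical. Note that, as the pattern is written, capturing group 1 stops before the closing dot, so the collected chunks do not include it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LazyMinLengthMatcher {
    private static final Pattern SENTENCES =
            Pattern.compile("\\W*+(.{30,}?)\\W*\\.");

    static List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher m = SENTENCES.matcher(text);
        while (m.find()) {
            // Group 1 excludes the leading non-word characters and the final dot.
            chunks.add(m.group(1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, "
                + "said the big computer.";
        split(text).forEach(System.out::println);
    }
}
```

If the trailing dot is wanted in the output, one option is to append "." to each chunk, or to move the closing parenthesis of the group after the final dot.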
\\W*+ : takes as many non-word characters as possible, to trim the spaces between sentences
. : matches any character (I guess you want to match any kind of character in your sentences)
{30,} : asserts the minimum length of the match (30)
? : means "as few as possible"
\\. : matches the dot separating the sentences (assuming that you always have a dot at the end of a sentence, even the last one)

Instead of using the split method, try matching with the following regexp:
\\S.{29,}?[.]
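The last pattern can also be built for an arbitrary minimum length. The sketch below assumes a hypothetical class ConfigurableMinLength with a split(text, minLength) method; the minimum counts the characters before the closing dot, as in the \\S.{29,}?[.] pattern for a minimum of 30.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class ConfigurableMinLength {
    static List<String> split(String text, int minLength) {
        if (minLength < 1) {
            throw new IllegalArgumentException("minLength must be at least 1");
        }
        // One non-whitespace character, then at least (minLength - 1) more
        // characters, as few as possible, up to the next dot.
        Pattern p = Pattern.compile("\\S.{" + (minLength - 1) + ",}?[.]");
        List<String> chunks = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            chunks.add(m.group());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, "
                + "said the big computer.";
        split(text, 30).forEach(System.out::println);
    }
}
```

With a minimum of 30 the example text produces two chunks; with a smaller minimum such as 10, each of the three sentences becomes its own chunk. Text after the last dot is not matched by this pattern.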