
Separating text by paragraph using regex

I'm trying to get starting and ending index positions of paragraphs in an assortment of text. I'm using the Pattern and Matcher classes and am having some issues understanding how to set my pattern up. Currently I'm using

Pattern p = Pattern.compile(".+", Pattern.MULTILINE);

to separate the paragraphs. This works; however, the newline character gets stripped out. Is there a way to keep the newline character? I would like...

"This is paragraph1\nThis is paragraph2\nThis is paragraph3\n"

to separate to something like this...

"This is paragraph1\n"
"This is paragraph2\n"
"This is paragraph3\n"

As I said before, right now the newlines get stripped, which means my indices for every paragraph after the first are off. I think Pattern.MULTILINE is stripping out the newline, since it accepts everything before it, so I think I would need to change that and update my regex.

Thoughts?

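Pattern.MULTILINE is not what drops the newline: that flag only changes how ^ and $ behave. By default, . does not match line terminators, so .+ stops just before each one. You just need to match the line break (with the \\R construct, available since Java 8) after one or more characters other than line break characters: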

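import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "This is paragraph1\r\nThis is paragraph2\nThis is paragraph3\n";
List<String> items = new ArrayList<>();
// .+ matches 1+ chars other than line terminators; \\R then consumes the line break itself
Matcher m = Pattern.compile(".+\\R").matcher(s);
while (m.find()) {
    items.add(m.group()); // group() includes the trailing line break
}
System.out.println(items);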

See the Java demo

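Output (line breaks written as escapes for readability; each item keeps whatever line break followed it in the input):

["This is paragraph1\r\n", "This is paragraph2\n", "This is paragraph3\n"]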

If the line break is optional (for example, when the last paragraph has no trailing newline), add the ? quantifier after \\R : ".+\\R?"
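
Since the question asks for start and end index positions rather than the matched text, here is a minimal sketch of how the same pattern could be used with Matcher.start() and Matcher.end(). The class name ParagraphSpans and the helper paragraphSpans are hypothetical names chosen for illustration, and the sketch assumes Java 8 or later (for \\R) and that every non-empty line counts as one paragraph:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphSpans {
    // Hypothetical helper: returns {start, end} pairs for each paragraph,
    // where end is exclusive and the span includes the trailing line break, if any.
    static List<int[]> paragraphSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        // \\R? makes the line break optional so a final paragraph
        // without a trailing newline is still matched.
        Matcher m = Pattern.compile(".+\\R?").matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String s = "This is paragraph1\r\nThis is paragraph2\nThis is paragraph3";
        for (int[] span : paragraphSpans(s)) {
            // s.substring(span[0], span[1]) would give back the paragraph text
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

Because each match consumes its own line break, the end of one span equals the start of the next, so the indices stay aligned with the original string. Empty lines are skipped, since .+ requires at least one character.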
