How to split a string, including punctuation marks?

I need to split a string (in Java) with punctuation marks being stored in the same array as words:

String sentence = "In the preceding examples, classes derived from...";
String[] split = sentence.split(" ");

I need the split array to be:

split[0] - "In"
split[1] - "the"
split[2] - "preceding"
split[3] - "examples"
split[4] - ","
split[5] - "classes"
split[6] - "derived"
split[7] - "from"
split[8] - "..."

Is there any elegant solution?

You need lookarounds:

String[] split = sentence.split(" ?(?<!\\G)((?<=[^\\p{Punct}])(?=\\p{Punct})|\\b) ?");

Lookarounds assert that something precedes or follows the current position, but (importantly here) they don't consume any input when matching.
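To make "don't consume" concrete, here is a minimal sketch comparing a split on a literal comma with a split on a lookahead for a comma. The class name LookaroundDemo and the sample string are made up for illustration and are not part of the answer above.

```java
import java.util.Arrays;
import java.util.List;

final class LookaroundDemo {

    // Splits on the comma itself: the comma is consumed and disappears.
    static List<String> consuming(String s) {
        return Arrays.asList(s.split(","));
    }

    // Splits at the empty position before each comma: nothing is consumed,
    // so the comma stays at the start of the following piece.
    static List<String> lookahead(String s) {
        return Arrays.asList(s.split("(?=,)"));
    }

    public static void main(String[] args) {
        System.out.println(consuming("a,b")); // [a, b]
        System.out.println(lookahead("a,b")); // [a, ,b]
    }
}
```

The second call keeps the comma because the match is zero-width: the split happens at a position, not on a character.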


Some test code:

String sentence = "Foo bar, baz! Who? Me...";
String[] split = sentence.split(" ?(?<!\\G)((?<=[^\\p{Punct}])(?=\\p{Punct})|\\b) ?");
Arrays.stream(split).forEach(System.out::println);

Output:

Foo
bar
,
baz
!
Who
?
Me
...

You could first replace the triple dots with the ellipsis character:

    String sentence = "In the preceding examples, classes derived from...";
    String[] split = sentence.replace("...", "…").split(" +|(?=,|\\p{Punct}|…)");

Afterwards you can leave it as it is, or convert it back by calling replace("…", "...") on each element of the array.
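A minimal sketch of the whole round trip, assuming Java 8 or later. The class name EllipsisSplit is hypothetical; the split regex is the one from this answer, with the ellipsis written as the escape \u2026 so the source does not depend on file encoding. One caveat: an ellipsis character already present in the input would also come back as three dots.

```java
import java.util.ArrayList;
import java.util.List;

final class EllipsisSplit {

    static List<String> split(String sentence) {
        // Collapse "..." into the single ellipsis character (U+2026) so
        // the split cannot land between the dots.
        String[] parts = sentence.replace("...", "\u2026")
                .split(" +|(?=,|\\p{Punct}|\u2026)");
        List<String> result = new ArrayList<>();
        for (String part : parts) {
            // Restore the three-dot form in each element.
            result.add(part.replace("\u2026", "..."));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [In, the, preceding, examples, ,, classes, derived, from, ...]
        System.out.println(split("In the preceding examples, classes derived from..."));
    }
}
```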

I believe this method will do what you want:

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile("(\\w+)|(\\.{3})|[^\\s]");
    Matcher matcher = pattern.matcher(str);
    List<String> list = new ArrayList<String>();
    while (matcher.find()) {
        list.add(matcher.group());
    }
    return list;
}

It will split a string into

  1. Consecutive word characters
  2. Ellipsis ...
  3. Any other single non-whitespace character

For this example

"In the preceding examples, classes.. derived from... Hello, World! foo!bar"

The list will be

[0] In
[1] the
[2] preceding
[3] examples
[4] ,
[5] classes
[6] .
[7] .
[8] derived
[9] from
[10] ...
[11] Hello
[12] ,
[13] World
[14] !
[15] foo
[16] !
[17] bar

The easiest and probably cleanest way to achieve what you want is to focus on finding the data you want in the array, rather than on finding the places to split your text.

I am saying this because split introduces a lot of problems, for instance:

  • split(" +|(?=\\p{Punct})") splits only on spaces and before a punctuation character, which means that text like "abc" def is split into the elements ["abc, ", def]. So, as you see, it doesn't split after the " in "abc .

  • The previous problem is easily solved by adding another condition, |(?<=\\p{Punct}), as in split(" +|(?=\\p{Punct})|(?<=\\p{Punct})"), but that still doesn't solve all of your problems because of ... . We need to figure out a way to prevent splitting between these dots: .|.|.

    • We could try excluding . from \\p{Punct} and handling it separately, but this would make the regex quite complex.
    • Another way would be to replace ... with some unique string, include that string in the split logic, and finally replace it back with ... in the result array. But this approach requires knowing a string that can never occur in your text, so we would need to generate one each time we parse a text.
  • Another possible problem is that the pre-Java-8 regex engine generates an empty element at the start of the result array if a punctuation character such as " comes first. So in Java 7, splitting the string "foo" bar on (?=\\p{Punct}) results in the elements [ , "foo, " bar]. To avoid this you would need to add something like (?!^) to the regex to prevent splitting at the start of the string.

Anyway, these solutions look overly complex.
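The first two problems can be reproduced with a short sketch, assuming Java 8 or later. The class and method names here are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class SplitPitfalls {

    // Splits on spaces and before punctuation only.
    static List<String> beforeOnly(String s) {
        return Arrays.asList(s.split(" +|(?=\\p{Punct})"));
    }

    // Also splits after punctuation, which breaks "..." into single dots.
    static List<String> beforeAndAfter(String s) {
        return Arrays.asList(s.split(" +|(?=\\p{Punct})|(?<=\\p{Punct})"));
    }

    public static void main(String[] args) {
        System.out.println(beforeOnly("\"abc\" def"));     // ["abc, ", def]
        System.out.println(beforeAndAfter("\"abc\" def")); // [", abc, ", def]
        System.out.println(beforeAndAfter("from..."));     // [from, ., ., .]
    }
}
```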


So instead of the split method, consider using the find method of the Matcher class and focus on what you want to have in the result array.

Try using a pattern like this one: [.]{3}|\\p{Punct}|[\\S&&\\P{Punct}]+

  • [.]{3} will match ...
  • \\p{Punct} will match a single punctuation character, which according to the documentation is one of

    ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

  • [\\S&&\\P{Punct}]+ will match one or more characters which are
    • \\S not whitespaces
    • && and
    • \\P{Punct} not punctuation characters ( \\P{foo} is negation of \\p{foo} ).

Demo:

String sentence = "In (the) preceding examples, classes derived from...";
Pattern p = Pattern.compile("[.]{3}|\\p{Punct}|[\\S&&\\P{Punct}]+");
Matcher m = p.matcher(sentence);
while(m.find()){
    System.out.println(m.group());
}

Output:

In
(
the
)
preceding
examples
,
classes
derived
from
...

You could sanitize the string by replacing, say, "," with " ,", and so on for every punctuation mark you care to distinguish.

In the particular case of "..." you can do:

// there can be a series of dots: put one space in front of each run
// ($0 in the replacement stands for the whole matched run)
sentence.replaceAll("\\.+", " $0")

Then you split.
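A minimal sketch of the sanitize-then-split idea, generalized to every \p{Punct} character. The class name PadAndSplit is hypothetical, and padding on both sides (rather than only in front) is an assumption made here so that marks directly followed by a word are separated too.

```java
import java.util.Arrays;
import java.util.List;

final class PadAndSplit {

    static List<String> split(String sentence) {
        // Surround each "..." or single punctuation character with spaces.
        // "$0" in the replacement stands for the whole match.
        String padded = sentence.replaceAll("\\.{3}|\\p{Punct}", " $0 ");
        // Splitting on runs of whitespace then yields words and marks.
        return Arrays.asList(padded.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Prints [In, the, preceding, examples, ,, classes, derived, from, ...]
        System.out.println(split("In the preceding examples, classes derived from..."));
    }
}
```

Because "\\.{3}" comes first in the alternation, three dots are padded as one unit before the single-character rule gets a chance to match them.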

EDIT: replaced single quotes with double quotes.

For your particular case the two main challenges are the ordering (e.g. punctuation first and then the word, or the other way around) and the ... punctuation.

The rest you can easily implement using

\p{Punct}

like this:

Pattern.compile("\\p{Punct}");

Regarding the two mentioned challenges:

1. Ordering: You can try the following:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static final Pattern punctuation = Pattern.compile("\\p{Punct}");
    private static final Pattern word = Pattern.compile("\\w+");

    public static void main(String[] args) {
        String sentence = "In the preceding examples, classes derived from...";
        String[] split = sentence.split(" ");
        List<String> result = new LinkedList<>();

        for (String s : split) {
            List<String> withMarks = splitWithPunctuationMarks(s);
            result.addAll(withMarks);
        }
        System.out.println(result);
    }

    private static List<String> splitWithPunctuationMarks(String s) {
        // TreeMap keeps its entries sorted by key, i.e. by position in s.
        Map<Integer, String> positionToString = new TreeMap<>();
        Matcher punctMatcher = punctuation.matcher(s);
        while (punctMatcher.find()) {
            positionToString.put(punctMatcher.start(), punctMatcher.group());
        }
        // Same as before, but for words.
        Matcher wordMatcher = word.matcher(s);
        while (wordMatcher.find()) {
            positionToString.put(wordMatcher.start(), wordMatcher.group());
        }
        // positionToString.values() now contains the
        // ordered words and punctuation characters.
        return new ArrayList<>(positionToString.values());
    }
}

2. The ... punctuation: You can try to look back for a previous occurrence of the . character at (currentIndex - 1) every time you find one.
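The look-back idea could be sketched as follows. This is only one possible reading of it: the class name DotRunTokenizer is made up, words are taken to be runs of letters and digits, and any run of dots (not only exactly three) is merged into one element.

```java
import java.util.ArrayList;
import java.util.List;

final class DotRunTokenizer {

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // Any other character ends the current word.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (Character.isWhitespace(c)) {
                continue;
            }
            if (c == '.' && i > 0 && s.charAt(i - 1) == '.') {
                // Look back: the previous character was also a dot, so
                // extend the last element instead of starting a new one.
                int last = tokens.size() - 1;
                tokens.set(last, tokens.get(last) + ".");
            } else {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [In, the, preceding, examples, ,, classes, derived, from, ...]
        System.out.println(tokenize("In the preceding examples, classes derived from..."));
    }
}
```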

Another example. This solution probably works for all combinations.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class App {

    public static void main(String[] args) {    
        String sentence = "In the preceding examples, classes derived from...";
        List<String> list = splitWithPunctuation(sentence);
        // Print one element per line.
        list.forEach(System.out::println);
    }

    public static List<String> splitWithPunctuation(String sentence) {
        Pattern p = Pattern.compile("([^a-zA-Z\\d\\s]+)");
        String[] split = sentence.split(" ");
        List<String> list = new ArrayList<>();

        for (String s : split) {
            Matcher matcher = p.matcher(s);
            boolean found = false;
            int i = 0;
            while (matcher.find()) {
                found = true;
                // Skip the empty prefix when the token starts with punctuation.
                if (matcher.start() > i) {
                    list.add(s.substring(i, matcher.start()));
                }
                list.add(s.substring(matcher.start(), matcher.end()));
                i = matcher.end();
            }

            if (found) {
                if (i < s.length())
                    list.add(s.substring(i, s.length()));
            } else
                list.add(s);
        }

        return list;
    }
}

Output:

In
the
preceding
examples
,
classes
derived
from
...

A more complex example:

String sentence = "In the preced^^^in## examp!les, classes derived from...";
List<String> list = splitWithPunctuation(sentence);
list.forEach(System.out::println);

Output:

In
the
preced
^^^
in
##
examp
!
les
,
classes
derived
from
...
