
How to split a sentence into words and punctuation in Java

I want to split a given sentence of type string into words and I also want punctuation to be added to the list.

For example, if the sentence is: "Sara's dog 'bit' the neighbor."
I want the output to be: [Sara's, dog, ', bit, ', the, neighbor, .]

With string.split(" ") I can split the sentence in words by space, but I want the punctuation also to be in the result list.

    String text = "Sara's dog 'bit' the neighbor.";
    String[] list = text.split(" ");

The printed result is [Sara's, dog, 'bit', the, neighbor.]

I don't know how to combine another regex with the above split method so that punctuation is separated as well.

Some references I have already tried, which did not work:

1. Splitting strings through regular expressions by punctuation and whitespace etc in java

2. How to split sentence to words and punctuation using split or matcher?

Example input and outputs

String input1="Holy cow! screamed Jane."

String[] output1 = [Holy,cow,!,screamed,Jane,.] 

String input2="Select your 'pizza' topping {pepper and tomato} follow me."

String[] output2 = [Select,your,',pizza,',topping,{,pepper,and,tomato,},follow,me,.]

Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.

Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:

String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);

In Java 8 or earlier, you would write it like this:

List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    parts.add(m.group());
}
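If the tokenizer is called more than once, it may be worth compiling the pattern a single time. The following is a minimal Java 8-compatible sketch of that idea. The class name WordTokenizer and the method name tokenize are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class WordTokenizer {
    // Same regex as above: a "word" (optionally with single embedded
    // punctuation characters), or one standalone punctuation/symbol character.
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]");

    static List<String> tokenize(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        // Collect every match in order; unmatched characters (e.g. spaces) are skipped.
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Holy cow! screamed Jane."));
    }
}
```

A compiled Pattern is immutable and safe to share between threads, which is why it can live in a static final field. A new Matcher is created per call.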

Explanation

\\p{L} is Unicode letters, \\p{N} is Unicode numbers, and \\p{M} is Unicode marks (e.g. accents). Combined, they are treated here as the characters of a "word".

\\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside it. The pattern before | matches a "word", given that definition.

\\p{S} is Unicode symbols. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.

That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
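To make these rules concrete, here is a small sketch that runs the same regex on a few inputs that are not in the question: a currency symbol, a decimal number, a hyphenated word, a run of dots, and a letter followed by a combining accent. The class name UnicodeCategoryDemo and the sample sentences are assumptions chosen for illustration. It uses Matcher.results(), so it needs Java 9+.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; names and sample inputs are illustrative only.
final class UnicodeCategoryDemo {
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]");

    static List<String> tokenize(String s) {
        return TOKEN.matcher(s).results()   // Matcher.results() needs Java 9+
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // '$' is a symbol (S) and becomes its own token; the '.' in 3.14 and the
        // '-' in e-mail are single punctuation characters embedded in a "word".
        // prints [It, costs, $, 3.14, ,, e-mail, me, !]
        System.out.println(tokenize("It costs $3.14, e-mail me!"));

        // Only ONE punctuation character may sit between word parts,
        // so a run of dots is emitted dot by dot.
        // prints [wait, ., ., ., what]
        System.out.println(tokenize("wait...what"));

        // A combining acute accent is a mark (M) and stays attached to the word,
        // so this yields three tokens.
        System.out.println(tokenize("cafe\u0301 au lait").size());
    }
}
```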

Test

import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("Sara's dog 'bit' the neighbor.");
        test("Holy cow! screamed Jane.");
        test("Select your 'pizza' topping {pepper and tomato} follow me.");
    }
    private static void test(String s) {
        String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
        String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
        System.out.println(Arrays.toString(parts));
    }
}

Output

[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
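[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]

An alternative is to split on zero-width lookarounds, so that whitespace and punctuation become separate items, and then filter out the blank ones:

Arrays.stream(s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))"))
    .filter(ss -> !ss.trim().isEmpty())
    .collect(Collectors.toList())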

Reference:

How to split a string, but also keep the delimiters?

Regular Expressions on Punctuation
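To see how the lookaround split behaves on the question's inputs, here is a small self-contained sketch. The class name LookaroundSplitDemo and the method name splitKeepingPunctuation are made up for illustration. One thing to be aware of: \p{Punct} includes the apostrophe, so "Sara's" is broken into three tokens, which differs from the output requested in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative only.
final class LookaroundSplitDemo {
    static List<String> splitKeepingPunctuation(String s) {
        // Split at every position that is preceded or followed by
        // whitespace or an ASCII punctuation character (\p{Punct}).
        return Arrays.stream(s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))"))
                .filter(ss -> !ss.trim().isEmpty()) // drop the whitespace-only pieces
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Holy, cow, !, screamed, Jane, .]
        System.out.println(splitKeepingPunctuation("Holy cow! screamed Jane."));
        // prints [Sara, ', s, dog, ', bit, ', the, neighbor, .]
        System.out.println(splitKeepingPunctuation("Sara's dog 'bit' the neighbor."));
    }
}
```

If embedded apostrophes must stay inside words, the capture-based regex from the first answer is the better fit.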

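import java.util.ArrayList;
import java.util.List;

List<String> chars = new ArrayList<>();
String str = "Hello my name is bob";
String tempStr = "";
for (char cha : str.toCharArray()) {
  if (cha == ' ') {
    // A space ends the current word.
    if (!tempStr.isEmpty()) {
      chars.add(tempStr);
    }
    tempStr = "";
  }
  // Add whatever punctuation characters you want to keep here.
  else if (cha == '!' || cha == '.') {
    // Punctuation also ends the current word and becomes its own item.
    if (!tempStr.isEmpty()) {
      chars.add(tempStr);
    }
    tempStr = "";
    chars.add(String.valueOf(cha));
  }
  else {
    tempStr = tempStr + cha;
  }
}
// Add the last word, if there is one.
if (!tempStr.isEmpty()) {
  chars.add(tempStr);
}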

Something like that? It should add every single word, assuming the words in the sentence are separated by spaces. For punctuation such as ! and . you have to add a check as well. Quite simple.
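A more general variant of the loop above, as a sketch only: instead of listing punctuation characters by hand, it treats whitespace as a separator, letters and digits as word characters, and every other character as a single-character token. The class name CharScanTokenizer and the method name tokenize are hypothetical. Unlike the regex answer, this version splits "Sara's" at the apostrophe, and it works on char values, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
final class CharScanTokenizer {
    static List<String> tokenize(String str) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : str.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // Any non-word character ends the current word.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            // Whitespace is dropped; anything else is kept as its own token.
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        // Flush the last word, if the input did not end with punctuation.
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Holy cow! screamed Jane."));
    }
}
```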
