How to split a sentence into words and punctuation in Java

I want to split a given sentence (a String) into words, and I also want the punctuation marks to be added to the list.

For example, if the sentence is: "Sara's dog 'bit' the neighbor."
I want the output to be: [Sara's, dog, ', bit, ', the, neighbor, .]

With string.split(" ") I can split the sentence into words by space, but I want the punctuation to be in the result list as well.

    String text = "Sara's dog 'bit' the neighbor.";
    String[] list = text.split(" ");

The printed result is [Sara's, dog, 'bit', the, neighbor.]. I don't know how to combine another regex with the split above so that punctuation is separated as well.

Some references I have already tried, without success:

1. Splitting strings through regular expressions by punctuation and whitespace etc in java

2. How to split sentence to words and punctuation using split or matcher?

Example inputs and outputs

String input1="Holy cow! screamed Jane."

String[] output1 = [Holy,cow,!,screamed,Jane,.] 

String input2="Select your 'pizza' topping {pepper and tomato} follow me."

String[] output2 = [Select,your,',pizza,',topping,{,pepper,and,tomato,},follow,me,.]

Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern for the elements to capture.

Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:

String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);

In Java 8 or earlier, you would write it like this:

List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    parts.add(m.group());
}

Explanation

\p{L} is Unicode letters, \p{N} is Unicode numbers, and \p{M} is Unicode marks (e.g. accents). Combined, they are treated here as the characters of a "word". (In the Java string literal each backslash is doubled, so \p{L} is written as \\p{L}.)

\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside the word. The pattern before the | matches a "word", given that definition.

\p{S} is Unicode symbols. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.

That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
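To make the rules above concrete, here is a small sketch that runs the same regex over a few edge cases. The class name WordPatternDemo, the helper tokenize, and the sample strings are made up for this illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordPatternDemo {
    // Same regex as in the answer: a "word" with optional single embedded
    // punctuation characters, or one standalone punctuation/symbol character.
    private static final Pattern TOKEN = Pattern.compile(
            "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]");

    // Hypothetical helper: collects every match in order (also works on Java 8).
    static List<String> tokenize(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // A single embedded punctuation character stays inside the word; the
        // comma does not, because no letter, mark, or number follows it.
        System.out.println(tokenize("It's 3.14, right?")); // [It's, 3.14, ,, right, ?]
        // Two punctuation characters in a row are never "embedded".
        System.out.println(tokenize("a--b"));              // [a, -, -, b]
        // '$' is a symbol (\p{S}), so it is matched on its own.
        System.out.println(tokenize("costs $5"));          // [costs, $, 5]
    }
}
```

Note that under this definition hyphenated words and decimal numbers stay in one piece, which may or may not be what you want.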

Test

import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("Sara's dog 'bit' the neighbor.");
        test("Holy cow! screamed Jane.");
        test("Select your 'pizza' topping {pepper and tomato} follow me.");
    }
    private static void test(String s) {
        String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
        String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
        System.out.println(Arrays.toString(parts));
    }
}

Output

[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]
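
Alternatively, split with lookarounds so that the delimiters are kept, then drop the whitespace-only pieces:

List<String> parts = Arrays.stream(s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))"))
        .filter(ss -> !ss.trim().isEmpty())
        .collect(Collectors.toList());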

Reference:

How to split a string, but also keep the delimiters?

Regular Expressions on Punctuation
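As a rough check of the lookaround split above, the sketch below wraps it in a method and runs it on two of the question's inputs. The class name LookaroundSplitDemo and the method tokenize are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LookaroundSplitDemo {
    // Hypothetical wrapper around the split-and-filter expression above.
    static List<String> tokenize(String s) {
        // Split at every position directly before or after a whitespace or
        // ASCII punctuation character, then drop the whitespace-only pieces.
        return Arrays.stream(s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))"))
                .filter(ss -> !ss.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Holy cow! screamed Jane."));
        // [Holy, cow, !, screamed, Jane, .]
        System.out.println(tokenize("Sara's dog 'bit' the neighbor."));
        // [Sara, ', s, dog, ', bit, ', the, neighbor, .]
    }
}
```

Because every punctuation character becomes its own token, the apostrophe in "Sara's" is split off too, so this variant does not reproduce the exact output requested in the question for that input.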

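import java.util.ArrayList;
import java.util.List;

List<String> chars = new ArrayList<>();
String str = "Hello my name is bob";
StringBuilder tempStr = new StringBuilder();
for (char cha : str.toCharArray()) {
  // List whatever punctuation characters you want to keep here.
  boolean isPunctuation = cha == '!' || cha == '.';
  if (cha == ' ' || isPunctuation) {
    // A space or punctuation mark ends the current word.
    if (tempStr.length() > 0) {
      chars.add(tempStr.toString());
      tempStr.setLength(0);
    }
    if (isPunctuation) {
      chars.add(String.valueOf(cha));
    }
  } else {
    tempStr.append(cha);
  }
}
// Add the last word if the string does not end with a space or punctuation.
if (tempStr.length() > 0) {
  chars.add(tempStr.toString());
}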

Something like that? It adds every single word, assuming the words in the sentence are separated by spaces. For ! and . you have to add a check as well. Quite simple.
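The loop above can be sketched as a reusable method that takes the punctuation characters as a parameter instead of hard-coding them. The class name LoopTokenizerDemo, the method tokenize, and its punctuation parameter are assumptions made for this example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

public class LoopTokenizerDemo {
    // Hypothetical helper: 'punctuation' lists the characters to emit as
    // separate tokens; spaces only separate words and are dropped.
    static List<String> tokenize(String str, String punctuation) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char ch : str.toCharArray()) {
            boolean isPunctuation = punctuation.indexOf(ch) >= 0;
            if (ch == ' ' || isPunctuation) {
                if (word.length() > 0) {      // flush the word collected so far
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (isPunctuation) {
                    tokens.add(String.valueOf(ch));
                }
            } else {
                word.append(ch);
            }
        }
        if (word.length() > 0) {              // last word when no delimiter follows
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Holy cow! screamed Jane.", "!."));
        // [Holy, cow, !, screamed, Jane, .]
        System.out.println(tokenize("Sara's dog 'bit' the neighbor.", "."));
        // [Sara's, dog, 'bit', the, neighbor, .]
    }
}
```

A plain character check like this cannot tell the apostrophe in "Sara's" from the quotes around 'bit': adding ' to the punctuation set splits both, and leaving it out keeps both attached.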
