Splitting a string on punctuation and whitespace

Question

I have some strings, for example: I: am a string, with "punctuation". I want to split the string like this:

["I", ":", "am", "a", "string", ",", "with", "\"", "punctuation", "\"", "."]

I tried text.split("[\\p{Punct}\\s]+") but the result is I, am, a, string, with, punctuation ...

I found this solution, but Java doesn't allow me to split by \\w.

Answer 1

Use this regex:

"\\s+|(?=\\p{Punct})|(?<=\\p{Punct})"

The result on your string:

["I", ":", "am", "a", "string", ",", "with", "", "\"", "punctuation", "\"", "."]

Unfortunately, there is an extra element, the "" after with. These extra elements occur only (and always) when a punctuation character follows a whitespace character, so this can be fixed by calling myString.replaceAll("\\s+(?=\\p{Punct})", "").split(regex) instead of myString.split(regex) (i.e. strip out that whitespace before splitting).

How this works:

\\s+ splits on a run of whitespace: the whitespace characters are removed and the string is split at that location. (Note: I am assuming that a string such as "hello  world", with more than one space between the words, should result in ["hello", "world"] rather than ["hello", "", "world"].)
(?=\\p{Punct}) is a lookahead that splits if the next character is a punctuation character, but it doesn't remove the character.
(?<=\\p{Punct}) is a lookbehind that splits if the previous character is a punctuation character; it doesn't remove the character either.
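To make the first approach concrete, here is a minimal sketch that combines the replaceAll step with the split. The class name SplitKeepPunct and the method name tokenize are made up for illustration; only the two regexes come from the answer.

```java
import java.util.Arrays;

class SplitKeepPunct {
    // Regex from the answer: split on whitespace runs, and at the
    // zero-width positions just before and just after a punctuation character.
    static final String REGEX = "\\s+|(?=\\p{Punct})|(?<=\\p{Punct})";

    // Hypothetical helper name, not part of the original answer.
    static String[] tokenize(String text) {
        // Drop whitespace that directly precedes punctuation first,
        // so the split does not produce an empty token there.
        return text.replaceAll("\\s+(?=\\p{Punct})", "").split(REGEX);
    }

    public static void main(String[] args) {
        String s = "I: am a string, with \"punctuation\".";
        System.out.println(Arrays.toString(tokenize(s)));
    }
}
```

Running main should print [I, :, am, a, string, ,, with, ", punctuation, ", .], i.e. the eleven tokens asked for in the question.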

EDIT:

In response to your comment, this regex should allow punctuation within words:

"\\s+|(?=\\W\\p{Punct}|\\p{Punct}\\W)|(?<=\\W\\p{Punct}|\\p{Punct}\\W)"

For this one, you don't need to use replaceAll; simply call myString.split(regex).

How it works:

This regex is very similar, but the lookarounds have changed. \\W\\p{Punct} matches a non-word character followed by a punctuation character. \\p{Punct}\\W matches a punctuation character followed by a non-word character. So each lookaround matches if and only if there is a punctuation character that is not in the middle of a word.
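A small sketch of the within-word case, assuming the goal is that something like don't stays in one piece. The class name SplitAllowInnerPunct and the method tokenize are hypothetical. The sketch also filters out empty tokens as a precaution, which is an addition that is not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAllowInnerPunct {
    // Second regex from the answer: the lookarounds only fire when a
    // punctuation character has a non-word character next to it.
    static final String REGEX =
        "\\s+|(?=\\W\\p{Punct}|\\p{Punct}\\W)|(?<=\\W\\p{Punct}|\\p{Punct}\\W)";

    // Hypothetical helper name, not part of the original answer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split(REGEX)) {
            if (!t.isEmpty()) {   // precaution: skip empty pieces
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The apostrophe sits between two word characters, so no split happens there.
        System.out.println(tokenize("don't stop"));
    }
}
```

For the input don't stop this should give the two tokens don't and stop. Behaviour on other inputs, for example punctuation at the very start or end of the string, is worth checking before relying on it.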

Answer 2

Or try this (you can collect the matches in an ArrayList instead of printing them):

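    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "I: am a string, with \"punctuation\".";
    // A token is either a run of word characters or a single non-whitespace character.
    Pattern pat = Pattern.compile("\\w+|\\S");

    Matcher mat = pat.matcher(s);
    // Print each token followed by "/".
    while (mat.find()) {
        System.out.print(mat.group() + "/");
    }
    System.out.println();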

Output:

 I/:/am/a/string/,/with/"/punctuation/"/./
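Since the answer mentions collecting into an ArrayList but the snippet only prints, here is one way that could look. This is a sketch; the class name TokenCollector and the method tokenize are made up for illustration, and the pattern is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCollector {
    // Same pattern as in the answer: a run of word characters,
    // or any single non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\S");

    // Hypothetical helper name, not part of the original answer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());   // each match is one token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I: am a string, with \"punctuation\"."));
    }
}
```

Matching tokens rather than splitting on separators means there are no empty strings to clean up, because whitespace is simply never matched.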

2 answers

Answer 1
7 (accepted) 2014-06-14 18:12:34

Answer 2
0 2014-06-14 18:16:12
