Split a string and separate by punctuation and whitespace
I have some strings, for example:
I: am a string, with "punctuation".
I want to split the string like:
["I", ":", "am", "a", "string", ",", "with", "\"", "punctuation", "\"", "."]
I tried text.split("[\\p{Punct}\\s]+"), but the result is I, am, a, string, with, punctuation ...
I found this solution, but Java doesn't allow me to split by \w.
Use this regex:
"\\s+|(?=\\p{Punct})|(?<=\\p{Punct})"
The result on your string:
["I", ":", "am", "a", "string", ",", "with", "", "\"", "punctuation", "\"", "."]
Unfortunately, there is an extra element, the "" after the with. These extra elements only occur (and always occur) when there is a punctuation character after a whitespace character, so this can be fixed by doing
myString.replaceAll("\\s+(?=\\p{Punct})", "").split(regex);
instead of
myString.split(regex);
(i.e. strip out the whitespace that precedes punctuation before splitting).
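As a minimal sketch of that fix, the two steps can be wrapped in one helper. The class name PunctuationSplit and the method name tokenize are made up for illustration and are not part of the answer:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
final class PunctuationSplit {
    // The regex from the answer: split on whitespace runs,
    // before a punctuation character, and after a punctuation character.
    private static final String REGEX = "\\s+|(?=\\p{Punct})|(?<=\\p{Punct})";

    static List<String> tokenize(String text) {
        // Drop whitespace that sits directly in front of punctuation,
        // so the split does not produce an empty token at that spot.
        String stripped = text.replaceAll("\\s+(?=\\p{Punct})", "");
        return Arrays.asList(stripped.split(REGEX));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I: am a string, with \"punctuation\"."));
        // prints [I, :, am, a, string, ,, with, ", punctuation, ", .]
    }
}
```

Note that String.split already discards trailing empty strings, which is why the zero-width match after the final period adds nothing to the result.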
How this works:
\\s+ splits on a group of whitespace, so if the characters are whitespace characters, we will remove those characters and split at that location. (Note: I am assuming that a string of hello  world, with more than one space between the words, should result in ["hello", "world"] rather than ["hello", "", "world"].)
(?=\\p{Punct}) is a lookahead that splits if the next character is a punctuation character, but it doesn't remove the character.
(?<=\\p{Punct}) is a lookbehind that splits if the last character is a punctuation character.
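To see each of the three alternatives in isolation, here is a small sketch that applies them one at a time. The class name SplitPieces, the helper parts, and the sample inputs are assumptions chosen for illustration:

```java
import java.util.Arrays;

// Hypothetical demo class; the sample inputs are made up for illustration.
final class SplitPieces {
    // Splits the input on the given regex and formats the parts as text.
    static String parts(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // \s+ consumes the whole run of whitespace, so no empty token appears.
        System.out.println(parts("hello  world", "\\s+"));      // [hello, world]
        // \s splits at every single whitespace character instead.
        System.out.println(parts("hello  world", "\\s"));       // [hello, , world]
        // The lookahead splits before the comma and keeps the comma.
        System.out.println(parts("a,b", "(?=\\p{Punct})"));     // [a, ,b]
        // The lookbehind splits after the comma and keeps the comma.
        System.out.println(parts("a,b", "(?<=\\p{Punct})"));    // [a,, b]
    }
}
```

Combining the lookahead and the lookbehind is what isolates each punctuation character as its own token.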
EDIT:
In response to your comment, this regex should allow punctuation within words:
"\\s+|(?=\\W\\p{Punct}|\\p{Punct}\\W)|(?<=\\W\\p{Punct}|\\p{Punct}\\W})"
For this one, you don't need to use the replaceAll; simply do myString.split(regex).
How it works:
This regex is very similar, but the lookarounds changed.
\\W\\p{Punct} matches a non-word character followed by a punctuation character.
\\p{Punct}\\W matches a punctuation character followed by a non-word character.
So each lookaround matches iff there is a punctuation character which is not in the middle of a word.
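Or try this, collect in an ArrayList:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match either a run of word characters or a single non-whitespace character.
String s = "I: am a string, with \"punctuation\".";
Pattern pat = Pattern.compile( "\\w+|\\S" );
Matcher mat = pat.matcher( s );
while( mat.find() ){
    // Each match is one token; print it followed by a slash.
    System.out.print( mat.group() + "/" );
}
System.out.println();
Output: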
I/:/am/a/string/,/with/"/punctuation/"/./
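The snippet above prints the tokens rather than storing them. A minimal sketch of the ArrayList variant the answer alludes to could look like this, where the class name MatchTokens and the method name tokenize are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the pattern comes from the answer.
final class MatchTokens {
    // A token is a run of word characters or one non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\S");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher mat = TOKEN.matcher(text);
        // Collect every match in order; whitespace is never matched, so it is skipped.
        while (mat.find()) {
            tokens.add(mat.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I: am a string, with \"punctuation\"."));
        // prints [I, :, am, a, string, ,, with, ", punctuation, ", .]
    }
}
```

Because this approach matches tokens instead of splitting on separators, it cannot produce empty elements. One trade-off is that punctuation inside a word is also split off, so don't becomes don, ' and t.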