How do I count the number of words in a text using regex?
I want to split text into words so that I can count the number of words.
This is how I imagine it:
int words = text.split("[\\p{Punct}*\\p{Space}*]").length;
I've tried multiple combinations, but the text seems to be split into too many parts. For example,
"word1       word2"
...has 8 words with this regex; I want it to be only 2.
int countWords(String input) {
    return input.trim().split("\\s+").length;
}
A word is just text surrounded by whitespace. Parsing words from a String can be done by calling String.split() using "\\s+" as the delimiter.
Note that "\\s+" is a regular expression. It matches strings which consist of at least one whitespace character (such as a space, a tab, or a newline).
int words = text.trim().split("\\s+").length;
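One edge case is worth spelling out: for an empty or all-whitespace string, trim() leaves "", and "".split("\\s+") returns an array holding one empty element, so the one-liner reports 1 instead of 0. The following is a minimal sketch of a guard for that case; the class name WordCount and the sample inputs are made up for illustration.

```java
public class WordCount {
    // Counts whitespace-separated words; returns 0 for empty or blank input.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0; // "".split("\\s+") would yield one empty element
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("word1 word2"));         // 2
        System.out.println(countWords("  one\ttwo\nthree  ")); // 3
        System.out.println(countWords("   "));                 // 0
    }
}
```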
Try the following regex:
[\\p{Punct}\\p{Space}]+
The problem with your current regex is that it matches exactly one character, and thus separately matches each whitespace character between word1 and word2. Inside the brackets, * is a literal asterisk rather than a quantifier. The repetition operator placed outside the character group fixes that.
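To see the difference side by side, here is a small sketch that runs both patterns on the same input. The class name SplitDemo, the helper count, and the sample string (two words separated by seven spaces) are assumptions chosen for illustration.

```java
public class SplitDemo {
    // Returns how many pieces String.split() produces for the given regex.
    static int count(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        // Hypothetical input: seven spaces between the words (String.repeat needs Java 11+).
        String text = "word1" + " ".repeat(7) + "word2";

        // Original pattern: the class matches ONE character; '*' inside [] is a literal asterisk.
        System.out.println(count(text, "[\\p{Punct}*\\p{Space}*]")); // 8

        // Fixed pattern: '+' outside the class treats a run of separators as one delimiter.
        System.out.println(count(text, "[\\p{Punct}\\p{Space}]+"));  // 2
    }
}
```

With the original pattern, each of the seven spaces is its own delimiter, which leaves six empty strings between the two words. Note that even the fixed pattern produces an empty first element when the text starts with a separator, so trimming or filtering empty strings may still be needed.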
Use Guava and define a Splitter as a constant:
private static final Splitter WORD_SPLITTER =
    Splitter.on(CharMatcher.JAVA_LETTER_OR_DIGIT.negate())
        .trimResults()
        .omitEmptyStrings();
and use it in your code:
int words = Iterables.size(WORD_SPLITTER.split(yourString));
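For comparison, the same idea can be sketched without the Guava dependency by counting maximal runs of letters and digits with Character.isLetterOrDigit. This assumes the splitter above treats every character that is not a letter or digit as a separator and drops empty pieces. The class name LetterDigitWordCount is hypothetical, and the sketch has not been checked against Guava's output.

```java
public class LetterDigitWordCount {
    // Counts maximal runs of letters/digits, so punctuation and whitespace both act as separators.
    static int countWords(String text) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            boolean wordChar = Character.isLetterOrDigit(text.charAt(i));
            if (wordChar && !inWord) {
                count++; // a new word starts here
            }
            inWord = wordChar;
        }
        return count;
    }

    public static void main(String[] args) {
        // The apostrophe is a separator here, so "It's" counts as two words.
        System.out.println(countWords("Hello, world! It's 2 words?")); // 6
    }
}
```

Whether an apostrophe or hyphen should split a word depends on the definition of "word" you need; this sketch simply follows the letter-or-digit rule.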