How do I count the number of words in a text using a regular expression?

Question

I want to split text into words so that I can count the number of words.

This is how I imagine it:

int words = text.split("[\\p{Punct}*\\p{Space}*]").length;

I've tried multiple combinations, but the text seems to get split into too many parts. For example:

"word1       word2"

...has 8 words with this regex; I want it to be only 2.

Answer 1

int countWords(String input) {
   return input.trim().split("\\s+").length;
}

A word is just text surrounded by whitespace. Parsing words from a String can be done by calling String.split() with "\\s+" as the delimiter.

Note that "\\s+" is a regular expression. It matches any run of one or more whitespace characters (such as spaces, tabs, or newlines).
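As a minimal sketch of the approach above: the class name `WhitespaceWordCount` and the blank-input guard are illustrative additions, not part of the original answer. The guard is there because `"".split("\\s+")` returns a one-element array holding the empty string, so a blank input would otherwise be counted as one word.

```java
public class WhitespaceWordCount {
    // Counts whitespace-separated tokens; returns 0 for empty or blank input.
    public static int countWords(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            // Guard (an addition to the answer): "".split("\\s+") yields [""].
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("word1       word2"));        // 2
        System.out.println(countWords("  leading and trailing  ")); // 3
        System.out.println(countWords("   "));                      // 0
    }
}
```

The trim() call matters for leading whitespace: without it, split() would return an empty first element and the count would be off by one.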

Answer 2

int words = text.trim().split("\\s+").length;

Answer 3

Try the following regex:

[\\p{Punct}\\p{Space}]+

The problem with your current regex is that it matches exactly one character (inside a character class, * is a literal asterisk, not a quantifier), and thus matches each whitespace character between word1 and word2 separately. Placing the repetition operator outside the character class fixes that.
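To make the difference concrete, here is a small sketch that runs both patterns over the example from the question. The class name `SplitComparison` and the `count` helper are hypothetical, and `String.repeat` assumes Java 11 or later.

```java
public class SplitComparison {
    // Pattern from the question: the class matches exactly one character,
    // and each '*' inside the brackets is a literal asterisk.
    public static final String ORIGINAL = "[\\p{Punct}*\\p{Space}*]";
    // Pattern from this answer: '+' outside the class merges a run of delimiters.
    public static final String FIXED = "[\\p{Punct}\\p{Space}]+";

    // Hypothetical helper: number of pieces produced by splitting on regex.
    public static int count(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        // "word1", seven spaces, "word2" (String.repeat needs Java 11+).
        String text = "word1" + " ".repeat(7) + "word2";
        System.out.println(count(text, ORIGINAL)); // 8: six empty strings sit between the spaces
        System.out.println(count(text, FIXED));    // 2
    }
}
```

One caveat with either pattern: if the text starts with a delimiter, split() returns a leading empty string, so trimming or filtering empty pieces may still be needed.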

Answer 4

Use Guava and define a Splitter as a constant:

private static final Splitter WORD_SPLITTER = 
    Splitter.on(CharMatcher.JAVA_LETTER_OR_DIGIT.negate())
            .trimResults()
            .omitEmptyStrings();

Then use it in your code:

int words = Iterables.size(WORD_SPLITTER.split(yourString));
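If Guava is not available, a similar letters-or-digits notion of a word can be sketched with the JDK alone by counting matches instead of splitting. This is a swapped-in technique, not the Splitter from this answer: the class name `LetterDigitWordCount` is hypothetical, and the pattern `[\\p{L}\\p{Nd}]+` is an assumption meant to approximate what the matcher above treats as a word character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterDigitWordCount {
    // Assumption: a "word" is a maximal run of Unicode letters or decimal digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Counts matches of WORD; everything else acts as a separator.
    public static int countWords(String text) {
        Matcher matcher = WORD.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world! foo")); // 3
        System.out.println(countWords("word1       word2")); // 2
        System.out.println(countWords("..."));               // 0
    }
}
```

Counting matches avoids the empty-string edge cases of split(), but like the Splitter it treats an apostrophe as a separator, so "don't" counts as two words.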

4 answers

Answer 1
3 2012-05-16 16:42:50

Answer 2
3 2012-05-16 16:43:00

Answer 3
2 Accepted 2012-05-16 16:35:50

Answer 4
1 2012-05-16 16:49:45
