
How do I count the number of words in a text using regex?

I want to split text into words so that I can count them.

This is how I imagine it:

int words = text.split("[\\p{Punct}*\\p{Space}*]").length;

I've tried multiple combinations, but it seems to split into too many parts. For example,

"word1       word2" 

...has 8 words with this regex; I want it to be only 2.

int countWords(String input) {
   return input.trim().split("\\s+").length;
}

A word is just text surrounded by whitespace. Parsing words from a String can be done by calling String.split() with "\\s+" as the delimiter.

Note that "\\s+" is a regular expression (the backslash is doubled because it sits inside a Java string literal). It matches a run of one or more whitespace characters, such as spaces, tabs, or newlines.
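A couple of edge cases are worth checking with this approach. Splitting an empty string still returns an array with one (empty) element, and without trim() a leading run of whitespace produces an empty first element. The sketch below is one way to guard against both; the class name WordCountEdgeCases and the choice to return 0 for null or blank input are assumptions for illustration, not part of the answer above.

```java
public class WordCountEdgeCases {

    // Counts whitespace-separated words.
    // Assumption: null, empty, and all-whitespace input count as 0 words.
    static int countWords(String input) {
        if (input == null) {
            return 0;
        }
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            // "".split("\\s+") yields a single empty string, so guard explicitly.
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("word1       word2"));        // 2
        System.out.println(countWords("  leading and trailing  ")); // 3
        System.out.println(countWords("   "));                      // 0
        // Without the guard, the empty string is reported as one word:
        System.out.println("".split("\\s+").length);                // 1
        // Without trim(), leading whitespace adds an empty first element:
        System.out.println(" a".split("\\s+").length);              // 2
    }
}
```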

int words = text.trim().split("\\s+").length;

Try the following regex:

[\\p{Punct}\\p{Space}]+

The problem with your current regex is that it matches exactly one character, and thus separately matches each whitespace character between word1 and word2. Inside a character class, * is a literal asterisk rather than a quantifier, so it adds no repetition. Placing the repetition operator outside the character class fixes that.
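To see the difference, the two patterns can be run side by side on the input from the question. This is only a small sketch: the class name SplitComparison and the helper countParts are made up for the comparison, and the second sample sentence is an invented example.

```java
import java.util.Collections;

public class SplitComparison {

    // Hypothetical helper: number of parts String.split() returns for a regex.
    static int countParts(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        // "word1" and "word2" separated by seven spaces, as in the question.
        String spaces = String.join("", Collections.nCopies(7, " "));
        String text = "word1" + spaces + "word2";

        // Quantifiers inside the brackets: each space is its own delimiter,
        // leaving six empty strings between the two words.
        System.out.println(countParts(text, "[\\p{Punct}*\\p{Space}*]")); // 8

        // Quantifier outside the brackets: the whole run is one delimiter.
        System.out.println(countParts(text, "[\\p{Punct}\\p{Space}]+"));  // 2

        // Adjacent punctuation and whitespace also collapse into one delimiter.
        System.out.println(
            countParts("Hello, world! How are you?", "[\\p{Punct}\\p{Space}]+")); // 5
    }
}
```

Note that text starting with punctuation or whitespace would still produce an empty first element with either pattern, so trimming or filtering the input beforehand may still be needed.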

Use Guava and define a Splitter as a constant:

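// Splits on every character that is not a letter or digit.
// javaLetterOrDigit() is the method form of the older JAVA_LETTER_OR_DIGIT
// constant, which newer Guava releases no longer provide.
private static final Splitter WORD_SPLITTER =
    Splitter.on(CharMatcher.javaLetterOrDigit().negate())
            .trimResults()
            .omitEmptyStrings();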

Then use it in your code:

int words = Iterables.size(WORD_SPLITTER.split(yourString));
