
How do I count the number of words in a text using regex?

I want to split text into words, to count the number of words.

This is how I imagine it to be:

int words = text.split("[\\p{Punct}*\\p{Space}*]").length;

I've tried multiple combinations, but it seems to split into too many parts. For example,

"word1       word2" 

...has 8 words with this regex; I want it to be only 2.

int countWords(String input) {
   return input.trim().split("\\s+").length;
}

A word is just text surrounded by whitespace. Parsing words from a String can be done by calling String.split() using "\\s+" as the delimiter.

Note that "\\s+" is a regular expression (the backslash is doubled because it appears inside a Java string literal). It matches any run of one or more whitespace characters (such as spaces, tabs, or newlines).

int words = text.trim().split("\\s+").length;
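One edge case worth keeping in mind with this approach: for an empty or all-whitespace string, trim() leaves "", and "".split("\\s+") returns a one-element array containing the empty string, so the count comes out as 1 rather than 0. A minimal sketch that guards against this follows; the class name WordCounter and the null check are illustrative additions, not part of the answer above.

```java
public class WordCounter {
    // Counts whitespace-separated words; returns 0 for null, empty, or blank input.
    static int countWords(String input) {
        if (input == null) {
            return 0;
        }
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            // "".split("\\s+") would yield [""], i.e. a length of 1
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("  one two\tthree\n")); // 3
        System.out.println(countWords("   "));                // 0
    }
}
```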

Try the following regex:

[\\p{Punct}\\p{Space}]+

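The problem with your current regex is that it matches exactly one character: inside a character class, * is a literal asterisk rather than a quantifier. It therefore matches each whitespace character between word1 and word2 separately, which produces empty strings in between. Placing the repetition operator outside the character class fixes that.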
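To see the difference, the two patterns can be compared side by side. This is only a small demonstration; the class name SplitDemo and the helper parts() are made up for the example.

```java
public class SplitDemo {
    // Pattern from the question: the '*' characters sit inside the brackets,
    // so they are literal members of the class and each match is one character.
    static final String SINGLE_CHAR = "[\\p{Punct}*\\p{Space}*]";

    // Suggested pattern: '+' outside the brackets matches a whole run of delimiters.
    static final String RUN = "[\\p{Punct}\\p{Space}]+";

    // Number of pieces String.split() produces for the given regex.
    static int parts(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String text = "a\t\t\tb"; // three tab characters between the two words
        System.out.println(parts(text, SINGLE_CHAR)); // 4: "a", "", "", "b"
        System.out.println(parts(text, RUN));         // 2: "a", "b"
    }
}
```

Note that String.split() still returns a leading empty string when the input starts with a delimiter, so input with leading whitespace or punctuation would need to be cleaned up first.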

Use Guava and define a Splitter as a constant:

private static final Splitter WORD_SPLITTER = 
    Splitter.on(CharMatcher.JAVA_LETTER_OR_DIGIT.negate())
            .trimResults()
            .omitEmptyStrings();

and use it in your code:

int words = Iterables.size(WORD_SPLITTER.split(yourString));
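If adding Guava is not an option, a similar effect can be approximated with the JDK alone by counting runs of letters and digits instead of splitting on everything else. This is a sketch of a different technique (Pattern/Matcher), not Guava's implementation; the class name TokenCounter and the pattern [\p{L}\p{Nd}]+ are assumptions chosen to mirror the "letter or digit" idea.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenCounter {
    // One or more Unicode letters or decimal digits in a row.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Counts the runs of letters/digits; leading or trailing delimiters and
    // empty input need no special handling because only matches are counted.
    static int countWords(String input) {
        Matcher matcher = WORD.matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("\tHello, world!\n")); // 2
        System.out.println(countWords(""));                  // 0
    }
}
```

As with the Splitter above, an apostrophe is neither a letter nor a digit, so a contraction such as "don't" counts as two words.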

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 