Regular expression for counting the words in a sentence

Question

public static int getWordCount(String sentence) {
    return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
         + sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}

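My goal is to count the number of words in a sentence. The input to this function is a long sentence. It may have 255 words.

A word may contain hyphens or underscores between its characters.
The function should count only valid words, meaning that runs of special characters should not be counted. For example, &&&& or #### should not count as a word.

The regular expression above works fine, but when a hyphen or underscore appears inside a word, e.g. co-operation, the returned count is 2 when it should be 1. Can anyone help?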

Answer 1

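Please use an approach with constant memory usage instead of the very expensive .split and .replaceAll.

Based on your specification, you seem to be looking for the following regular expression: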

[\w-]+

You can then count the matches with this method:

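import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static int getWordCount(String sentence) {
    Pattern pattern = Pattern.compile("[\\w-]+");
    Matcher matcher = pattern.matcher(sentence);
    int count = 0;
    // Each successful find() advances to the next match, i.e. the next word.
    while (matcher.find()) {
        count++;
    }
    return count;
}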

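Online jDoodle demo.

This approach runs in (more) constant memory: when you split, the program builds an array that is essentially useless, because you never inspect its contents.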
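To make the behaviour concrete, here is a minimal runnable sketch of the counting loop above. The class name WordCounter and the sample sentences are made up for illustration, and the pattern is hoisted into a constant so it is compiled only once, which is a small variation on the method shown in the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the loop itself comes from the answer.
final class WordCounter {
    // Compiled once and reused across calls.
    private static final Pattern WORD = Pattern.compile("[\\w-]+");

    static int getWordCount(String sentence) {
        Matcher matcher = WORD.matcher(sentence);
        int count = 0;
        // No array of tokens is built; only the counter is kept.
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "co-operation", "is", "a", "valid_word"
        System.out.println(getWordCount("co-operation is a valid_word")); // 4
        // Runs of special characters contain nothing the pattern can match.
        System.out.println(getWordCount("&&&& ####")); // 0
    }
}
```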

If you do not want words to start or end with a hyphen, you can use the following regular expression:

\w+([-]\w+)*
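The difference between the two patterns only shows up when hyphens stand alone or sit at the edge of a word. The following sketch compares them on a made-up input; the helper countMatches and the class name are illustrative, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two patterns from the answer.
final class HyphenBoundaryDemo {
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "a -- well-known trick";
        // The loose pattern also accepts the bare "--" as a word.
        System.out.println(countMatches("[\\w-]+", input)); // 4
        // The strict pattern requires word characters on both sides of each hyphen.
        System.out.println(countMatches("\\w+([-]\\w+)*", input)); // 3
    }
}
```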

Answer 2

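The part ([-][_])* is wrong. The notation [xyz] means "any single one of the characters inside the brackets" (see http://www.regular-expressions.info/charclass.html ). So what you actually allow is exactly the character - followed by exactly the character _ , in that order.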
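A quick way to see this is to test two adjacent single-character classes against one class that lists both characters. The sketch below is only an illustration; the class and method names are invented for it.

```java
// Hypothetical demo: two adjacent classes versus one class with two members.
final class CharClassDemo {
    // [-][_] is a sequence: a hyphen, then an underscore.
    static boolean matchesSequence(String s) {
        return s.matches("[-][_]");
    }

    // [-_] is a choice: a single hyphen or a single underscore.
    static boolean matchesEither(String s) {
        return s.matches("[-_]");
    }

    public static void main(String[] args) {
        System.out.println(matchesSequence("-_")); // true
        System.out.println(matchesSequence("-"));  // false
        System.out.println(matchesSequence("_"));  // false
        System.out.println(matchesEither("-"));    // true
        System.out.println(matchesEither("_"));    // true
        System.out.println(matchesEither("-_"));   // false, two characters
    }
}
```

This is why the hyphen in co-operation is not treated as a separator inside a word by the original expression: a lone - never satisfies [-][_].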

Fix your group to make it work:

[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*

It can be simplified further using \\w:

\w+(-\w+)*

This works because \\w matches 0..9, A..Z, a..z and _ ( http://www.regular-expressions.info/shorthand.html ), so you only need to add - .
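As a sanity check, the sketch below confirms that \\w covers the underscore but not the hyphen, and that the explicit and shorthand patterns agree on a sample sentence. The class name, helper and sentence are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the two patterns are the ones given in the answer.
final class ShorthandDemo {
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("_".matches("\\w")); // true
        System.out.println("-".matches("\\w")); // false

        String s = "co-operation needs snake_case too";
        // Both count: co-operation, needs, snake_case, too
        System.out.println(count("[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*", s)); // 4
        System.out.println(count("\\w+(-\\w+)*", s)); // 4
    }
}
```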

Answer 3

If you can use Java 8:

long wordCount = Arrays.stream(sentence.split(" ")) //split the sentence into words   
.filter(s -> s.matches("[\\w-]+")) //filter only matching words
.count();
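Note that this variant is not equivalent to the find-based loop: it splits on single spaces and then requires each whole token to match, so a token with attached punctuation is dropped entirely. The sketch below illustrates the difference with invented method names and sample inputs.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of the split-and-filter and find-based approaches.
final class SplitVersusFind {
    static long countBySplit(String sentence) {
        return Arrays.stream(sentence.split(" "))
                .filter(s -> s.matches("[\\w-]+")) // the whole token must match
                .count();
    }

    static long countByFind(String sentence) {
        Matcher m = Pattern.compile("[\\w-]+").matcher(sentence);
        long count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Hello," fails matches() because of the trailing comma.
        System.out.println(countBySplit("Hello, world")); // 1
        System.out.println(countByFind("Hello, world"));  // 2
        // Without punctuation the two approaches agree.
        System.out.println(countBySplit("co-operation works")); // 2
        System.out.println(countByFind("co-operation works"));  // 2
    }
}
```

Which result is wanted depends on whether a word followed by punctuation should still count as a word; the question does not say.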

Answer 4

With Java 9 or later (Matcher.results() was added in Java 9):

public static int getColumnCount(String row) {
    return (int) Pattern.compile("[\\w-]+")
        .matcher(row)
        .results()
        .count();
}

4 answers

Answer 1
4 2015-06-16 11:41:14

Answer 2
3 2015-06-16 11:39:31

Answer 3
2 2015-06-16 11:58:01

Answer 4
0 2022-03-17 14:41:48