
Fast way of counting number of occurrences of a word in a string using Java

I want to find the number of times a word appears in a string, in a fast and efficient way, using Java.

The words are separated by spaces, and I am looking for complete words only.

Example: 
string: "the colored port should be black or white or brown"
word: "or"
output: 2

In the example above, the "or" inside "colored" and "port" is not counted; only the standalone word "or" is.

I considered using substring() and contains() while iterating over the string, but then the surrounding spaces have to be checked for every hit, which I suppose is not efficient. StringUtils.countMatches() is not efficient either.

The best approach I have tried is splitting the string on spaces, iterating over the words, and comparing each one against the given word:
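The surrounding-space check mentioned above can be sketched roughly as follows. This is only an illustration, assuming single spaces are the only separators; the class name `IndexOfWordCounter` and the method `countWord` are hypothetical, and `indexOf` stands in for the substring()/contains() idea.

```java
final class IndexOfWordCounter {

    // Counts occurrences of `word` that are bounded by a space or by the
    // start/end of the text. Assumes a plain space is the only separator.
    static int countWord(String text, String word) {
        if (word.isEmpty()) {
            return 0; // an empty word would otherwise match everywhere
        }
        int count = 0;
        int from = 0;
        while (true) {
            int start = text.indexOf(word, from);
            if (start < 0) {
                break; // no further hits
            }
            int end = start + word.length();
            boolean leftOk = start == 0 || text.charAt(start - 1) == ' ';
            boolean rightOk = end == text.length() || text.charAt(end) == ' ';
            if (leftOk && rightOk) {
                count++;
            }
            from = start + 1; // continue searching just past this hit
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "the colored port should be black or white or brown";
        System.out.println(countWord(text, "or"));
    }
}
```

This avoids allocating an array of words, at the cost of visiting every substring hit (including the ones inside "colored" and "port") before rejecting it.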

String string = "the colored port should be black or white or brown";
String[] words = string.split(" ");
String word = "or";
int occurrences = 0;
for (int i=0; i<words.length; i++)
    if (words[i].equals(word))
        occurrences++;
System.out.println(occurrences);

However, I expected there to be a more efficient way using Matcher and a regex.

So I tested the following code:

        String string1 = "the colored port should be black or white or brown or";
        //String string2 = "the color port should be black or white or brown or";
        String word = "or";
        Pattern pattern = Pattern.compile("\\s(" + word + ")|\\s(" + word + ")|(" + word + ")\\s");
        Matcher  matcher = pattern.matcher(string1);
        //Matcher  matcher = pattern.matcher(string2);
        int count = 0;
        while (matcher.find()){
            String match = matcher.group();
            count++;
        }
        System.out.println("The word \"" + word + "\" is mentioned " + count + " times.");

It is supposed to be fast enough, and it gives the right answer for string1 but not for string2 (commented out), where the "or" at the end of "color" is also matched. It seems the regex needs a small change.

Any ideas?

How about this, assuming the word won't contain spaces?

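// Note: this misses the word when it is at the very start or end of the string.
int count = string.split("\\s" + word + "\\s").length - 1;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String str = "the colored port should be black or white or brown";
        // Match "or" only when it has a space on both sides.
        Pattern pattern = Pattern.compile(" or ");
        Matcher matcher = pattern.matcher(str);

        int count = 0;
        while (matcher.find())
            count++;

        System.out.println(count);
    }
}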
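A pattern such as " or " requires a literal space on both sides, so it would miss the word at the very start or end of the string, and two adjacent occurrences would compete for the space between them. One possible variation, given here only as a sketch, uses lookarounds so that the neighbouring whitespace is checked but not consumed. The class name `RegexWordCounter` and the method `countWord` are hypothetical, and the word is assumed to be non-empty.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexWordCounter {

    // Counts whole-word occurrences of `word`, where a "word" is any run of
    // non-whitespace characters. Assumes `word` is non-empty.
    static int countWord(String text, String word) {
        // (?<!\S) : not preceded by a non-whitespace character
        // (?!\S)  : not followed by a non-whitespace character
        // Pattern.quote keeps regex metacharacters in `word` literal.
        Pattern pattern = Pattern.compile("(?<!\\S)" + Pattern.quote(word) + "(?!\\S)");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "the color port should be black or white or brown or";
        System.out.println("The word \"or\" is mentioned " + countWord(text, "or") + " times.");
    }
}
```

If the same word is counted many times, compiling the Pattern once and reusing it avoids paying the compilation cost on every call.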

I benchmarked three approaches: split-based and Matcher-based (both from the question), and Collections.frequency()-based (suggested in a comment above by @4castle). For each one I measured the total time of a loop repeated 10 million times. The split-based approach turned out to be the fastest:

String string = "the colored port should be black or white or brown";
String[] words = string.split(" ");
String word = "or";
int occurrences = 0;
for (int i=0; i<words.length; i++)
    if (words[i].equals(word))
        occurrences++;
System.out.println(occurrences);

Next comes the Collections.frequency()-based approach, with a slightly longer running time (~5% slower):

String string = "the colored port should be black or white or brown or";
String word = "or";
int count = Collections.frequency(Arrays.asList(string.split(" ")), word);
System.out.println("The word \"" + word + "\" is mentioned " + count + " times.");

The Matcher-based solution (from the question) is much slower, taking about 5 times as long.
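For anyone who wants to repeat the comparison, a timing loop along these lines could be used. It is a rough sketch, not the code behind the numbers above: the class `CountBenchmark` and its methods are hypothetical, the regex is a precompiled lookaround pattern rather than the one from the question, and a plain System.nanoTime() loop is sensitive to JIT warm-up, so a harness such as JMH would give more reliable figures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.function.IntSupplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CountBenchmark {
    static final String TEXT = "the colored port should be black or white or brown or";
    static final String WORD = "or";
    // Assumption: the pattern is compiled once, outside the timed loop.
    static final Pattern PATTERN =
            Pattern.compile("(?<!\\S)" + Pattern.quote(WORD) + "(?!\\S)");

    // Written to a volatile field so the JIT cannot discard the work.
    static volatile long sink;

    static int splitCount() {
        int occurrences = 0;
        for (String w : TEXT.split(" ")) {
            if (w.equals(WORD)) {
                occurrences++;
            }
        }
        return occurrences;
    }

    static int frequencyCount() {
        return Collections.frequency(Arrays.asList(TEXT.split(" ")), WORD);
    }

    static int matcherCount() {
        Matcher matcher = PATTERN.matcher(TEXT);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Runs `task` `iterations` times and returns the elapsed nanoseconds.
    static long timeNanos(IntSupplier task, int iterations) {
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            total += task.getAsInt();
        }
        long elapsed = System.nanoTime() - start;
        sink = total;
        return elapsed;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;
        System.out.println("split:     " + timeNanos(CountBenchmark::splitCount, iterations) / 1_000_000 + " ms");
        System.out.println("frequency: " + timeNanos(CountBenchmark::frequencyCount, iterations) / 1_000_000 + " ms");
        System.out.println("matcher:   " + timeNanos(CountBenchmark::matcherCount, iterations) / 1_000_000 + " ms");
    }
}
```

The absolute timings depend on the JVM, the hardware, and the input, so the ratios reported above should be re-measured on the data you actually care about.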


 