
Counting the number of occurrences of words in a text in Java

I'm building a TreeMap from scratch and trying to count the number of occurrences of every word in a text using Java. The text comes from a text file, and reading it is not a problem. What I don't know is how to count every word. Can someone help?

Imagine the text is something like:

Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.

Output: 
Over 1
time 1
computer 1
algorithms 5
...

If possible, I want to ignore case and count uppercase and lowercase occurrences of a word together.

EDIT: I don't want to use any sort of Map (e.g. HashMap) or anything similar to do this.

Break down the problem as follows (this is one potential solution - not THE solution):

  1. Split the text into words (create a list or array of words).
  2. Remove punctuation marks.
  3. Create your map to collect results.
  4. Iterate over your list of words and add "1" to the value of each encountered key.
  5. Display results (iterate over the map's entrySet()).

Split the text into words

My preference is to split words using whitespace as the delimiter. The reason is that if you split on non-word characters, you may miss some hyphenated words. Although hyphenation is becoming less common, plenty of words still follow this rule, for example middle-aged. If such a word is encountered, it MIGHT have to be treated as one word and not two.
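To illustrate the difference, here is a minimal sketch comparing the two delimiters on a short sample. The class and method names (SplitComparison, splitOnWhitespace, splitOnNonWord) are made up for this example.

```java
import java.util.Arrays;

class SplitComparison {
    // Split on runs of whitespace: hyphenated words stay intact.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    // Split on runs of non-word characters: the hyphen acts as a delimiter.
    static String[] splitOnNonWord(String text) {
        return text.trim().split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "a middle-aged engineer";
        System.out.println(Arrays.toString(splitOnWhitespace(sample))); // [a, middle-aged, engineer]
        System.out.println(Arrays.toString(splitOnNonWord(sample)));    // [a, middle, aged, engineer]
    }
}
```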

Remove punctuation marks

Because of the decision above, you first need to remove punctuation characters that might be attached to your words. Keep in mind that if you use a regular expression to split the words, you might be able to accomplish this step at the same time as the step above. In fact, that would be preferred, so that you don't have to iterate over the text twice: do both in a single pass. While you are at it, call toLowerCase() on the input string to eliminate the distinction between capitalized and lowercase words.
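One way to do the split and the punctuation removal in a single pass is sketched below. It assumes that hyphens and apostrophes inside a word should be kept, which is a choice made for this example only (a blanket replaceAll("\\p{P}", "") would also strip them, turning "middle-aged" into "middleaged"). The Tokenizer class and tokenize method are hypothetical names.

```java
import java.util.Arrays;
import java.util.Locale;

class Tokenizer {
    // Lowercase the text, then split on every run of characters that is not
    // a letter, a digit, an apostrophe, or a hyphen. Keeping ' and - is an
    // assumption made here so that "other's" and "middle-aged" stay whole.
    static String[] tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}'-]+"))
                .filter(token -> !token.isEmpty()) // a leading delimiter yields one empty token
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String sample = "Over time, middle-aged engineers use each other's work.";
        System.out.println(Arrays.toString(tokenize(sample)));
        // [over, time, middle-aged, engineers, use, each, other's, work]
    }
}
```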

Create your map to collect results

This is where you collect the counts, using the TreeMap implementation of the Java Map interface. One thing to be aware of with this implementation is that the map is sorted according to the natural ordering of its keys. In this case, since the keys are the words from the input text, the entries will be arranged in alphabetical order, not by count. If sorting the entries by count is important, you can "reverse" the map, making the values the keys and the keys the values. However, since two or more words can have the same count, the new map needs to be a Map<Integer, Set<String>> so that words with the same count are grouped together.
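The "reverse" map described above could be sketched as follows. CountInverter and invert are hypothetical names, and the reverse-ordered TreeMap (highest count first) is one possible choice rather than a requirement.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CountInverter {
    // Groups words by their count; the highest count comes first.
    static Map<Integer, Set<String>> invert(Map<String, Integer> wordCount) {
        Map<Integer, Set<String>> byCount = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {
            // Create the set for this count on first use, then add the word to it.
            byCount.computeIfAbsent(entry.getValue(), count -> new TreeSet<>())
                   .add(entry.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new TreeMap<>();
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        wordCount.put("of", 2);
        wordCount.put("time", 1);
        System.out.println(invert(wordCount)); // {5=[algorithms], 2=[for, of], 1=[time]}
    }
}
```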

Iterate over your list of words

At this point, you should have a list of words and a map structure to collect the counts. Using a lambda expression, you should be able to perform a count() of your words very easily. But if you are not familiar or comfortable with lambda expressions, you can use a regular loop to iterate over your list, do a containsKey() check to see whether the word was encountered before, get() the value if the map already contains the word, and then add "1" to the previous value. Lastly, put() the new count in the map.
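As a shorter alternative to the containsKey()/get()/put() sequence, Map.merge (available since Java 8) performs the same update in one call. This is a minimal sketch; MergeCounter and count are names invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class MergeCounter {
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> wordCount = new TreeMap<>();
        for (String word : words) {
            // Stores 1 for a new word, otherwise adds 1 to the stored count.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String[] words = {"other", "algorithms", "other", "algorithms", "algorithms"};
        System.out.println(count(words)); // {algorithms=3, other=2}
    }
}
```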

Display results

Again, you can use a lambda expression to print the key-value pairs of the entry set, or simply iterate over the entry set to display the results.
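A minimal sketch of the lambda variant, using Map.forEach. The lines are collected into a string here only so the result is easy to inspect; CountPrinter and format are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

class CountPrinter {
    // Builds one "word: count" line per entry, in the map's iteration order.
    static String format(Map<String, Integer> wordCount) {
        StringBuilder out = new StringBuilder();
        wordCount.forEach((word, count) ->
                out.append(word).append(": ").append(count).append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new TreeMap<>();
        wordCount.put("other", 2);
        wordCount.put("algorithms", 5);
        System.out.print(format(wordCount));
        // algorithms: 5
        // other: 2
    }
}
```

To print directly instead, the lambda body can call System.out.println(word + ": " + count).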

Based on all of the above points, a potential solution could look like this (not using lambdas, for the OP's sake):

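import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

public static void main(String[] args) {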
    String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    
    text = text.replaceAll("\\p{P}", ""); // replace all punctuations
    text = text.toLowerCase(); // turn all words into lowercase
    String[] wordArr = text.split(" "); // create list of words

    Map<String, Integer> wordCount = new TreeMap<>();
    
    // Collect the word count
    for (String word : wordArr) {
        if(!wordCount.containsKey(word)){
            wordCount.put(word, 1);
        } else {
            int count = wordCount.get(word);
            wordCount.put(word, count + 1);
        }
    }
    
    Iterator<Entry<String, Integer>> iter = wordCount.entrySet().iterator();
    
    System.out.println("Output: ");
    while(iter.hasNext()) {
        Entry<String, Integer> entry = iter.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}

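This produces the following output (note that "other's" is counted as "others" because the apostrophe is stripped along with the rest of the punctuation):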

Output: 
advantage: 1
algorithms: 5
and: 1
combine: 1
computer: 1
each: 1
engineers: 1
even: 1
for: 2
in: 1
invent: 1
more: 1
new: 1
of: 2
other: 2
others: 1
over: 1
producing: 1
results: 2
take: 1
the: 1
things: 1
time: 1
to: 1
turn: 1
utilize: 1
with: 1
work: 1

Why did I break the problem down like this for such a mundane task? Simple. I believe each of those discrete steps should be extracted into its own function to improve code reusability. Yes, it is cool to use a lambda expression to do everything at once and make your code look much simpler. But what if you need to repeat some intermediate step over and over? Most of the time, code ends up being duplicated to accomplish this. Often a better solution is to break these tasks into methods. Some of these tasks, like transforming the input text, can be done in a single method since those activities are closely related. (There is such a thing as a method doing "too little.")

public String[] createWordList(String text) {
    return text.replaceAll("\\p{P}", "").toLowerCase().split(" ");
}

public Map<String, Integer> createWordCountMap(String[] wordArr) {
    Map<String, Integer> wordCountMap = new TreeMap<>();

    for (String word : wordArr) {
        if(!wordCountMap.containsKey(word)){
            wordCountMap.put(word, 1);
        } else {
            int count = wordCountMap.get(word);
            wordCountMap.put(word, count + 1);
        }
    }

    return wordCountMap;
}

public void displayCount(Map<String, Integer> wordCountMap) {
    Iterator<Entry<String, Integer>> iter = wordCountMap.entrySet().iterator();
    
    while(iter.hasNext()) {
        Entry<String, Integer> entry = iter.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}

Now, after doing that, your main method looks more readable and your code is more reusable.

public static void main(String[] args) {
    
    WordCount wc = new WordCount();
    String text = "...";
    
    String[] wordArr = wc.createWordList(text);
    Map<String, Integer> wordCountMap = wc.createWordCountMap(wordArr);
    wc.displayCount(wordCountMap);
}

UPDATE :

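One small detail I forgot to mention: if a HashMap is used instead of a TreeMap, the output is no longer in alphabetical order. For this particular input it may happen to look as if it were sorted by count, but that is a coincidence. HashMap places entries according to the hash code of the key (the word), not the value, and it guarantees no particular iteration order. If you need the result ordered by count, you still have to sort the entries or "reverse" the map as described above. After switching to HashMap, the output may look like this: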

Output: 
algorithms: 5
other: 2
for: 2
turn: 1
computer: 1
producing: 1
...
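If the result is needed in descending order of count, one option that does not depend on the map implementation is to copy the entries into a list and sort them explicitly (HashMap's documentation makes no guarantees about iteration order). This is a sketch with made-up names (SortByCount, sortedByCount); breaking ties alphabetically is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SortByCount {
    // Returns the entries ordered by count (highest first); ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> wordCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordCount.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // higher count first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new HashMap<>();
        wordCount.put("time", 1);
        wordCount.put("for", 2);
        wordCount.put("algorithms", 5);
        wordCount.put("other", 2);
        for (Map.Entry<String, Integer> entry : sortedByCount(wordCount)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
        // algorithms: 5
        // for: 2
        // other: 2
        // time: 1
    }
}
```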

My suggestion is to use a regular expression with split, plus a stream for the counting. I think that's what you mean, but I'm not sure whether using a List here is still more than you wanted.

// Requires java.util.ArrayList, java.util.Arrays and java.util.List.
@Test
public void testApp() {
    final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    final String[] split = text.split("\\W+"); // splits on non-word characters, so "other's" becomes "other" and "s"
    final List<String> list = new ArrayList<>(); // words already counted, in lowercase
    System.out.println("Output: ");
    for (String s : split) {
        final String word = s.toLowerCase(); // check and store in the same case so each word prints once
        if (!list.contains(word)) {
            list.add(word);
            final long count = Arrays.stream(split).filter(word::equalsIgnoreCase).count();
            System.out.println(word + " " + count);
        }
    }

}
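Another way to avoid a Map entirely, offered here only as a sketch, is to sort the word array so that equal words become adjacent and then count each run. SortedRunCounter and count are hypothetical names; as above, splitting on \W+ breaks "other's" into "other" and "s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SortedRunCounter {
    // Returns "word count" lines in alphabetical order without using any Map.
    static List<String> count(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
        Arrays.sort(words); // equal words become adjacent
        List<String> lines = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int j = i;
            while (j < words.length && words[j].equals(words[i])) {
                j++; // advance to the end of the run of equal words
            }
            if (!words[i].isEmpty()) { // split can yield a leading empty string
                lines.add(words[i] + " " + (j - i));
            }
            i = j;
        }
        return lines;
    }

    public static void main(String[] args) {
        count("Algorithms combine with other algorithms").forEach(System.out::println);
        // algorithms 2
        // combine 1
        // other 1
        // with 1
    }
}
```

Sorting costs O(n log n), whereas rescanning the whole array for every distinct word is quadratic in the worst case.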

Below is a test for your example that uses a Map:

@Test
public void test() {
    final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
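    // Requires java.util.Arrays, java.util.Map and java.util.stream.Collectors.
    Map<String, Long> result = Arrays.stream(text.split("\\W+")).collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
    assertEquals(result.get("algorithms"), Long.valueOf(5)); // Long.valueOf replaces the deprecated new Long(5)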
}
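If the grouped result should also come out in alphabetical order, groupingBy has an overload that accepts a map factory. The sketch below passes TreeMap::new; GroupingCounter and count are made-up names, and filtering out empty tokens is a precaution added for this example.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GroupingCounter {
    static Map<String, Long> count(String text) {
        return Arrays.stream(text.split("\\W+"))
                .filter(word -> !word.isEmpty()) // skip the empty token a leading delimiter produces
                .collect(Collectors.groupingBy(
                        word -> word.toLowerCase(Locale.ROOT), // group case-insensitively
                        TreeMap::new,                          // map factory: keeps keys sorted
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("Algorithms combine with other algorithms"));
        // {algorithms=2, combine=1, other=1, with=1}
    }
}
```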

The technical posts on this site follow the CC BY-SA 4.0 license. If you reprint this content, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 