Java word counter

Question

I am having trouble counting words in Java.

I have a Map

Map<String,StringBuilder> files_and_tex = new TreeMap<String,StringBuilder>();

The String key is the file name, and the StringBuilder value holds the file's text.

For example

StringBuilder file_text = new StringBuilder();
StringBuilder file_text2 = new StringBuilder();

file_text.append("some contents some file one");
files_and_tex.put("file1", file_text);

file_text2.append("test words test test words");    
files_and_tex.put("file2", file_text2);

Now I want to make a dictionary that can tell me:

         |word 1 | word 2 | word 3 ........
file 1   | 3     |    1   |  0 .........
file 2   | 6     |    2   |  9 .........
.......
.......

Words 1, 2, 3 and so on are the corpus words. Files 1, 2, 3 and so on are the file names. Each value in this matrix is the number of times that word occurs in that file.

I moved from C to Java recently. I know how to write messy (structured) code to solve this problem, but I am wondering how to do it in a pure object-oriented style, especially in Java.

Note: it is not an assignment!

Answer 1

Google's Guava Libraries have some very useful utilities and data structures for this sort of problem.

To split up the file into words you can use Splitter:

Iterable<String> wordsInFile = 
   Splitter.on(' ').trimResults().omitEmptyStrings().split(fileAsString);

To count up the occurrences of a given word, you can use Multiset:

Multiset<String> countOfEachWord = HashMultiset.create();
countOfEachWord.addAll(wordsInFile);

You could build on these two pieces to create some kind of object like a WordLookupTable. For example:

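import java.util.Map;

import com.google.common.base.Splitter;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Maps;
import com.google.common.collect.Multiset;

public class WordLookupTable {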

  private static final Splitter SPLITTER = Splitter.on(' ').trimResults().omitEmptyStrings();  
  private final Map<String, Multiset<String>> filenameToWordCountSet = Maps.newHashMap();

  public void addFile(String filename, String fileText) {
    Multiset<String> wordsInFile = getWordSetForFile(filename);

    for (String word : SPLITTER.split(fileText)) {
      wordsInFile.add(word);
    }
  }

  // Gets the count of all words for the file
  public long getCountOfWordsForFile(String filename) {
    return getWordSetForFile(filename).size();
  }

  public long getCountOfWordInFile(String filename, String word) {
    return getWordSetForFile(filename).count(word);
  }

  public long getCountOfWordOverAllFiles(String word) {
    long count = 0;
    for (Multiset<String> wordSet : filenameToWordCountSet.values()) {
      count += wordSet.count(word);
    }
    return count;
  }

  private Multiset<String> getWordSetForFile(String filename) {
    Multiset<String> wordsInFile = filenameToWordCountSet.get(filename);
    if(wordsInFile == null) {
      wordsInFile = HashMultiset.create();
      filenameToWordCountSet.put(filename, wordsInFile);
    }
    return wordsInFile;
  }
}
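
If adding Guava is not an option, the counting part can be approximated with a plain HashMap and Map.merge (Java 8 or later). This is only a sketch of that substitute, not the Multiset approach itself; the class name PlainWordCounter and the method countWords are made up for illustration, and it splits on any whitespace rather than on single spaces.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Multiset-based counting, using only java.util.
class PlainWordCounter {

    // Splits the text on runs of whitespace and counts each word.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // Insert 1 for a new word, otherwise add 1 to the existing count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("test words test test words");
        System.out.println(counts.get("test"));               // 3
        System.out.println(counts.getOrDefault("missing", 0)); // 0
    }
}
```

Unlike Multiset.count, a plain Map returns null for a word it has never seen, which is why the lookup above goes through getOrDefault.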

Answer 2

There are many ways to do this. Let me explain one that is efficient, easy to understand and, of course, object-oriented.

[Step 1] You need two kinds of map: one per file that stores that file's word counts, and one that maps each file name to that file's word map. Instead of the file name you can use whatever key you want.

private static HashMap<String, MutableInt> wordMap1 = new HashMap<String, MutableInt>();
private static HashMap<String, MutableInt> wordMap2 = new HashMap<String, MutableInt>();
private static HashMap<String, HashMap<String, MutableInt>> fileMap = new HashMap<String, HashMap<String, MutableInt>>();

[Step 2] Make the MutableInt class (technically you want to do this first). You might ask what MutableInt is: it is a class you create so that you can increment the count for a given word each time you encounter it. A new instance starts at 1 because it is created the first time the word is seen.

Here is an example of the MutableInt class:

class MutableInt {
    int value = 1;
    public void increase () { ++value; }
    public int getValue () { return value; }
    public String toString(){
        return Integer.toString(value);
    }
}

[Step 3] Now, for each file you parse, do the following (repeating the middle steps for every word in the file):

Create a new wordMap for the file you are parsing.
Get the next word from the file.
Check whether the word is already in the map using wordMap.get("word");
If the result is null, it is a new word: put it in the map with a new MutableInt as its value using wordMap.put("word", new MutableInt());
If the result is not null, the word has been seen before, so increase its counter using wordMap.get("word").increase();
Once you have done this for all the words in the file, put the wordMap in the fileMap using fileMap.put("filename", wordMap);
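
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's exact code: the wrapper class MutableIntDemo and the helper countWords are hypothetical names, and words are assumed to be separated by whitespace.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper class that ties Steps 1 to 3 together.
class MutableIntDemo {

    // Same counter idea as in Step 2: a new instance starts at 1.
    static class MutableInt {
        private int value = 1;
        void increase() { ++value; }
        int getValue() { return value; }
        @Override
        public String toString() { return Integer.toString(value); }
    }

    // Step 3 for a single file: builds that file's word map.
    static Map<String, MutableInt> countWords(String text) {
        Map<String, MutableInt> wordMap = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for blank input
            }
            MutableInt count = wordMap.get(word);
            if (count == null) {
                wordMap.put(word, new MutableInt()); // first occurrence
            } else {
                count.increase();                    // seen before
            }
        }
        return wordMap;
    }

    public static void main(String[] args) {
        // Step 1: the outer map from file name to that file's word map.
        Map<String, Map<String, MutableInt>> fileMap = new HashMap<>();
        fileMap.put("file1", countWords("some contents some file one"));
        fileMap.put("file2", countWords("test words test test words"));

        System.out.println(fileMap.get("file1").get("some").getValue()); // 2
        System.out.println(fileMap.get("file2").get("test").getValue()); // 3
    }
}
```

The point of MutableInt here is that a repeated word costs one map lookup and an in-place increment, instead of a get followed by a put of a new Integer.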

Answer 3

Here is an example that should get you going:

Map<String, StringBuilder> files_and_tex = new HashMap<String, StringBuilder>();

StringBuilder file_text = new StringBuilder();
StringBuilder file_text2 = new StringBuilder();
file_text.append("some contents some file one");
files_and_tex.put("file1", file_text);

file_text2.append("test words test test words");    
files_and_tex.put("file2", file_text2);

// Maps from file-name to word to count
Map<String, Map<String, Integer>> wordCounts =
        new HashMap<String, Map<String, Integer>>();

// Go through each filename (key in files_and_tex)
for (String file : files_and_tex.keySet()) {

    // Create a map to keep track of word counts for this file
    Map<String, Integer> wc = new HashMap<String, Integer>();
    wordCounts.put(file, wc);

    Scanner s = new Scanner(files_and_tex.get(file).toString());
    while (s.hasNext()) {
        String word = s.next();
        if (!wc.containsKey(word))
            wc.put(word, 0);
        wc.put(word, wc.get(word) + 1);
    }
}

// And here is how to access the resulting data
System.out.println(wordCounts.get("file1").get("file")); // prints 1
System.out.println(wordCounts.get("file2").get("test")); // prints 3

By the way, Java naming conventions recommend camelCase for identifiers (for example filesAndText rather than files_and_tex).
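
To get from per-file counts like these to the file-by-word matrix the question asks for, one possible approach is to collect every word seen in any file into a sorted set and print one row per file, with 0 for words a file does not contain. The class WordMatrix and the methods countAll and render are hypothetical names for this sketch; it uses Map.merge instead of the containsKey/put pair, and the column widths are arbitrary.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper that prints a file-by-word count matrix.
class WordMatrix {

    // Counts whitespace-separated words per file; TreeMap keeps names sorted.
    static Map<String, Map<String, Integer>> countAll(Map<String, StringBuilder> files) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map.Entry<String, StringBuilder> e : files.entrySet()) {
            Map<String, Integer> wc = new TreeMap<>();
            for (String word : e.getValue().toString().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    wc.merge(word, 1, Integer::sum);
                }
            }
            counts.put(e.getKey(), wc);
        }
        return counts;
    }

    // Renders a header row of corpus words, then one row of counts per file.
    static String render(Map<String, Map<String, Integer>> counts) {
        // The corpus is the union of the words of all files, in sorted order.
        Set<String> corpus = new TreeSet<>();
        for (Map<String, Integer> wc : counts.values()) {
            corpus.addAll(wc.keySet());
        }

        StringBuilder out = new StringBuilder(String.format("%-8s", ""));
        for (String word : corpus) {
            out.append(String.format("| %-9s", word));
        }
        out.append('\n');

        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            out.append(String.format("%-8s", e.getKey()));
            for (String word : corpus) {
                // A word missing from this file counts as 0.
                out.append(String.format("| %-9d", e.getValue().getOrDefault(word, 0)));
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, StringBuilder> files = new TreeMap<>();
        files.put("file1", new StringBuilder("some contents some file one"));
        files.put("file2", new StringBuilder("test words test test words"));
        System.out.print(render(countAll(files)));
    }
}
```

With the two sample files, the header row lists the six distinct words in alphabetical order, followed by one row for file1 and one for file2.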

Java word counter

Question

3 answers

solution1
3 2010-12-16 16:45:36

solution2
1 2010-12-16 16:56:45

solution3
0 2010-12-16 16:44:39
