
word count frequency in document

I have a directory containing 1000 .txt files. For every word, I want to know how many of the 1000 documents it occurs in. So even if the word "cow" occurs 100 times in document X, it still counts as one; if it also occurs in a different document, the count is incremented by one. The maximum is therefore 1000, reached if "cow" appears in every single document. How do I do this the easy way, without using any external library? Here's what I have so far:

     private Hashtable<String, Integer> getAllWordCount()
    {
        Hashtable<String, Integer> result = new Hashtable<String, Integer>();
        HashSet<String> words = new HashSet<String>();
        try {   
            for (int j = 0; j < fileDirectory.length; j++){
                File theDirectory = new File(fileDirectory[j]);
                File[] children = theDirectory.listFiles();

                for (int i = 0; i < children.length; i++){
                    Scanner scanner = new Scanner(new FileReader(children[i]));

                    while (scanner.hasNext()){
                        String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                        if (words.contains(text) == false){
                            if (result.get(text) == null)
                                result.put(text, 1);
                            else
                                result.put(text, result.get(text) + 1);
                            words.add(text);
                        }
                    }
                    words.clear(); // reset the per-file set before the next file
                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println(result.size());
        return result;
    }

You also need a HashSet<String> in which you store each unique word you've read from the current file.

Then, after reading each word, check whether it is already in the set. If it isn't, increment the corresponding value in the result map (or add a new entry if there is none yet, as you already do) and add the word to the set.

Don't forget to reset the set when you start reading a new file, though.
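
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual method: the class name `DocumentFrequency` and the method `countDocuments` are made up for the example, and the documents are passed in as plain strings instead of being read from files so that the snippet runs on its own. It also assumes that tokens consisting only of punctuation (which become empty after `replaceAll`) should be skipped.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

public class DocumentFrequency {

    // Counts, for each word, the number of documents that contain it at least once.
    static Map<String, Integer> countDocuments(List<String> documents) {
        Map<String, Integer> result = new HashMap<>();
        for (String document : documents) {
            // A fresh set per document, so a repeated word is counted only once.
            Set<String> seen = new HashSet<>();
            Scanner scanner = new Scanner(document);
            while (scanner.hasNext()) {
                String word = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                if (word.isEmpty()) {
                    continue; // the token was pure punctuation
                }
                // Set.add returns true only the first time the word is seen in this document.
                if (seen.add(word)) {
                    Integer count = result.get(word);
                    result.put(word, count == null ? 1 : count + 1);
                }
            }
            scanner.close();
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "cow cow cow moo", "the cow jumped", "no animals here");
        Map<String, Integer> counts = countDocuments(documents);
        System.out.println(counts.get("cow")); // 2: "cow" appears in two of the three documents
        System.out.println(counts.get("moo")); // 1
    }
}
```

Because `seen` is created inside the loop over documents, there is no separate reset step to forget.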

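How about this? Collect the distinct words of each file in a set first, then bump the count once for every word in that set: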

private Hashtable<String, Integer> getAllWordCount()
{
    Hashtable<String, Integer> result = new Hashtable<String, Integer>();
    HashSet<String> words = new HashSet<String>();
    try {   
        for (int j = 0; j < fileDirectory.length; j++){
            File theDirectory = new File(fileDirectory[j]);
            File[] children = theDirectory.listFiles();
            for (int i = 0; i < children.length; i++){
                Scanner scanner = new Scanner(new FileReader(children[i]));
                while (scanner.hasNext()){
                    String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    words.add(text);
                }
                scanner.close();
                for (String word : words) {
                    // Each word in the set occurred at least once in this file.
                    Integer count = result.get(word);
                    if (count == null) {
                        result.put(word, 1);
                    } else {
                        result.put(word, count + 1);
                    }
                }
                words.clear();
            }
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    System.out.println(result.size());
    return result;
}
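
For comparison, here is a hedged variant of the same idea for a single directory, using only the standard library. The class name `DirectoryWordCount` and the method `countInDirectory` are invented for this sketch. It assumes Java 8 or later (for `Map.merge`), skips anything that is not a regular file, ignores tokens that are empty after stripping punctuation, and reads files with the platform default charset.

```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

public class DirectoryWordCount {

    // Returns, for each word, the number of files in the directory that contain it.
    static Map<String, Integer> countInDirectory(File directory) throws IOException {
        Map<String, Integer> result = new HashMap<>();
        File[] children = directory.listFiles();
        if (children == null) {
            return result; // not a directory, or it could not be listed
        }
        for (File child : children) {
            if (!child.isFile()) {
                continue; // skip subdirectories and other non-regular entries
            }
            Set<String> words = new HashSet<>();
            // try-with-resources closes the file even if reading fails.
            try (Scanner scanner = new Scanner(child)) {
                while (scanner.hasNext()) {
                    String word = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
            // One increment per distinct word per file.
            for (String word : words) {
                result.merge(word, 1, Integer::sum);
            }
        }
        return result;
    }
}
```

Closing each `Scanner` matters here because leaving 1000 files open at once can exhaust file handles on some systems.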

