Counting the frequency of words in a .txt file in Java

Question

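I'm working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of the words that appear in a .txt file.

I have sets of English and French text files, numbered 1-20, in their respective folders. The method takes a directory (in this case "docs/train/eng/" or "docs/train/fre/") and the number of files the program should go through (there are 20 files in each folder). It then reads each file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word, along with how many times it occurs in the file, into a HashMap (key = word, value = frequency).

Here is the code I came up with for the method: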

public static HashMap<String, Integer> countWords(String directory, int nFiles) {
  // Declare the HashMap
  HashMap<String, Integer> wordCount = new HashMap();

  // this large 'for' loop will go through each file in the specified directory.
  for (int k = 1; k < nFiles; k++) {
    // Puts together the string that the FileReader will refer to.
    String learn = directory + k + ".txt";

    try {
      FileReader reader = new FileReader(learn);
      BufferedReader br = new BufferedReader(reader);
      // The BufferedReader reads the lines

      String line = br.readLine();

      // Split the line into a String array to loop through
      String[] words = line.split(" ");
      int freq = 0;

      // for loop goes through every word
      for (int i = 0; i < words.length; i++) {
        // Case if the HashMap already contains the key.
        // If so, just increments the value
        if (wordCount.containsKey(words[i])) {
          wordCount.put(words[i], freq++);
        }
        // Otherwise, puts the word into the HashMap
        else {
          wordCount.put(words[i], freq++);
        }
      }
    }
    // Catching the file not found error
    // and any other errors
    catch (FileNotFoundException fnfe) {
      System.err.println("File not found.");
    }
    catch (Exception e) {
      System.err.print(e);
    }
  }
  return wordCount;
}

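The code compiles. Unfortunately, when I asked it to print the word counts for all 20 files, the output was complete gibberish (though the words are definitely there) and not at all what I need the method to do.

If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, running test after test, and I'm ready to give up.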

Answer 1

I would have expected something more like this. Does it make sense?

if (wordCount.containsKey(words[i])) { 
  int n = wordCount.get(words[i]);    
  wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
  wordCount.put(words[i], 1);
}

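If the word is already in the hashmap, we want to get the current count, add 1 to it, and put the word back into the hashmap with the new count.

If the word is not yet in the hashmap, we simply put it in the map with a count of 1. The next time we see the same word, we'll bump the count to 2, and so on.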
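
For what it's worth, the same get / add 1 / put logic can also be written with Map.merge, which is available from Java 8 on. This is only a sketch: the class name WordCountSketch and the helper count are made up for illustration and are not part of the question's code.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Hypothetical helper: counts how often each word occurs in the array.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            // merge() stores 1 for a new key; for an existing key it
            // replaces the old count with Integer.sum(oldCount, 1).
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                count(new String[] {"le", "chat", "et", "le", "chien"});
        System.out.println(counts.get("le"));   // 2
        System.out.println(counts.get("chat")); // 1
    }
}
```

The explicit containsKey version above does exactly the same thing; merge just folds the two branches into one call.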

Answer 2

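If you split on spaces only, other symbols (parentheses, punctuation marks, etc.) will end up inside the words. For example, splitting "This phrase, contains... funny stuff" on spaces gives: "This" "phrase," "contains..." "funny" and "stuff".

You can avoid this by splitting on word boundaries instead (\b, written "\\b" in a Java string literal).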

line.split("\\b");
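
One thing to watch for: splitting on "\\b" keeps the separators as tokens of their own (" ", ", ", and so on), so they would be counted as if they were words. The sketch below shows that next to an alternative that splits on runs of non-letter characters with "\\P{L}+". That alternative is my own substitution, not something from the question; I picked \P{L} because Java's \w and \W only treat ASCII letters as word characters by default, which matters for accented French words. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    // Splits at word boundaries; the text between words comes back as tokens too.
    static List<String> splitOnBoundaries(String line) {
        return Arrays.asList(line.split("\\b"));
    }

    // Splits on runs of non-letter characters and drops empty tokens
    // (an empty token appears when the line starts with punctuation).
    static List<String> splitIntoWords(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\P{L}+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "This phrase, contains... funny stuff";
        System.out.println(splitOnBoundaries(line)); // includes " " and ", " tokens
        System.out.println(splitIntoWords(line));    // [This, phrase, contains, funny, stuff]
    }
}
```

Note that the letters-only split also drops digits and breaks words at apostrophes, so whether it fits depends on what the assignment counts as a word.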

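By the way, your if and else branches are identical. In both you store freq and then increment it by 1, which doesn't make much sense. If the word is already in the map, you want to get its current frequency, add 1 to it, and update the frequency in the map. If it isn't, put it in the map with a value of 1.

Tip: always print/log the full stack trace of an exception.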
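
To illustrate the tip: System.err.print(e) prints only the exception's class name and message, while e.printStackTrace() also prints every stack frame, so you can see which line threw. A minimal sketch, where the class name, the firstLine helper, and the file path are all made up for the example:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StackTraceSketch {
    // Hypothetical helper: returns the first line of the file, or null
    // if the file is empty or cannot be read.
    static String firstLine(String path) {
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            return br.readLine();
        } catch (IOException e) {
            // Prints the exception type, its message, and the full stack trace
            // to System.err. FileNotFoundException is an IOException, so a
            // missing file ends up here as well.
            e.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumed path that does not exist, to trigger the catch block.
        firstLine("docs/train/eng/does-not-exist.txt");
    }
}
```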

Answer 3

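Let me combine all the good answers here.

1) Split up your method so that each piece handles one thing: one method to read the files into strings[], one to process the strings[], and one that calls the first two.

2) When you split, think carefully about how you want to split. As @m0skit0 suggested, you should probably split on \\b for this problem.

3) As @jas suggested, you should first check whether your map already has the word. If it does, increment its count; if not, add the word to the map and set its count to 1.

4) To print the map the way you probably expect, take a look at the following: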

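// "test" stands in for the map returned by countWords.
// The type parameters are needed: with a raw Map, entrySet() yields Object
// and the loop below does not compile.
Map<String, Integer> test = new HashMap<>();

for (Map.Entry<String, Integer> entry : test.entrySet()) {
  System.out.println(entry.getKey() + " " + entry.getValue());
}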
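
Putting points 1) to 4) together, here is a rough sketch of how the pieces might fit. Treat it as one possible arrangement under some assumptions, not the definitive fix: the class name WordFrequency and the helpers readLines and addWords are names I made up, a List<String> stands in for strings[], and I split on runs of non-letters ("\\P{L}+") rather than on \\b so that spaces and punctuation don't end up as map keys. It also changes two things in the question's code that are easy to miss: the loop runs to k <= nFiles (with k < nFiles the last file is never read), and every line of each file is read instead of only the first.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequency {
    // Reads every line of one file.
    static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Adds the words found in the given lines to wordCount.
    static void addWords(List<String> lines, Map<String, Integer> wordCount) {
        for (String line : lines) {
            for (String word : line.split("\\P{L}+")) {
                if (word.isEmpty()) {
                    continue; // happens when a line starts with a non-letter
                }
                Integer current = wordCount.get(word);
                wordCount.put(word, current == null ? 1 : current + 1);
            }
        }
    }

    // Counts words across the files 1.txt .. nFiles.txt in the directory.
    static Map<String, Integer> countWords(String directory, int nFiles) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) {
            String path = directory + k + ".txt";
            try {
                addWords(readLines(path), wordCount);
            } catch (IOException e) {
                e.printStackTrace(); // full trace, then move on to the next file
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("docs/train/eng/", 20);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Keeping addWords separate from the file reading also makes it easy to test the counting on a few hand-written lines before pointing it at the 20 files.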

3 answers

Answer 1
3 2015-04-08 22:55:32

Answer 2
2 2015-04-08 22:54:09

Answer 3
2 accepted 2015-04-08 23:16:42
