Counting word frequencies in a .txt file in Java

Question

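I'm working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now I'm struggling with the method that counts how often each word occurs in a .txt file.

I have sets of English and French text files, numbered 1-20, in their respective folders. The method takes a directory (in this case "docs/train/eng/" or "docs/train/fre/") and the number of files the program should go through (there are 20 files in each folder). It then reads each file, splits it into words (I don't need to worry about capitalization or punctuation), and puts every word, together with the number of times it occurs in the file, into a HashMap (key = word, value = frequency).

This is the code I came up with for the method: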

public static HashMap<String, Integer> countWords(String directory, int nFiles) {
  // Declare the HashMap
  HashMap<String, Integer> wordCount = new HashMap();

  // this large 'for' loop will go through each file in the specified directory.
  for (int k = 1; k < nFiles; k++) {
    // Puts together the string that the FileReader will refer to.
    String learn = directory + k + ".txt";

    try {
      FileReader reader = new FileReader(learn);
      BufferedReader br = new BufferedReader(reader);
      // The BufferedReader reads the lines

      String line = br.readLine();

      // Split the line into a String array to loop through
      String[] words = line.split(" ");
      int freq = 0;

      // for loop goes through every word
      for (int i = 0; i < words.length; i++) {
        // Case if the HashMap already contains the key.
        // If so, just increments the value
        if (wordCount.containsKey(words[i])) {
          wordCount.put(words[i], freq++);
        }
        // Otherwise, puts the word into the HashMap
        else {
          wordCount.put(words[i], freq++);
        }
      }
      // Catching the file not found error
      // and any other errors
    }
    catch (FileNotFoundException fnfe) {
      System.err.println("File not found.");
    }
    catch (Exception e) {
      System.err.print(e);
    }
  }
  return wordCount;
}

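The code compiles. Unfortunately, when I asked it to print the word counts for all 20 files, the output was complete gibberish (although the words are definitely in there) and not at all what I need the method to do.

I would really appreciate it if someone could help me debug my code. I have been at this for ages, running test after test, and I'm ready to give up.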

Answer 1

I would have expected something more like this. Does it make sense?

if (wordCount.containsKey(words[i])) { 
  int n = wordCount.get(words[i]);    
  wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
  wordCount.put(words[i], 1);
}

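If the word is already in the HashMap, we want to get the current count, add 1 to it, and store the new count under the same word.

If the word is not in the HashMap yet, we simply put it in the map with a count of 1. The next time we see the same word, we increase the count to 2, and so on.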
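To see the two branches at work outside the file-reading code, here is a minimal, self-contained sketch. The class name `CountDemo` and the sample sentence are made up for illustration and are not part of the original code.

```java
import java.util.HashMap;
import java.util.Map;

class CountDemo {
    // Counts how often each word occurs in the given array.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            if (wordCount.containsKey(word)) {
                // Seen before: read the stored count and write back count + 1.
                int n = wordCount.get(word);
                wordCount.put(word, n + 1);
            } else {
                // First occurrence: start the count at 1.
                wordCount.put(word, 1);
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, split on single spaces.
        String[] words = "le chat et le chien et le lapin".split(" ");
        Map<String, Integer> counts = countWords(words);
        System.out.println(counts.get("le"));   // 3
        System.out.println(counts.get("et"));   // 2
        System.out.println(counts.get("chat")); // 1
    }
}
```

On Java 8 or later, the whole if/else can also be written as `wordCount.merge(word, 1, Integer::sum)`, which does the same lookup and update in one call.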

Answer 2

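If you split only on spaces, the words will include other symbols (brackets, punctuation, etc.). For example, "This phrase, contains... funny stuff" split on spaces gives "This" "phrase," "contains..." "funny" and "stuff".

You can avoid this by splitting on word boundaries (\\b) instead.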

line.split("\\b");
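One thing to check when trying this: a word boundary is a zero-width match, so splitting on it does not consume the characters between words, and the runs of spaces and punctuation come back as tokens of their own. A possible alternative, which is my own sketch and not taken from this answer, is to split on runs of non-letter characters with `\\P{L}+`. This also treats accented letters as part of a word, which should matter for the French files. The class name `SplitDemo` and the method `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitDemo {
    // Splits a line on runs of non-letter characters and drops empty tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split("\\P{L}+")) {
            // A line that starts with a separator yields an empty first token.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [This, phrase, contains, funny, stuff]
        System.out.println(tokenize("This phrase, contains... funny stuff"));
    }
}
```

Digits are treated as separators here; if numbers should count as words, the pattern would need to be widened accordingly.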

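By the way, your if and else branches are identical. You always just increment freq by 1, which doesn't make sense. If the word is already in the map, you want to get its current frequency, add 1 to it, and update the frequency in the map. If it isn't, put it in the map with a value of 1.

Tip: always print or log the full stack trace of exceptions.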
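As a rough illustration of the tip: `System.err.print(e)` shows only the exception's class and message, while `printStackTrace()` also shows where it was thrown. The sketch below uses a hypothetical helper `fullTrace` that renders the trace as a String (useful for a log file) and a made-up file name that is assumed not to exist.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

class TraceDemo {
    // Renders the full stack trace of an exception as a String.
    static String fullTrace(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        t.printStackTrace(writer);
        writer.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Hypothetical path, assumed to be missing so that opening it fails.
        try (FileReader reader = new FileReader("docs/train/eng/missing.txt")) {
            System.out.println(reader.read());
        } catch (IOException e) {
            // Prints the exception type, its message, and every stack frame.
            e.printStackTrace();
        }
    }
}
```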

Answer 3

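Let me combine all the good answers here.

1) Split up your method so that each piece handles one thing: one method to read the file into a String[], one to process the String[], and one to call the first two.

2) When you split, think carefully about how you want to split. As @m0skit0 suggests, you should probably split on \\b for this problem.

3) As @jas suggests, you should first check whether the map already contains the word. If it does, increment its count; if it doesn't, add the word to the map and set its count to 1.

4) To print the map the way you probably expect, take a look at the following: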

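// Typed map, so the entries need no casts
Map<String, Integer> test = new HashMap<>();

// Prints one "word count" pair per line
for (Map.Entry<String, Integer> entry : test.entrySet()) {
  System.out.println(entry.getKey() + " " + entry.getValue());
}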
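Putting the four points together could look roughly like the sketch below. It is one possible arrangement under some assumptions, not the definitive solution: the class and method names are made up, the files are assumed to be UTF-8, lines are split on runs of non-letters (`\\P{L}+`) rather than on `\\b`, and `Map.merge` replaces the explicit containsKey check. It also differs from the question's code in two places that look unintended there: the loop runs with `k <= nFiles` so that file 20 is included, and every line of each file is read instead of only the first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequency {
    // Step 1: read every line of one file (UTF-8 is assumed).
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    // Step 2: add the words found in the lines to the running counts.
    static void addCounts(List<String> lines, Map<String, Integer> wordCount) {
        for (String line : lines) {
            for (String word : line.split("\\P{L}+")) {
                if (word.isEmpty()) {
                    continue; // produced by empty lines or a leading separator
                }
                // Inserts 1 for a new word, otherwise adds 1 to the stored count.
                wordCount.merge(word, 1, Integer::sum);
            }
        }
    }

    // Step 3: visit 1.txt .. nFiles.txt in the directory and count all words.
    static Map<String, Integer> countWords(String directory, int nFiles) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) {
            Path file = Paths.get(directory, k + ".txt");
            try {
                addCounts(readLines(file), wordCount);
            } catch (IOException e) {
                e.printStackTrace(); // full trace, then continue with the next file
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("docs/train/eng/", 20);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

A HashMap has no defined iteration order, so the printed words appear in arbitrary order; copying the counts into a `TreeMap` would give alphabetical output.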

3 answers

Answer 1
3 2015-04-08 22:55:32

Answer 2
2 2015-04-08 22:54:09

Answer 3
2 accepted 2015-04-08 23:16:42
