
Counting frequency of words from a .txt file in Java

I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of the words that appear in a .txt file.

I have two sets of text files, English and French, in their respective folders, with the files numbered 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files the program should go through (there are 20 files in each folder). Then it reads each file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times it appeared in the files (Key = word, Value = frequency).

This is the code I came up with for the method:

public static HashMap<String, Integer> countWords(String directory, int nFiles) {
  // Declare the HashMap
  HashMap<String, Integer> wordCount = new HashMap();

  // this large 'for' loop will go through each file in the specified directory.
  for (int k = 1; k < nFiles; k++) {
    // Puts together the string that the FileReader will refer to.
    String learn = directory + k + ".txt";

    try {
      FileReader reader = new FileReader(learn);
      BufferedReader br = new BufferedReader(reader);
      // The BufferedReader reads the lines

      String line = br.readLine();

      // Split the line into a String array to loop through
      String[] words = line.split(" ");
      int freq = 0;

      // for loop goes through every word
      for (int i = 0; i < words.length; i++) {
        // Case if the HashMap already contains the key.
        // If so, just increments the value

        if (wordCount.containsKey(words[i])) {
          wordCount.put(words[i], freq++);
        }
        // Otherwise, puts the word into the HashMap
        else {
          wordCount.put(words[i], freq++);
        }
      }
      // Catching the file not found error
      // and any other errors
    }
    catch (FileNotFoundException fnfe) {
      System.err.println("File not found.");
    }
    catch (Exception e) {
      System.err.print(e);
    }
  }
  return wordCount;
}

The code compiles. Unfortunately, when I asked it to print the word counts for all 20 files, the output was complete gibberish (though the words are definitely there) and not at all what I need the method to do.

If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test, and I'm ready to give up.

I would have expected something more like this. Does it make sense?

if (wordCount.containsKey(words[i])) { 
  int n = wordCount.get(words[i]);    
  wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
  wordCount.put(words[i], 1);
}

If the word is already in the hashmap, we want to get the current count, add 1 to it, and store the new count under the same word in the hashmap.

If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, and so on.
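For illustration, here is a minimal, self-contained sketch of that counting logic. The class name CountDemo and the helper names count and countWithMerge are made up for this example and are not from the question. The second helper uses Map.merge (Java 8+), which should give the same result as the check-then-increment version in one call.

```java
import java.util.HashMap;
import java.util.Map;

class CountDemo {
    // Hypothetical helper: counts words with the containsKey/get/put pattern.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            if (wordCount.containsKey(word)) {
                int n = wordCount.get(word);   // current count
                wordCount.put(word, n + 1);    // store count + 1
            } else {
                wordCount.put(word, 1);        // first time we see this word
            }
        }
        return wordCount;
    }

    // Hypothetical helper: same counting, expressed with Map.merge.
    static Map<String, Integer> countWithMerge(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            // Inserts 1 if the key is absent, otherwise adds 1 to the old value.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String[] words = "the cat and the hat".split(" ");
        System.out.println(count(words).get("the"));          // prints 2
        System.out.println(countWithMerge(words).get("cat")); // prints 1
    }
}
```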

If you split by space only, then other signs (parentheses, punctuation marks, etc.) will be included in the words. For example, if you split "This phrase, contains... funny stuff" by space you get: "This", "phrase,", "contains...", "funny" and "stuff".

You can avoid this by splitting on a word boundary instead (the regex \b, written "\\b" in a Java string literal).

line.split("\\b");
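One caveat worth checking on your own input: splitting on a zero-width boundary keeps the separators themselves (the spaces, ", ", "... ") as array elements, so they would also end up as keys in the map unless you filter them out. A minimal sketch of one alternative is below. It assumes that a "word" is simply a run of Unicode letters, which is my assumption rather than something stated in the question, and the names TokenizeDemo and tokenize are hypothetical. It splits on runs of non-letters instead.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizeDemo {
    // Hypothetical helper: returns the runs of letters found in a line.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        // \P{L}+ matches one or more characters that are NOT Unicode letters,
        // so accented letters such as those in French text stay inside words.
        for (String token : line.split("\\P{L}+")) {
            // A line that starts with punctuation yields a leading empty string.
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("This phrase, contains... funny stuff"));
        // prints [This, phrase, contains, funny, stuff]
    }
}
```

With this definition an apostrophe also separates words, so a French elision is counted as two tokens. Whether that is acceptable depends on the assignment.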

Btw, your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.

And pro tip: always print/log the full stack trace for exceptions.
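As a small illustration of the difference: printing the exception object shows only its class and message, while printStackTrace() also shows the method and line where it was thrown. The sketch below is hypothetical (the names StackTraceDemo and fullTrace are made up). It provokes the NullPointerException you would get if readLine() returned null, as it does at the end of a file, and split were then called on the result.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class StackTraceDemo {
    // Hypothetical helper: returns the full stack trace as text, e.g. for a log file.
    static String fullTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    public static void main(String[] args) {
        try {
            String line = null;      // what BufferedReader.readLine() returns at end of file
            line.split(" ");         // throws NullPointerException
        } catch (Exception e) {
            System.err.println(e);   // exception class and message only
            e.printStackTrace();     // also the call stack with line numbers
        }
    }
}
```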

Let me combine all the good answers here.

1) Split up your methods so that each handles one thing: one to read the files into a String[], one to process the String[], and one to call the first two.

2) When you split, think carefully about how you want to split. As @m0skit0 suggested, you should likely split on "\\b" for this problem.

3) As @jas suggested, you should first check whether your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.

4) To print out the map in the way you likely expect, take a look at the below:

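// Type parameters are needed here: with a raw Map, the for-each loop does not compile.
Map<String, Integer> test = new HashMap<>();

for (Map.Entry<String, Integer> entry : test.entrySet()) {
  System.out.println(entry.getKey() + " " + entry.getValue());
}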
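Putting the four points together, a rough sketch could look like the following. It is not the assignment's reference solution. The class and method names (WordFrequency, readLines, addCounts) are hypothetical, the files are assumed to be UTF-8, and a word is assumed to be a run of letters. It also differs from the question's code in two places that seem unintended there: the loop uses k <= nFiles so that the last file is included, and every line of each file is read rather than only the first.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequency {
    // Step 1: read every line of one file (assumes UTF-8 encoded text).
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    // Step 2: add the words found in the given lines to the running counts.
    static void addCounts(List<String> lines, Map<String, Integer> wordCount) {
        for (String line : lines) {
            // Assumption: a word is a run of letters; split on everything else.
            for (String word : line.split("\\P{L}+")) {
                if (word.isEmpty()) {
                    continue; // leading separator produces an empty token
                }
                wordCount.merge(word, 1, Integer::sum);
            }
        }
    }

    // Step 3: go through 1.txt .. nFiles.txt in the directory and combine the counts.
    static Map<String, Integer> countWords(String directory, int nFiles) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) {
            Path file = Paths.get(directory, k + ".txt");
            try {
                addCounts(readLines(file), wordCount);
            } catch (IOException e) {
                e.printStackTrace(); // full stack trace, then continue with the next file
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        // Directory taken from the question; adjust to your own layout.
        Map<String, Integer> counts = countWords("docs/train/eng/", 20);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

A HashMap has no defined iteration order, so the printed words will not be sorted. If a stable order matters, a TreeMap would print them alphabetically.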
