
Read a .txt file and return a list of words with their frequency in the file

I have this so far, but it only prints the .txt file to the screen:

import java.io.*;

public class ReadFile {
    public static void main(String[] args) throws IOException {
        String Wordlist;
        int Frequency;

        File file = new File("file1.txt");
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        String line = null;

        while( (line = br.readLine()) != null) {
            String [] tokens = line.split("\\s+");
            System.out.println(line);
        }
    }
}

Can anyone help me so that it prints a word list with each word's frequency?

Does it have to be in Java? This does the job:

sed 's/[^A-Za-z]/\n/g' filename.txt | sort | uniq -c

Basically, turn every non-alphabetic character into a newline, sort the resulting list of items, and let uniq count the occurrences. Just discard the first line of output, which is the number of empty lines. This is fast to run, and even faster to code.

You can adjust the regular expression to taste, for example by including digits [A-Za-z0-9] or accented characters for other languages [A-Za-zàèìòù].
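If it does have to be Java, the same three steps (split on non-letters, sort, count runs of equal items) can be sketched roughly like this. This is only a sketch and not part of the command above: the class name SortUniqCount and the method countWords are made up for illustration, and the sort order is Java's case-sensitive String order, which may differ from what sort gives in your locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortUniqCount {
    // Returns one "count word" line per distinct word, in sorted order.
    public static List<String> countWords(String text) {
        // sed step: every run of non-letters separates two words
        String[] words = text.split("[^A-Za-z]+");
        // sort step: equal words end up next to each other
        Arrays.sort(words);

        // uniq -c step: measure the length of each run of equal words
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int j = i;
            while (j < words.length && words[j].equals(words[i])) {
                j++;
            }
            // a leading separator produces one empty string; skip it
            if (!words[i].isEmpty()) {
                result.add((j - i) + " " + words[i]);
            }
            i = j;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : countWords("the cat, the dog. The end")) {
            System.out.println(line);
        }
    }
}
```

Because the split pattern swallows whole runs of separators, there is no count of empty lines to discard here. Like the sed version, it treats "The" and "the" as different words.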

Do something like this. I'm assuming only commas or periods can occur in the file; otherwise you'll have to remove other punctuation characters as well. I'm using a TreeMap so the words in the map will be stored in their natural alphabetical order.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.TreeMap;

  public static TreeMap<String, Integer> generateFrequencyList()
    throws IOException {
    TreeMap<String, Integer> wordsFrequencyMap = new TreeMap<String, Integer>();
    String file = "/tmp/lorem.txt";
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
      String line;
      while ((line = br.readLine()) != null) {
        String[] tokens = line.split("\\s+");
        for (String token : tokens) {
          String word = removePunctuation(token).toLowerCase();
          if (word.isEmpty()) {
            continue; // token was only punctuation, or the line began with whitespace
          }
          if (!wordsFrequencyMap.containsKey(word)) {
            wordsFrequencyMap.put(word, 1);
          } else {
            int count = wordsFrequencyMap.get(word);
            wordsFrequencyMap.put(word, count + 1);
          }
        }
      }
    }
    return wordsFrequencyMap;
  }

  // Strips every character that is not an ASCII letter.
  private static String removePunctuation(String token) {
    return token.replaceAll("[^a-zA-Z]", "");
  }

A main method for testing is shown below. To get the percentages, you can count all the words by iterating through the map and adding up the values, then do a second pass to compute each percentage. By the way, if this is part of a larger piece of work, you could also take a look at the Apache Commons Math library for calculating frequency distributions. If you use its Frequency class, you can keep adding all the words to it and then get the descriptive statistics at the end.

  public static void main(String[] args) {
    try {
      int totalWords = 0;   
      TreeMap<String, Integer> freqMap = generateFrequencyList();
      for (String key : freqMap.keySet()) {
        totalWords += freqMap.get(key);
      }

      System.out.println("Word\tCount\tPercentage");
      for (String key : freqMap.keySet()) {
         System.out.println(key+"\t"+freqMap.get(key)+"\t"+((double)freqMap.get(key)*100.0/(double)totalWords));    
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

Create a HashMap:

HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

Iterate through the array of tokens for each line:

for(String word: tokens) {
  // Do stuff
}

Then, for each word, check whether it has already been read:

if(occurrences.containsKey(word))
    occurrences.put(word, occurrences.get(word)+1);
else
    occurrences.put(word, 1);

Full version:

String Wordlist;
int Frequency;

File file = new File("file1.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));

// Map values must be the wrapper type Integer; generics do not accept int
HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

String line = null;

while( (line = br.readLine()) != null){
    String [] tokens = line.split("\\s+");

    for(String word: tokens) {
        // containsKey checks the keys; HashMap has no contains method
        if(occurrences.containsKey(word))
            occurrences.put(word, occurrences.get(word)+1);
        else
            occurrences.put(word, 1);
    }
}

There might be a typo in it, as I haven't tested it, but this should do the job.
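On Java 8 or later, the containsKey/put pair can be collapsed into a single Map.merge call. A minimal sketch of that variant, assuming the lines have already been read into a List; the class name WordCounter and the method count are hypothetical names, not something from the code above:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCounter {
    // Counts whitespace-separated words across all lines.
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank line, or line starting with whitespace
                }
                // inserts 1 for a new word, otherwise adds 1 to the old count
                occurrences.merge(word, 1, Integer::sum);
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

To run it against a real file, something like Files.readAllLines(Paths.get("file1.txt")) from java.nio.file would supply the list. A HashMap prints in no particular order, so swap in a TreeMap if you want the words sorted.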
