Java: best data structure for document comparison?

Question

I am writing a program that compares multiple documents based on the words they have in common. I can tokenize all of the words and store them in an ArrayList, since it allows duplicates. However, I am not sure that is the best approach. I need to find the 50 most frequent words in the ArrayList, and I don't know how to do that. Is there a better data structure for this operation?

Answer 1

If you just want to compare occurrences, you can use a map such as a HashMap, a TreeMap, or any other Map implementation.

The key will be the word (String) and the value will be the number of occurrences (Integer). Go over your document and look up each word in the map. If it exists, get its current number of occurrences and increment it by one. If it doesn't, insert the word with an occurrence count of one. Here's a code snippet:

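    import java.util.HashMap;
    import java.util.Map;

    // `document` is the collection of tokenized words, e.g. the ArrayList<String> from the question.
    Map<String, Integer> occurenceMap = new HashMap<>();

    for (String word : document) {
        Integer wordOccurences = occurenceMap.get(word);
        if (wordOccurences == null) {
            // First time this word is seen: start its count at one.
            wordOccurences = Integer.valueOf(1);
        } else {
            // Word already seen: add one to its existing count.
            wordOccurences += 1;
        }
        occurenceMap.put(word, wordOccurences);
    }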
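
To get from the occurrence map to the 50 most frequent words the question asks about, one option is to sort the map's entries by count. The following is a minimal sketch rather than a definitive implementation: it assumes Java 8 or later, and the class name `WordFrequency` and the methods `countWords` and `topWords` are hypothetical names chosen for illustration. Breaking ties alphabetically is also an assumption; the question does not say how ties should be handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the name and its methods are illustrative only.
class WordFrequency {

    // Count how many times each word occurs in the tokenized document.
    static Map<String, Integer> countWords(List<String> document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : document) {
            // merge() stores 1 for a new word, otherwise adds 1 to the old count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Return up to n words, most frequent first.
    // Assumption: words with equal counts are ordered alphabetically.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        int limit = Math.min(n, entries.size());
        for (Map.Entry<String, Integer> entry : entries.subList(0, limit)) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```

Calling `WordFrequency.topWords(WordFrequency.countWords(document), 50)` would then give the 50 most frequent words of one document. Sorting all distinct words takes O(k log k) time for k distinct words, which is usually fine for document-sized inputs.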
