JAVA: Best data structure for document comparison?

I am writing a program that compares multiple documents based on the words they have in common. I am able to tokenize all of the words and store them in an ArrayList, since it allows duplicates. However, I am not sure that this is the best way of doing it. I need to find the top 50 most frequent words in the ArrayList, and I am not really sure how to do that. Is there a better data structure for this operation?

If you just want to compare occurrences, you can use a map such as a HashMap, a TreeMap, or any other Map implementation.

The key will be the word (String), and the value will be the number of occurrences (Integer). You'll go over your document and look up each word in the map. If it exists, get its current number of occurrences and increment it by one. If it doesn't, insert the word with an occurrence count of one. Here's a code snippet:

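    // Requires java.util.HashMap and java.util.Map.
    // `document` is assumed to be the collection of tokenized words,
    // e.g. the ArrayList<String> described in the question.
    Map<String, Integer> occurrenceMap = new HashMap<>();

    for (String word : document) {
        Integer wordOccurrences = occurrenceMap.get(word);
        if (wordOccurrences == null) {
            // First time this word is seen: start its count at one.
            wordOccurrences = 1;
        } else {
            // Word already present: increment its count.
            wordOccurrences += 1;
        }
        occurrenceMap.put(word, wordOccurrences);
    }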
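
The question also asks for the 50 most frequent words, which the counting loop alone does not give you. One possible approach, sketched below, is to sort the map's entries by count and take the first N. The class and method names (`WordFrequency`, `countWords`, `topWords`) are hypothetical and chosen only for illustration, and breaking ties alphabetically is an assumption, since the question does not say how ties should be handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names are illustrative, not from the original answer.
class WordFrequency {

    // Count how often each word occurs. Map.merge inserts 1 for a new word
    // and otherwise adds 1 to the existing count.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Return up to `limit` words, most frequent first.
    // Assumption: words with equal counts are ordered alphabetically.
    static List<String> topWords(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        int end = Math.min(limit, entries.size());
        for (Map.Entry<String, Integer> entry : entries.subList(0, end)) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```

Calling `WordFrequency.topWords(WordFrequency.countWords(document), 50)` would then return the top 50 words of one document. Sorting every entry is simple and usually fast enough; for a very large vocabulary, a `PriorityQueue` bounded to 50 entries might avoid sorting the whole map.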
