
Getting the top 5 most-used words from a chunk of text in Java

I'm trying to get the top 5 most-used words from a chunk of text. I have built up a map that stores, for each word, how many times it has been used.

Map<String,Integer> wordHits = new HashMap<String,Integer>();

for (Status status3 : statuses) {

    String mdry = status3.getText();
    String[] statusSplitOnSpace = mdry.split(" ");

    // Visit each word of the status exactly once
    for (String str : statusSplitOnSpace) {
        if (doesListContainWord(str)) {
            incrementKeyofWordInList(str);
        } else if (doesWordCountAsAWord(str)) {
            addNewWordToList(str);
        }
    }
}

// Print every word with its count (assumes the helpers above update wordHits)
for (Map.Entry<String, Integer> entry : wordHits.entrySet()) {
    System.out.println("Word (" + entry.getKey() + ") was found " + entry.getValue() + " times.");
}

Assuming your words are stored in an array, I would first transfer them to a Map from word to count. I believe that is what you were trying to do, but it is hard to tell from your variable names. After that, you can write a custom Comparator and use it to sort the Map by count. You can do something like this:

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class Solution {
    public static void main(String[] args) {
        String[] words = {"word1", "word1", "word2", "word3", "word4", "word5", "word5"};
        Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : words) { // Transfer your words to a map
            if (wordCounts.containsKey(word)) { // If word is already in map, increase value
                wordCounts.put(word, wordCounts.get(word) + 1);
            } else { // If word is not in map, add it to the map
                wordCounts.put(word, 1);
            }
        }
        // Sorts based off of counts, highest count first
        TreeMap<String, Integer> sortedWordCounts = new TreeMap<>(new ValueComparator(wordCounts));
        sortedWordCounts.putAll(wordCounts); // Add to new map
        int printed = 0;
        for (String key : sortedWordCounts.keySet()) { // Keys come out in comparator order
            if (printed == 5) break;
            System.out.println(key); // This prints out the top 5 keys.
            printed++;
        }
    }
}

class ValueComparator implements Comparator<String> {
    private final Map<String, Integer> map;

    public ValueComparator(Map<String, Integer> map) {
        this.map = map;
    }

    @Override
    public int compare(String o1, String o2) {
        // Higher count first. Ties are broken by the word itself, so compare()
        // returns 0 only for equal keys and lookups in the TreeMap keep working.
        int byCount = Integer.compare(map.get(o2), map.get(o1));
        return byCount != 0 ? byCount : o1.compareTo(o2);
    }
}

Output

word1
word5
word2
word3
word4

A TreeMap is a Map that keeps its entries sorted by key, using the Comparator you construct it with. If you do not give it a Comparator, it sorts by the keys' natural ordering, and we do not want that. We want to order by the values, so you have to write your own Comparator that looks up each key's count.
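To make the difference between key ordering and value ordering concrete, here is a minimal sketch. It does not use the TreeMap approach; instead it copies the entries into a list and sorts that list by count. The class name TopByValue and the helper topN are made up for this illustration, and the sample words are arbitrary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TopByValue {
    // Hypothetical helper: returns up to n keys with the highest counts.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties fall back to alphabetical order of the word.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("pear", 3);
        counts.put("apple", 1);
        counts.put("fig", 2);

        // A TreeMap without a Comparator orders by key, not by count.
        System.out.println(new TreeMap<>(counts).keySet()); // [apple, fig, pear]
        // Sorting the entries by value gives the most-used words first.
        System.out.println(topN(counts, 2));                // [pear, fig]
    }
}
```

Sorting a list of entries avoids having a comparator that depends on a second map, at the cost of re-sorting whenever the counts change.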

Here's a more novice-level "manual" approach. I didn't test it, but it should be close...

    // Get sorted Lists of words and counts from the source Map
    // (wordCountMap is the Map<String, Integer> of word counts built earlier)
    List<String> sortedWordsList = new ArrayList<String>();
    List<Integer> sortedCountsList = new ArrayList<Integer>();              
    for( String word : wordCountMap.keySet() ) 
    {
        Integer wordCount = wordCountMap.get(word);

        int insertIndex=0;
        for( int i=0; i != sortedCountsList.size(); ++i )
        {
            if( wordCount > sortedCountsList.get(i) ) break;
            ++insertIndex;  
        }     
        sortedWordsList.add( insertIndex, word );
        sortedCountsList.add( insertIndex, wordCount );
    }

    // Move top 5 words into a new List
    final int TOP_WORDS_TO_FIND_COUNT = 5;        
    List<String> topWordsList = new ArrayList<String>();
    for( int i=0; i != sortedWordsList.size(); ++i )
    {
        topWordsList.add( i, sortedWordsList.get(i) );
        if( i == TOP_WORDS_TO_FIND_COUNT-1 ) break;
    }     

    // Move top 5 counts into a new List
    List<Integer> topCountsList = new ArrayList<Integer>();
    for( int i=0; i != sortedCountsList.size(); ++i )
    {
        topCountsList.add( i, sortedCountsList.get(i) );
        if( i == TOP_WORDS_TO_FIND_COUNT-1 ) break;
    }     
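As a possible variation on the manual approach, the sorted lists can be trimmed to the top entries while they are being built, so the two copy loops are not needed. This is only a sketch: the class name ManualTopWords, the helper topWords and the sample counts are made up for the example, and a LinkedHashMap is used only so the iteration order is predictable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ManualTopWords {
    // Hypothetical helper: manual insertion into lists kept sorted by count,
    // never holding more than `limit` entries.
    static List<String> topWords(Map<String, Integer> wordCountMap, int limit) {
        List<String> words = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : wordCountMap.entrySet()) {
            int insertIndex = 0;
            // Walk past every stored count that is >= the new one.
            while (insertIndex < counts.size() && entry.getValue() <= counts.get(insertIndex)) {
                insertIndex++;
            }
            if (insertIndex >= limit) {
                continue; // would land outside the top `limit`
            }
            words.add(insertIndex, entry.getKey());
            counts.add(insertIndex, entry.getValue());
            if (words.size() > limit) { // drop the entry pushed out of the top
                words.remove(words.size() - 1);
                counts.remove(counts.size() - 1);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCountMap = new LinkedHashMap<>();
        wordCountMap.put("the", 4);
        wordCountMap.put("cat", 1);
        wordCountMap.put("sat", 2);
        wordCountMap.put("on", 3);
        wordCountMap.put("mat", 1);
        wordCountMap.put("a", 5);
        wordCountMap.put("dog", 2);
        System.out.println(topWords(wordCountMap, 3)); // [a, the, on]
    }
}
```

Words with equal counts keep the order in which the map yields them, so with a HashMap the order among ties is not guaranteed.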
