
Finding the number of times each word in a HashSet occurs in a text document

I'm implementing a Naive Bayes text classification algorithm in Java.

So far I have declared a HashSet called Vocabulary, which stores all the unique words from a given text file (the test file).

One of the steps in the algorithm is to concatenate all the members of the test files into a single text file. This turns out to be a fairly big file with the words from each file.

Now I have to count the number of occurrences of each word of the Vocabulary in the concatenated text file. My first guess is to keep some sort of array structure holding the frequency of each word, but then I would have far too many entries.

Could anyone please give me better suggestions?

Use a dictionary (HashMap) where the words are the keys and the values are the number of occurrences. If the HashSet fits into memory, HashMap should as well.
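A minimal sketch of this approach, not a definitive implementation. The class name `WordCounter`, the method `countOccurrences`, and the tokenization (lowercasing, then splitting on runs of non-word characters) are assumptions made for illustration; the sketch also assumes the words in the vocabulary are already lowercase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; the name and tokenization rule are illustrative.
class WordCounter {

    // Returns a map from each vocabulary word to the number of times it
    // appears in the given lines of text. Words outside the vocabulary
    // are ignored, so the map never grows beyond the vocabulary size.
    static Map<String, Integer> countOccurrences(Set<String> vocabulary,
                                                 Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        // Start every vocabulary word at zero so unseen words are reported too.
        for (String word : vocabulary) {
            counts.put(word, 0);
        }
        for (String line : lines) {
            // Assumption: tokens are separated by runs of non-word characters.
            for (String token : line.toLowerCase().split("\\W+")) {
                if (vocabulary.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

For the large concatenated file, the lines can be read one at a time (for example with a `BufferedReader`) and passed through the same loop, so only the map has to stay in memory, not the file. A call such as `WordCounter.countOccurrences(vocabulary, lines)` then gives the per-word frequencies that the Naive Bayes step needs.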

You could also try using a trie, where the leaf nodes store the frequency of each word.
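A rough sketch of the trie idea, under stated assumptions: the class `WordTrie` and its methods `add` and `frequency` are hypothetical names, and children are kept in a `HashMap` per node for simplicity. The count is stored on the node where a word ends, which is not always a leaf (for example "car" ends on an inner node once "cart" has been added).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie that records how many times each word was added.
class WordTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // occurrences of the word that ends at this node
    }

    private final Node root = new Node();

    // Adds one occurrence of the word, creating nodes along its path as needed.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;
    }

    // Returns how many times the word was added, or 0 if it never was.
    int frequency(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }
}
```

To use it, call `add` for every token read from the concatenated file, then call `frequency` for each word in the vocabulary. Words that share prefixes share nodes, which may save memory for some vocabularies, though a `HashMap` is usually the simpler choice when the vocabulary already fits in memory.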
